Clustered Indexes

In this article I explain what is meant by a clustered index and compare it with a non-clustered index.


A clustered index on a table determines the order in which the rows of the table are stored on disk. If the table has a clustered index, the rows of the table are stored in the same order as that of the clustered index. As an illustration, suppose we have a Customer table that contains the following columns:

  • customerID
  • firstName
  • lastName

where customerID is the primary key of the table. If we define the clustered index on customerID, the rows of the table will be stored in sorted order by customerID. This means that rows with customerID=1000, customerID=1001, and so on will be adjacent to each other on disk.
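This clustering behaviour can be seen in miniature with Python's built-in sqlite3 module (a sketch; the Customer data is made up for illustration). SQLite stores a WITHOUT ROWID table as a B-tree clustered on its primary key, which is analogous to a clustered index on customerID:

```python
import sqlite3

# A WITHOUT ROWID table is stored as a B-tree clustered on its
# primary key, analogous to a clustered index on customerID.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        customerID INTEGER PRIMARY KEY,
        firstName  TEXT,
        lastName   TEXT
    ) WITHOUT ROWID
""")

# Insert rows out of key order; the clustered B-tree keeps them sorted.
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [(1001, "Jane", "Doe"), (1000, "John", "Smith"), (1002, "Mary", "Major")],
)

# A range query locates the first key, then walks adjacent leaf entries.
result = conn.execute(
    "SELECT customerID FROM Customer "
    "WHERE customerID BETWEEN 1000 AND 1001 ORDER BY customerID"
).fetchall()
print(result)  # [(1000,), (1001,)]
```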

Advantages and Disadvantages

The advantages of having a clustered index include the following:

  1. Range queries involving the clustered index key are faster: once the row with the first key value is located, the remaining rows are physically stored next to it, so no further searching is needed.
  2. The leaf nodes of the B-tree that makes up a clustered index contain the actual data pages, whereas the leaf nodes of a non-clustered index contain pointers to rows on data pages. A clustered index therefore has one less level of indirection, which improves performance.

The disadvantages of a clustered index include:

  1. Updates to columns used in the clustered index incur a performance hit, since rows may have to be rearranged to keep the table sorted in line with the index. In light of this, it is recommended that a clustered index be created on a primary or foreign key, since these are less prone to updates.

Comparison of clustered vs non-clustered indexes

  • Returning to our Customer table example: with a clustered index on customerID, the leaf nodes of the clustered index contain the actual row data (data pages) for a given customerID, while for a non-clustered index the leaf nodes store the value of customerID together with a pointer to the actual row.
  • There can be only one clustered index per table, but multiple non-clustered indexes (up to 249 in the case of Sybase). This is because the rows of the table are physically ordered according to the clustered index, and there is only one way of ordering them.
  • A clustered index determines the order in which the rows of the table are stored on disk; a non-clustered index does not.
  • A clustered index can be a performance hit for updates to an indexed column, since the rows may have to be re-ordered to maintain the sort order. Range queries and queries involving indexed foreign keys tend to perform better with a clustered index than with a non-clustered one.


For a more in-depth look at clustered indexes, see Effective Clustered Indexes, a very good article by Michelle Ufford.

Removing duplicates from a table with no primary keys

Consider the following table:

mysql> SELECT * FROM Philosopher;
+---------------+-------------+-----------+
| philosopherID | firstName   | lastName  |
+---------------+-------------+-----------+
|          1234 | John        | Locke     |
|          1234 | John        | Locke     |
|          2345 | Rene        | Descartes |
|          2347 | John Stuart | Mill      |
|          1562 | Emmanuel    | Kant      |
|          1562 | Emmanuel    | Kant      |
|          1671 | Baruch      | Spinoza   |
|          1562 | Emmanuel    | Kant      |
|          1761 | Jean-Paul   | Sartre    |
+---------------+-------------+-----------+
9 rows in set (0.00 sec)

Come up with a strategy to first identify and then remove the duplicate rows.

What SQL queries would you use?


1. Identify duplicate rows:

SELECT philosopherID, firstName, lastName, COUNT(*)
FROM Philosopher
GROUP BY philosopherID, firstName, lastName
HAVING COUNT(*) > 1;
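The GROUP BY / HAVING idea can be tried end-to-end with Python's built-in sqlite3 module (a sketch; the table is rebuilt in memory with a subset of the rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Philosopher (philosopherID INT, firstName TEXT, lastName TEXT)"
)
conn.executemany(
    "INSERT INTO Philosopher VALUES (?, ?, ?)",
    [
        (1234, "John", "Locke"),
        (1234, "John", "Locke"),
        (2345, "Rene", "Descartes"),
        (1562, "Emmanuel", "Kant"),
        (1562, "Emmanuel", "Kant"),
        (1562, "Emmanuel", "Kant"),
    ],
)

# Rows that occur more than once, together with their counts.
dups = conn.execute(
    "SELECT philosopherID, firstName, lastName, COUNT(*) "
    "FROM Philosopher "
    "GROUP BY philosopherID, firstName, lastName "
    "HAVING COUNT(*) > 1 "
    "ORDER BY philosopherID"
).fetchall()
print(dups)  # [(1234, 'John', 'Locke', 2), (1562, 'Emmanuel', 'Kant', 3)]
```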

2. Remove duplicate rows

i. Create a temporary table (with a rowId column) and copy the data over:

CREATE TABLE PhilosopherDups (
    rowId INT AUTO_INCREMENT PRIMARY KEY,
    philosopherID INT,
    firstName VARCHAR(50),
    lastName VARCHAR(50)
);

INSERT INTO PhilosopherDups (philosopherID, firstName, lastName)
SELECT philosopherID, firstName, lastName FROM Philosopher;

ii. Truncate the original table and copy the rows back using SELECT DISTINCT:

TRUNCATE TABLE Philosopher;

INSERT INTO Philosopher(philosopherID, firstName, lastName)
SELECT DISTINCT philosopherID, firstName, lastName
FROM PhilosopherDups;

iii. Another option, instead of DISTINCT, is to use the rowId in the temp table to delete the duplicate rows, then truncate the original table and copy the data back. (The derived table works around MySQL's restriction on selecting from the same table that is being deleted from.)

DELETE FROM PhilosopherDups
WHERE rowId NOT IN (
    SELECT keepId FROM (
        SELECT MAX(rowId) AS keepId
        FROM PhilosopherDups
        GROUP BY philosopherID, firstName, lastName
    ) AS keepers
);

TRUNCATE TABLE Philosopher;

INSERT INTO Philosopher(philosopherID, firstName, lastName)
SELECT philosopherID, firstName, lastName
FROM PhilosopherDups;
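As a sanity check, the rowId-based strategy can be sketched with Python's built-in sqlite3 module, using SQLite's implicit rowid in place of the MySQL rowId column (SQLite, unlike MySQL, allows the self-referencing subquery directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PhilosopherDups (philosopherID INT, firstName TEXT, lastName TEXT)"
)
conn.executemany(
    "INSERT INTO PhilosopherDups VALUES (?, ?, ?)",
    [
        (1234, "John", "Locke"),
        (1234, "John", "Locke"),
        (1562, "Emmanuel", "Kant"),
        (1562, "Emmanuel", "Kant"),
        (1671, "Baruch", "Spinoza"),
    ],
)

# Keep only the highest rowid per (philosopherID, firstName, lastName) group.
conn.execute(
    "DELETE FROM PhilosopherDups WHERE rowid NOT IN ("
    "  SELECT MAX(rowid) FROM PhilosopherDups"
    "  GROUP BY philosopherID, firstName, lastName)"
)

remaining = conn.execute(
    "SELECT philosopherID FROM PhilosopherDups ORDER BY philosopherID"
).fetchall()
print(remaining)  # [(1234,), (1562,), (1671,)]
```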

Pass-by-value vs Pass-by-reference in Java and C++

In this article I illustrate what it means to pass-by-value as opposed to pass-by-reference with a focus on Java vs C++.

The question often asked is this: Is Java pass-by-reference?
A common and often erroneous answer is: Java is pass-by-reference for objects and pass-by-value for primitives.
This is WRONG. To illustrate why, let me refer you to this quote by the father
of Java himself, James Gosling:

Some people will say incorrectly that objects are passed “by reference.” In programming language design, the term pass by reference properly means that when an argument is passed to a function, the invoked function gets a reference to the original value, not a copy of its value. If the function modifies its parameter, the value in the calling code will be changed because the argument and parameter use the same slot in memory…. The Java programming language does not pass objects by reference; it passes object references by value. Because two copies of the same reference refer to the same actual object, changes made through one reference variable are visible through the other. There is exactly one parameter passing mode — pass by value — and that helps keep things simple.

— James Gosling, et al., The Java Programming Language, 4th Edition

The above clearly states that Java passes object references by value: when a reference is passed, a copy of that reference (which is an address) is passed. Since the copy of the reference and the original reference refer to the same object, a call to a method that modifies the object does modify that object, hence the line "Because two copies of the same reference refer to the same actual object, changes made through one reference variable are visible through the other".
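As an aside, Python (used here purely for illustration) has the same calling convention as Java: references are passed by value, so rebinding a parameter has no effect on the caller, while mutating the referenced object does:

```python
def rebind(param):
    # Rebinds the local copy of the reference only; the caller is unaffected.
    param = [9, 9]

def mutate(param):
    # Modifies the shared object itself, which the caller can observe.
    param.append(3)

nums = [1, 2]
rebind(nums)
print(nums)  # [1, 2] -- the caller's reference is unchanged
mutate(nums)
print(nums)  # [1, 2, 3] -- the shared object was modified
```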

I will now illustrate what pass-by-reference means, via a clear example in C++.

Let us create the following files in an appropriate directory:

PassByReference.hpp:

#ifndef PassByReference_hpp
#define PassByReference_hpp

void swapIntByRef(int& iParam, int& jParam);
void swapIntByVal(int iParam, int jParam);

#endif

PassByReference.cpp:

#include <iostream>
#include "PassByReference.hpp"
using namespace std;

int main()
{
    int i=1000;
    int j=2300;

    cout << "Illustration of Pass By Reference:\n";
    cout << "Before: i=" << i << " j=" << j;
    cout << "\n";
    swapIntByRef(i, j);
    cout << "After: i=" << i << " j=" << j;
    cout << "\n";

    cout << "\nIllustration of Pass By Value:\n";
    i=1100;
    j=2500;
    cout << "Before: i=" << i << " j=" << j;
    cout << "\n";
    swapIntByVal(i, j);
    cout << "After: i=" << i << " j=" << j;
    cout << "\n";

    return 0;
}

void swapIntByRef(int& iParam, int& jParam)
{
    int temp(iParam);
    iParam = jParam;
    jParam = temp;
}

void swapIntByVal(int iParam, int jParam)
{
    int temp(iParam);
    iParam = jParam;
    jParam = temp;
}

We now compile and run the code (assuming you have the g++ compiler):

g++ -o PassByReference PassByReference.cpp
./PassByReference
Illustration of Pass By Reference:
Before: i=1000 j=2300
After: i=2300 j=1000

Illustration of Pass By Value:
Before: i=1100 j=2500
After: i=1100 j=2500

The results above illustrate the difference between pass-by-reference and pass-by-value, at least from the C++ point of view.
By using the reference operator &, when i and j are passed to the swapIntByRef function, the actual parameter values are modified inside the function, so that when control returns to the calling main() function and the values of i and j are printed out, they have been swapped.

In the latter case of pass-by-value, copies of i and j are passed, not references via the & operator.
The result of this is that even though an attempt is made to swap the values in the swapIntByVal function, the original actual parameter values remain unchanged, and this is what we see in the result.

The latter case is what prevails in Java in all cases, even for objects.

Here is an illustration in Java for both primitives and object references:

Create the file PassByValueDemo.java with the following contents:

public class PassByValueDemo {

    public static void main(String[] args) {
        int i=1000;
        int j=2300;
        System.out.println("Primitives Case");
        System.out.println(" Before: i=" + i + " j=" + j);
        swapInt(i, j);
        System.out.println(" After: i=" + i + " j=" + j + "\n");

        System.out.println("Wrapper Case");
        Integer iw=1000;
        Integer jw=2300;
        System.out.println(" Before: iw=" + iw + " jw=" + jw);
        swapInteger(iw, jw);
        System.out.println(" After: iw=" + iw + " jw=" + jw);
    }

    static void swapInt(int iParam, int jParam) {
        int temp=jParam;
        jParam=iParam;
        iParam=temp;
        System.out.println(" iParam=" + iParam + " jParam=" + jParam);
    }

    static void swapInteger(Integer iParam, Integer jParam) {
        Integer temp=jParam;
        jParam=iParam;
        iParam=temp;
        System.out.println(" iParam=" + iParam + " jParam=" + jParam);
    }
}

We now compile and run the code:

javac PassByValueDemo.java
java PassByValueDemo

which produces:

Primitives Case
Before: i=1000 j=2300
iParam=2300 jParam=1000
After: i=1000 j=2300

Wrapper Case
Before: iw=1000 jw=2300
iParam=2300 jParam=1000
After: iw=1000 jw=2300

Thus we can see that in both the primitive and wrapper cases, the values of the actual parameters (i, j and iw, jw) remain unchanged in the calling routine main. There is no way in Java to achieve the effect we observed in the C++ function swapIntByRef, where the original actual parameters are changed. The underlying object that a reference refers to can be changed via a call to a modifying method on that object, but the reference parameter is always a copy of the original actual parameter.


  • C++ supports both pass-by-value and pass-by-reference (the latter via its & operator).

  • Java supports pass-by-value ONLY. What is erroneously thought of as pass-by-reference is really pass-by-value of an object reference.

Future of 21st Century Databases meetup

Some takeaways from the Future of 21st Century Databases meetup, hosted by AppNexus in NYC, featuring a roundtable of NoSQL database heavyweights:

  • Eliot Horowitz, CTO and Co-Founder, 10gen / MongoDB
  • Barry Morris, Founder and CEO, NuoDB
  • Bob Wiederhold, President and CEO, Couchbase

NuoDB
  • SQL interface but NoSQL underneath
  • Doesn't have a document model as yet
  • The "how to elastically scale SQL" problem has been solved
  • NuoDB isn't open source yet
  • Largest known installation: 400 nodes with sharding

Couchbase
  • Many travel sites use Couchbase; e.g. Orbitz migrated its cache from Oracle Coherence to Couchbase
  • Couchbase doesn't consider NuoDB a competitor, since NuoDB isn't really a document database
  • Couchbase DevDays: ways to build up skills
  • Largest known installation: 80 nodes with no application sharding
  • SQL databases will still be around, but the growth area is NoSQL

MongoDB
  • Mongo has more document features than Couchbase
  • Largest known installation: 100 nodes with no application sharding

Installing Hortonworks Hadoop on Amazon EC2

After viewing this YouTube video and following the article How to Hadoop, I was able to successfully install Hortonworks Hadoop on a 4-node cluster on Amazon EC2. However, I needed to make the following additions to make it work properly:

1. Make sure that you install postgresql-8.4, not postgresql-9.1

If you did yum install postgresql, it may have installed postgresql-9.1.
If that is the case, you need to erase it first:

yum erase postgresql

Then download and install the 8.4 version:

curl -O

yum install postgresql84
yum install postgresql84-server

2. Edit the file /usr/sbin/ as follows, changing:

PG_HBA_DIR = "/var/lib/pgsql/data/"

to:

PG_HBA_DIR = "/var/lib/pgsql/8.4/data/"

3. Make sure that the bindir for postgresql is added to the PATH

export PATH=$PATH:/usr/pgsql-8.4/bin

4. Make sure that the "ambari-server" user is created in PostgreSQL.

Otherwise, you will obtain the following error:

internal exception: org.postgresql.util.psqlexception: fatal: ident authentication failed for user "ambari-server"

Also, edit the /var/lib/pgsql/8.4/data/pg_hba.conf file
and add the following line:

host    all         all          md5

MapReduce and Hadoop – an Introduction

The Data Problem

The advent of the Web 2.0 age has enabled the collection of massive amounts of data: web server logs, GPS tracking data, sensor data, Twitter feeds, mobile phone logs, and so on.

The nature of the data problem we face is not the collection of data, but rather how to process such huge amounts of it (in petabytes), often on a daily basis, so as to obtain meaningful intelligence on which to base decisions. This data processing problem has coincided with the gradual erosion of Moore's law: the limits of single-processor performance are gradually being reached, while at the same time the cost of hardware keeps falling.

Enter MapReduce

MapReduce is a programming paradigm used to process large datasets by parallelizing and distributing the work among independent processes or tasks. It is derived from a functional programming model. MapReduce was proposed as an algorithm to help address the data processing problem by making use of the latest trends in hardware computing: cheap hardware and distributed computing.
The basic idea is to have cheap commodity machines arranged in clusters running independent tasks among which large data sets could be distributed and processed.

The advantages of MapReduce are as follows.
It provides:

  • Clean abstraction for developers
  • Fault tolerance and recovery
  • Monitoring and Status updates
  • Automatic parallelization and Distribution
  • Input-Output Scheduling

Basic Idea

MapReduce consists of 2 main functions: map and reduce.
A list of key-value pairs is provided to the map function, which applies a function f to each element of the list and returns a list of intermediate key-value pairs.
The reduce function receives as input the list of intermediate key-value pairs produced by map and combines them together to produce the final output.
All values corresponding to the same key are processed at the same reducer node.
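The map and reduce steps just described can be sketched in miniature with the classic word-count example in Python (a toy, single-process sketch; a real MapReduce framework distributes the map tasks and routes each intermediate key to a reducer node):

```python
from collections import defaultdict

def map_phase(documents):
    """Emit an intermediate (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, so all values for a key reach one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values for each key into the final output."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data is data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```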

Applications of Map Reduce

  • Inverted Index Construction
  • Document Clustering
  • Web link-graph traversal
  • Statistical Machine Translation
  • Machine Learning


Hadoop

Hadoop provides an open-source implementation of MapReduce.
It is a highly scalable distributed computing framework that enables distributed processing of huge data sets across a cluster of machines via MapReduce.
It provides failure detection and failover at the application layer, as well as high availability.

Hadoop Environment Setup

The key components of a Hadoop setup are as follows:

  • NameNode: manages the metadata of the cluster and the DataNodes
  • DataNode: provides data storage services for the shared file system
  • Secondary NameNode: backup for the NameNode
  • JobTracker: manages jobs and resources in the cluster
  • TaskTracker: manages tasks
  • Balancer: does cluster balancing

Data Science Reading List

Based on a discussion in the LinkedIn Data Science group and my own experience, here are some articles and books that should be included in a good Data Science reading list for programmers and developers, many of which are available on the Safari O'Reilly website:


Big Data and Hadoop

Data Mining, Analysis and Statistics

Machine Learning

Data Visualization

Data Science, Big Data and MapReduce

To start one’s journey into the world of Data Science and Analytics, it behooves one to gain an understanding of the most touted terms in this field. The paper Data Science and Distributed Intelligence provides a good explanation of the often used terms Big Data, Data Science and Map Reduce. Here is my synopsis of these terms based on a reading of this paper:

Big Data refers to the proliferation of large amounts of data, both in huge databases and in the very high-rate streams that produce it. Examples are: web server logs due to user clicks, cell phone logs, Twitter feeds, Facebook comments, etc. An interesting fact is that 90% of Big Data was produced in the last 2 years alone.

Data Science is a term that describes the process of gathering data, analyzing it, and obtaining information from it to produce a "data product". Such data may involve large unstructured data sets; hence the term 'Data Science' is often used in conjunction with Big Data. Since the processing techniques for Big Data often involve the functional programming paradigms map and reduce, this has given rise to the MapReduce algorithm, popularized by Google.

MapReduce is an algorithm that helps mitigate the problem of processing Big Data and large data sets by breaking the processing down into smaller, more manageable chunks.
Hadoop is the popular open-source implementation of MapReduce.