Breadth-First Search


Given a graph G=(V,E) and a source vertex s, BFS systematically explores the edges of E to disover every vertex that is reachable from s.
It computes the distance (smallest no. of edges) from s to each reachable vertex. It produces a BFS tree with root s that contains all reachable vertices.
For any vertex v reachable from s, the path in the BFS tree from s to v corr. to a shortest path from s to v in G, i.e. path containing the smallest number of edges. BFS discovers all vertices at distance k from s before discovering any vertices at distance k+1.


The pseudocode below assumes that the input graph G = (V,E) is represented using adjacency lists. It maintains several additional data fields with each vertex in the graph.
The color of each vertex u in V is stored in color[u], and the predecessor of u is stored in prev[u]. If u has no predecessor, then prev[u] = NULL. The distance from the source s to vertex u is stored in dist[u] . The algorithm uses a FIFO queue Q to manage the set of gray vertices.


1   for each vertex u in V[G] - {s}
2     do color[u] <-- WHITE
3         dist[u] <-- INF
4         prev[u] <-- NULL
5   color[s] <-- GRAY
6   dist[s] <-- 0
7   prev[s] <-- NULL
8   Q <-- {}
9   ENQUEUE(Q, s)
10 while Q!={}
11    do u <-- DEQUEUE(Q)
12      foreach v in Adj[u]
13         do if color[v]=WHITE
14              then color[v] <-- GRAY
15                  dist[v] <-- d[u] +1
16                  prev[v] <-- u
17                  ENQUEUE(Q,v)
18      color[u] <-- BLACK


Total running time is: O(V+E). BFS runs linear in the size of the adjacency list representation of the graph G.

Source code:

Reference :  Introduction to Algorithms by Corman, Leiserson & Rivest (MIT Press)

Binary Search Trees


A binary search tree is a binary tree whose keys satisfy the binary search tree property:
Let x be a node in the BST.
If y is a node in the left subtree of x, then key[y] <= key[x].
If y is a node in the right subtree of x then key[y] >= key[x].

Basic operations on a binary search tree (BST) take time proportional to the height of the tree. For a complete binary tree with n nodes, such operations run in
O(lg n) worst-case time.
If the tree is a linear chain of n nodes, the same operations take O(n) worst case time. The expected height of a randomly built BST is O(lg n), so that basic dynamic set operations on such a tree take θ(lg n) time on average. There are variations of BSTs whose worst-case performance can be guaranteed to be good – red-black and B-trees are 2 such examples.

Basic operations on BSTs include:

The BST property allows us to print out all the keys in 3 different ways:

  1. InOrder Tree Walk – the key of the root of a subtree is printed between the values in its left subtree and those in its right subtree – LPR
  2. PreOrder Tree walk – root is printed before values in either subtree
  3. PostOrder Treee walk – root is printed after values in its subtrees.

It takes θ(n) time to walk an n -node binary tree.

Querying a binary search tree


We wishto search for a node with a given key in a BST. Given a pointer to the root of the tree and a key k, TREE_SEARCH returns a pointer to a node with key k if one exists, else it returns NULL. The search begins at the root and moves downward in the tree. For each node x it encounters, it compares the key k with key[x]. if the 2 keys are equal, the search terminates. If k < key[x] the search continues in the left subtree, else it continues in the right subtree. The running time of TREE_SEARCH is O(h), where h is the height of the tree.

1 if x==NULL or k=key[x]
2    then return x
3 if k < key[x]
4    then return TREE_SEARCH(left[x],k)
5    else return TREE_SEARCH(right[x],k)

1 while x!=NULL && k != key[x]
2   do if k < key[x]
3      then x <-- left[x]
4      else x <-- right[x]
5 return x

Maximum and Minimum

A minimum can always be found by following left child pointers down from the rroot until a NULL is reached :

1 while left[x] != NULL
2  do x <--- left[x]
3 return x

The pseudo-code for TREE_MAXIMUM is symmetric:

1 while right[x] != NULL
2  do x <--- right[x]
3 return x

Both of these procedures run in O(h) time.

SuCcessor and predecessoR

It is sometimes important to be able to find the successor to a node in the sorted order determined by an inorder tree walk. There are 2 cases:

  1. If the right subtree of node x is nonempty, then the successor of x is the leftmost node in the right subtree, given by TREE_MINIMUM(right[x]).
  2. If the right subtree of x is empty, and x has a successor y, then y is the lowest ancestor of x whose left child is also an ancestor of x.
1 if right[x] != NULL
2   then return TREE_MINIMUM(right[x])
3 y <-- p[x]
4 while y!=NULL && x=right[y]
5      do x <--- y
6         y <--- p[y]
7 return y

TREE_PREDECESSOR is symmetric to TREE_SUCCESSOR and both run in time O(h),

Insertion and Deletion


The TREE_INSERT procedure works as follows – walk down the tree from the root, choosing the left or right child until we find an appropriate leaf node from which to hang the new node. Hence if our new node is z, for each node x if key[z] < key [x], we know that z must hang somewhere off the left subtree of x, and set x as left[x] and as right[x] if NOT.
When we reach a leaf node we have found the node y to hang off and we assign this node y as parent[z]. We then determine if key[z] < key[y]. If so, then z becomes left[y] and if not, it becomes right[y].

1 y <--- NULL
2 x <--- root[T]
3 while x !=NULL
4  do y <-- x
5     if key[z] < key[x]
6     then x <--- left[x]
7     else x <--- right[x]
8 parent[z] <--- y        //At this point y=parent[x] from loop above
9 if y == NULL
10   then root[T] <--- z
11   else if key[z] < key[y]
12        then left[y] <--- z
13        else right[y] <--- z


There are 3 cases to consider when deleting a node z from a BST.

  1. Node z is a leaf node and has no children In this case we modify parent[z] to replace z as its child with NULL.
  2. Node z has only 1 child. In this case we “splice out” z by making a new link between child[z] and parent[z].
  3. Node z has 2 children. In this case we splice out z ‘s successor y, which has no left child (else it wouldn’t be the successor) and replace z ‘s key and satellite data with y ‘s key and satellite data.
1 if left[z]==NULL or right[z]==NULL // Cases 1 && 2
2     then y <--- z                     // Cases 1 && 2
3     else y <--- TREE_SUCCESSOR(z)     // Case 3
4 if left[y] != NULL                    // Lines 4-6 set x to non-NULL child of y, or NULL
5     then x <--- left[y]               // if y has no children 
6     else x <--- right[y]
7 if x != NULL                          // Line 7-13 splice out y
8    then parent[x] <--- parent[y]
9 if parent[y]==NULL                    // y is root,splice it out and set its child as
10   then root[T] <--- x                // root
11   else if y==left[parent[y]]          // If y is the left child of its parent, 
12         then left[parent[y]]  <--- x // splice it out and replace it with it's child x
13         else right[parent[y]] <--- x // else replace the right child of y's parent  
14 if y != z                             // In lines 14-16, if the successor of z y was the
15    then key[z] <--- key[y]           // node spliced out, replace z's key and data with
16       copy y's satellite data into z // that of y
17 return y 

Both insertion and deletion procedures run in O(h) time on a tree of height h.
References :


Top N Queries in SQL

One query that is one often encounters is to answer the question, given data in a database table, how do we obtain the TOP N based on some column of the table ?
There are a few ways to do this, but I will present the most popular 2 methods.

I will illustrate using the following table, emp:

| empno | ename | job | mgr | hiredate | sal | comm | deptno |
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 | NULL | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 | NULL | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 | NULL | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 | NULL | 20 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 | NULL | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 | NULL | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 | NULL | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 | NULL | 10 |
14 rows in set (0.00 sec)

Suppose we wish to obtain the top 5 employees by salary, how can one do this ?

  1. Using the RANK() function.
    In databases which provide the RANK function, we can obtain the top 5 employees
    by salary using the following query:SELECT empno, ename, sal
    FROM (SELECT empno, ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
    FROM emp)
    WHERE sal_rank <= 5;
    | empno | ename | sal |
    | 7839 | KING | 5000.00 |
    | 7788 | SCOTT | 3000.00 |
    | 7902 | FORD | 3000.00 |
    | 7566 | JONES | 2975.00 |
    | 7698 | BLAKE | 2850.00 |
    5 rows returned in 0.02 seconds
  2. Using an expression that limits the number of rows returned from an ordered SQL result set.

The list of expressions for the databases are shown below:


Database Expression
MySQL/PostgreSQL/Vertica LIMIT

ii. Here are the corr. queries for each database:

SELECT empno, ename, sal FROM (SELECT empno, ename, sal ORDER BY sal DESC) a

SELECT * FROM (SELECT empno, ename, sal FROM emp ORDER BY sal DESC) A WHERE ROWNUM <= 5;
SELECT empno, ename, sal FROM emp ORDER BY sal DESC;

SELECT TOP 5 FROM (SELECT empno, ename, sal FROM emp ORDER BY sal DESC);

Linked List Implementation

A linked list looks like this:


Note that a linked list consists of one or more nodes. Each node contains some data (in this example, item 1, item 2, etc) and a pointer. For each node other than the last one, the pointer points to the next node in the list. For the last node, the pointer is null (indicated in the example using a diagonal line). To implement linked lists in Java, we define a ListNode class, to be used to represent the individual nodes of the list.

Note that the next field of a ListNode is itself of type ListNode. That works because in Java, every non-primitive type is really a pointer; so a ListNode object is really a pointer that is either null or points to a piece of storage (allocated at runtime) that consists of two fields named data and next.

Linked List Operations

Before thinking about how to implement lists using linked lists, let’s consider some basic operations on linked lists:

  • Adding a node after a given node in the list.
  • Removing a given node from the list.

Adding a node

Assume that we are given:

  1. n, (a pointer to) a node in a list (i.e., n is a ListNode), and
  2. newdat, the data to be stored in a new node

and that the goal is to add a new node containing newdat immediately after n.
To do this we must perform the following steps:

  1. create the new node using the given data
  2.  “link it in”:
    a) make the new node’s next field point to whatever n‘s next field was pointing to
    b) make n‘s next field point to the new node.



And here’s the code:

ListNode tmp = new ListNode(newdat);    // Step 1
tmp.setNext( n.getNext() );             // Step 2(a)
n.setNext( tmp );                       // Step 2(b)

Note that it is vital to first copy the value of n’s next field into tmp’s next field (step 2(a)) before setting n’s next field to point to the new node (step 2(b)). If we set n’s next field first, we would lose our only pointer to the rest of the list after node n!

Also note that, in order to follow the steps shown in the picture above, we needed to use variable tmp to create the new node (in the picture, step 1 shows the new node just “floating” there, but that isn’t possible — we need to have some variable point to it so that we can set its next field, and so that we can set n’s next field to point to it). However, we could in fact accomplish steps 1 and 2 with a single statement that creates the new node, fills in its data and next fields, and sets n’s next field to point to the new node! Here is that amazing statement:

n.setNext(new Listnode(newdat, n.getNext()) ); // steps 1,2(a),2(b)

Now consider the worst-case running time for this add operation. Whether we use the single statement or the list of three statements, we are really doing the same thing:

  1. Using new to allocate space for a new node (start step 1).
  2. Initializing the new node’s data and next fields (finish step 1 + step 2(a)).
  3. Changing the value of n’s next field (step 2(b)).

We will assume that storage allocation via new takes constant time. Setting the values of the three fields also takes constant time, so the whole operation is a constant-time (O(1)) operation. In particular, the time required to add a new node immediately after a given node is independent of the number of nodes already in the list.

Removing a node

To remove a given node n from a linked list, we need to change the next field of the node that comes immediately before n in the list to point to whatever n’s next field was pointing to. Here’s the conceptual picture:



Note that the fact that n’s next field is still pointing to a node in the list doesn’t matter — n has been removed from the list, because it cannot be reached from L. It should be clear that in order to implement the remove operation, we first need to have a pointer to the node before node n (because that node’s next field has to be changed). The only way to get to that node is to start at the beginning of the list. We want to keep moving along the list as long as the current node’s next field is not pointing to node n. Here’s the appropriate code:

ListNode tmp = L;
while (tmp.getNext() != n) 
       tmp = tmp.getNext();  // find the node before n

Note that this kind of code (moving along a list until some condition holds) is very common. For example, similar code would be used to implement a lookup operation on a linked list (an operation that determines whether there is a node in the list that contains a given piece of data).
Note also that there is one case when the code given above will not work. When n is the very first node in the list, the picture is like this:



In this case, the test (tmp.getNext() n)= will always be false, and eventually we will “fall off the end” of the list (i.e., tmp will become null, and we will get a runtime error when we try to dereference a null pointer). We will take care of that case in a minute; first, assuming that n is not the first node in the list, here’s the code that removes n from the list:

ListNode tmp = L;
while (tmp.getNext() != n) 
       tmp = tmp.getNext();  // find the node before n 
tmp.setNext( n.getNext() );  // remove n from the linked list

How can we test whether n is the first node in the list, and what should we do in that case? If n is the first node, then L will be pointing to it, so we can test whether L == n. The following before and after pictures illustrate removing node n when it is the first node in the list:


Here’s the complete code for removing node n from a linked list, including the special case when n is the first node in the list:

 if (L == n) {
  // special case: n is the first node in the list
    L = n.getNext();
} else {
  // general case: find the node before n, then "unlink" n
    ListNode tmp = L;
    while (tmp.getNext() != n) tmp = tmp.getNext();
    tmp.setNext( n.getNext() );

List class method

Let’s discuss these 3 functions:

  1. The version of add that adds to the end of the list.
  2. The version of add that adds to a given position in the list.
  3. The constructor.

add (to end of list)
Recall that the first version of method add adds a given value to the end of the list. We have already discussed how to add a new node to a linked list following a given node. The only question is how best to handle adding a new node at the end of the list. A straightforward approach would be to traverse the list, looking for the last node (i.e., use a variable tmp as was done above in the code that looked for the node before node n). Once the last node is found, the new node can be inserted immediately after it.
The disadvantage of this approach is that it requires O(N) time to add a node to the end of a list with N items. An alternative is to add a lastNode field (often called a tail pointer) to the List class, and to implement the methods that modify the linked list so that lastNode always points to the last node in the linked list (which will be the header node if the list is empty). There is more opportunity for error (since several methods will need to ensure that the lastNode field is kept up to date), but the use of the lastNode field will mean that the worst-case running time for this version of add is always O(1).

add (at a given position)
As discussed above for the “add to the end” method, we already know how to add a node to a linked list after a given node. So to add a node at position pos, we just need to find the previous node in the list.

Here’s a picture of the “ant, bat, cat” list, when the implementation includes a lastNode pointer:



The List constructor
The List constructor needs to initialize the three List fields:

  1. Listnode items (the pointer to the header node)
  2. Listnode lastNode (the pointer to the last node in the list)
  3. int numItems

so that the list is empty. An empty list is one that has just a header node, pointed to by both items and lastNode.

Linked List Variations

Here we discuss 2 variations of a linked list:

  1. doubly linked lists
  2. circular linked lists

Doubly linked lists

Recall that, given (only) a pointer to a node n in a linked list with N nodes, removing node n takes time O(N) in the worst case, because it is necessary to traverse the list looking for the node just before n. One way to fix this problem is to require two pointers: a pointer the the node to be removed, and also a pointer to the node just before that one. Another way to fix the problem is to use a doubly linked list.

Here’s the conceptual picture:


Each node in a doubly linked list contains three fields: the data, and two pointers. One pointer points to the previous node in the list, and the other pointer points to the next node in the list. The previous pointer of the first node, and the next pointer of the last node are both null. Here’s the Java class definition for a doubly linked list node: DoubleListNode

To remove a given node n from a doubly linked list, we need to change the prev field of the node to its right, and we need to change the next field of the node to its left, as illustrated below.


Here’s the code for removing node n:


// Step 1: change the prev field of the node after n
  DoubleListNode tmp = n.getNext();
  tmp.setPrev( n.getPrev() );

// Step 2: change the next field of the node before n
  tmp = n.getPrev();
  tmp.setNext( n.getNext() );

Unfortunately, this code doesn’t work (causes an attempt to dereference a null pointer) if n is either the first or the last node in the list. We can add code to test for these special cases, or we can use a circular, doubly linked list, as discussed below.

Circular linked lists

Both singly and doubly linked lists can be made circular. Here are the conceptual pictures:


The class definitions are the same as for the non-circular versions. The difference is that, instead of being null, the next field of the last node points to the first node, and (for doubly linked circular lists) the prev field of the first node points to the last node.

The code given above for removing node n from a doubly linked list will work correctly except when node n is the first node in the list. In that case, the variable L that points to the first node in the list needs to be updated, so special-case code will always be needed unless the list includes a header node.

Another issue that you must address if you use a circular linked list is that if you’re not careful, you may end up going round and round in circles! For example, what happens if you try to search for a particular value val using code like this:

ListNode tmp = L;
while (tmp != null && !tmp.getData().equals(val)) tmp = tmp.getNext();

and the value is not in the list? You will have an infinite loop!


Comparison of Linked List Variations

The major disadvantage of doubly linked lists (over singly linked lists) is that they require more space (every node has two pointer fields instead of one). Also, the code to manipulate doubly linked lists needs to maintain the prev fields as well as the next fields; the more fields that have to be maintained, the more chance there is for errors. The major advantage of doubly linked lists is that they make some operations (like the removal of a given node, or a right-to-left traversal of the list) more efficient.

The major advantage of circular lists (over non-circular lists) is that they eliminate some special-case code for some operations. Also, some applications lead naturally to circular list representations. For example, a computer network might best be modeled using a circular list.


Complete Linked List Implementation

The complete linked list implementation may be found here: ListImpl


Thread Deadlock in Java

Deadlock occurs when a group of processes blocks forever because each process is waiting for resources which are held by another process in the group. What happens is that a task is stuck waiting for another task to release a resource which itself is stuck waiting for another task to release a resource and so on such that a circular wait loop ensures. The result is that no task can proceed. Thus a deadlock results.

The classis case of deadlock is the Dining Philosopher problem.
In this problem we have, say 5 philosophers sitting down for dinner at a round table.
To the left and right of each philosopher is a chopstick and there are 5 of these chopsticks F1 – F5, illustrated below:

Dining philosopher problem


In order for each philospher to eat, they must pick up the left and right chopsticks.
Each philosopher decides to pick up the chopstick on his right 1st before picking up the one on his left.
After picking up the chopstick on the right, each philosopher attempts to pick up the chopstick on his left and if it is not yet available, has to wait.

Thus we can have the following scenario:

P1 picks up F1, waits to pick up F2
P2 picks up F2, waits to pick up F3
P3 picks up F3, waits to pick up F4
P4 picks up F4, waits to pick up F1
P5 picks up F5, waits to pick up F1

Thus we have a circular wait scenario, where each philosopher is waiting on the next philosopher to his left to drop his right chopstick and so on such that no philosopher can eat.

Here is Java code for a simpler example of deadlock involving 2 tasks:

public class DeadlockDemo
  public Integer obj1 = 1;
  public Integer obj2 = 2;

  private class Runnable1 implements Runnable
   public void run()
     System.out.println("R1: Obtained lock on obj1:" + obj1);
     System.out.println("R1: Waiting to obtain lock on obj2..."):
      System.out.println("R1: Obtained lock on obj2:" + obj2);

  private class Runnable2 implements Runnable
   public void run()
     System.out.println("R2: Obtained lock on obj2:" + obj2);
     System.out.println("R2: Waiting to obtain lock on obj1..."):
      System.out.println("R2: Obtained lock on obj1:" + obj1);

  public static void main(String[] args)
   DeadlockDemo dDemo=new DeadlockDemo();
   Runnable Runnable1();
   Runnable r2=new Runnable2();
   new Thread(r1).start();
   new Thread(r2).start();


I ran the above code and it produced the following result:

R2: Obtained lock on obj2:2
R2: Waiting to obtain lock on obj1...
R1: Obtained lock on obj1:1
R1: Waiting to obtain lock on obj2...

and the program hung with the 2 threads stuck in a circular wait.

Defining and Starting a Thread in Java

There are 3 ways to do this:

1. Subclass Thread class

i. Subclass the Thread class and override the run() method
ii. Instantiate the Thread subclass.
iii. Call Thread.start() method.

public class MyThread extends Thread {
   public void run() { 
      System.out.println("Thread via Extending Thread!"); 
   public static void main(String []args) { 
      (new MyThread()).start(); 



2. Provide a Runnable object by implementing Runnable

Runnable interface defines a single method run(), meant to contain the code executed in the thread.

i. Implement the Runnable interface by implementing run() method.
ii. Pass instance of Runnable object to the Thread(..) constructor.
iii. Call Thread.start() method.

public class MyRunnable implements Runnable {

   public void run() {
      System.out.println("Thread via Implementing Runnable!");

   public static void main(String[] args) {
      (new Thread(new MyRunnable())).start();



Notice that both cases above invoke Thread.start() in order to start the new thread. In either case above, the result is a Thread object, where the run() method is the body of the thread. When the start() method of the Thread object is called, the interpreter creates a new thread to execute the run() method. The new thread continues to run until the run() method exits. Meanwhile the original thread continues running itself, starting with the statement following the start() method.


3. Using Executor interface

The Executor interface can be used to invoke threads as well. It isn’t really a new idiom, since either a Thread object or object that implements Runnable needs to be created first. The Executor interface provides a layer of indirection between a client and the execution of a task; instead of a client executing a task directly, an intermediate object executes the task. Executors allow you to manage the execution of asynchronous tasks without having to explicitly manage the lifecycle of threads.


  1. Implement and create a new instance of Runnable.
  2. Create a concrete instance of ExecutorService by calling one of the Executors factory methods.
  3. Call Executor.execute(..) method, passing the Runnable object as argument.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MyExecutor {

	public static void main(String[] args) {
		ExecutorService exec = Executors.newCachedThreadPool(); 
	    for(int i = 0; i < 5; i++) 
	      exec.execute(new MyRunnable()); //MyRunnable implements Runnable


Thread vs. Runnable
Which of these idioms should you use? The first idiom, which employs a Runnable object, is more general, because the Runnable object can subclass a class other than Thread. Also using Runnable enhances separation of concerns vi a composition, by separating the method of execution from the interface used to construct the Runnable
The second idiom is easier to use in simple applications, but is limited by the fact that your task class must be a descendant of Thread.

Executor Factory Methods

FactoryMethod Details
newCachedThreadPool() Creates thread pool that creates new threads as needed, but will reuse previously constructred threads when they’re available
newFixedThreadPool(..) Creates thread pool that reuses a fixed number of threads operating off a shared unbounded queue
newScheduledThreadPool(..) Creates a thread pool that can schedule commands to run after a given delay, or to execute periodically
newSingleThreadExecutor() Creates an Executor that uses a single worker thread operating off an unbounded queue
newSingleThreadScheduledExecutor() Creates a single-threaded executor that can schedule commands to run after a given delay, or to execute periodically

Abstract Data Types vs Data Structures

What is the difference between an abstract data type (ADT) and a data structure ?

The question above can be reframed in a more concrete manner by asking this question:
What is the difference between an array and a stack ?
An array is a data structure while a stack is an abstract data type.
An abstract data type (ADT) is a specification for how an ‘data’ interface should behave without any reference to its actual implementation.

The wikipedia definition of an ADT is as follows:
An abstract data type is defined as a mathematical model of the data objects that make up a data type as well as the functions that operate on these objects.
From the NIST Dictionary of Algorithms and Data structures we have the following definition of an ADT:
A set of data values and associated operations that are precisely specified independent of any particular implementation.

Thus in the case of a stack, it exhibits Last In First Out (LIFO) behavior when elements are added to and removed from it. The concrete implementation of the ADT is where the data structure comes in. Thus a Stack can be implemented as an array or as a linked list.

This might lead us to conclude that an ADT is more a theoretical or abstract concept, while the data structure has more to do with the concrete implementation.

Under the definitions above, the following constructs are ADTs:

  • Stack
  • Queue
  • Bag
  • List
  • Priority Queue
  • Trie
  • Heap
  • Binary Tree
  • Set
  • Map

while these are data structures:

  • array
  • linked list
  • hash map/dictionary

We can gain a better understanding of this when we consider these constructs in Java.
The ADT corresponds to the interface type, while the data structure would correspond to the concrete class.Thus a Java array, ArrayList, LinkedList, HashMap are actually data structures and the corresponding interfaces they implement would be equivalent to ADTs.
See this article for more about abstract data types in Java

Variable storage in Java

In order to figure out where a variable is stored in Java the most important factor is where the variable is declared.
A general rule of thumb is this:

  • local variables are stored on the stack
  • instance variables are stored on the heap
  • static variables are stored on the PermGen area of the heap

There are caveats to this however, which are explained below:

Variable Storage Details

Local variables
Primitives and object references declared in a method will be stored on the stack. However, the actual object, if created using new() will be stored on the heap, regardless of where the declaration took place. Hence in the following piece of code:
void aMethod()
int playerNum=5;
Player pl=new Player();

The primitive variable playerNum and object reference variable pl will be stored on the stack, while the actual Player object itself will live on the heap. When the code exits aMethod and goes out of scope, playerNum and pl will be popped from the stack and cease to exist but the Player object will persist on the heap until it is eventually garbage collected.
Instance variables
Instance variables, even primitives live on the heap.

Consider the code below:

public class Car {
int vinNumber;
String make;
String model;
int year;
String class;

Car(int vin, String make, String model, int year, String class)
public static void main(String[] args)
Car c=new Car(19281,"Audi", "A6",2012,"sedan");

Since an instance of Car can only be instantiated via a call to new(), we see that:

  • The Car object c lives on the heap
  • All instance primitives and objects that are part of the Car object are also stored on the heap.

Static variables
The rule for static variables is this: Static methods, primitive variables and object references are stored in the PermGen section of the heap since they are part of the reflection i.e. class, not instance related data. However, in the case of objects, the actual object itself is stored in the regular areas of the heap (young/old generation or survivor space).




Clustered Indexes

In this article I explain what is meant by a clustered index and compare it with a non-clustered index.


A clustered index on a table determines the order in which the rows of the table are stored on disk. If the table has a clustered index, the rows of the table are stored in the same order as that of the clustered index. As an illustration, suppose we have a Customer table that contains the following columns:

  • customerID
  • firstName
  • lastName

where customerID is the primary key on the table. If we define the clustered index on customerID, then the rows of the table will be stored in sorted order according to customerID. This means that rows with customerID=1000, customerID=1001, etc will be adjacent to each other on disk.

Advantages and Disadvantages

The advantages of having a clustered index include the following:

  1. Range queries involving the clustered index will be faster since once the row with 1st key value is located the remaining rows physically stored next to each other and no more searching is needed.
  2. The leaf nodes of the B-tree that make up the clustered index contain the actual data pages as opposed to a non-clustered index where the leaf nodes contain pointers to rows on data pages. Hence there is 1 less level of indirection for a clustered index and this improves performance.

The disadvantages of a clustered index include:

  1. Updates involving columns used in the clustered index result in a performance hit since the rows may have to be re-arranged to keep the table in sorted order in line with the clustered index. In light of this, it is recommended that a clustered index is created on a primary or foreign key, since this would be less prone to updates.

Comparison of clustered vs non-clustered indexes

  • Returning to our Customer table example, if we have a clustered index on Customer id, then the leaf node of the clustered index will contain the actual row data (data pages) for a particular customerID while for a non-clustered index the value of customerID and a pointer to the actual row is what is stored at the leaf node.
  • There can only be 1 clustered index per table, but multiple non-clustered indexes (up to 249 in the case of Sybase). This is because the rows in the table are physically ordered according to the clustered index and there is only 1 way of doing so.
  • A clustered index determines the order in which the rows of the table can be stored on disk, but this is not the case for a non-clustered index
  • A clustered index can be a performance hit in the case of updates to an indexed column since the rows may have to be re-ordered in order to maintain order. Range queries and queries involving indexed foreign keys tend to show better performance for a clustered vs non-clustered index.


For a more in depth look at clustered indexes, see the following very good article by Michelle Ufford : 
Effective Clustered Indexes

Removing duplicates from a table with no primary keys

Consider the following table:

mysql> SELECT * FROM Philosopher;
| philosopherID | firstName | lastName |
| 1234 | John | Locke |
| 1234 | John | Locke |
| 2345 | Rene | Descartes |
| 2347 | John Stuart | Mill |
| 1562 | Emmanuel | Kant |
| 1562 | Emmanuel | Kant |
| 1671 | Baruch | Spinoza |
| 1562 | Emmanuel | Kant |
| 1761 | Jean-Paul | Sartre |
9 rows in set (0.00 sec)

Come up with a strategy to first identify and then remove the duplicate rows.

What SQL queries would you use ?


1. Identify duplicate rows:

SELECT * FROM Philosopher
 GROUP BY philosopherID, firstName, lastName HAVING COUNT(*) > 1;

2. Remove duplicate rows

i. Create temporary table and copy data over:

 philosopherID INT,
 firstName VARCHAR(50),
 lastName varchar(50)

SELECT philosopherID,firstName, lastName FROM Philosopher;

ii. Truncate original table and do SELECT DISTINCT:

INSERT INTO Philosopher(philosopherID, firstName, lastName)
SELECT DISTINCT philosopherID, firstName, lastName 
FROM PhilosopherDups;

iii. Another option to using DISTINCT would be to use the rowId in the temptable to delete duplicate rows, truncate the original table and copy data back:

DELETE FROM PhilosopherDups
SELECT MAX(rowId) FROM PhilosopherDups  
GROUP BY philosopherID, firstName, lastName 


INSERT INTO Philosopher(philosopherID, firstName, lastName)
SELECT philosopherID, firstName, lastName 
FROM PhilosopherDups;