Top N Queries in SQL

One query that is one often encounters is to answer the question, given data in a database table, how do we obtain the TOP N based on some column of the table ?
There are a few ways to do this, but I will present the most popular 2 methods.

I will illustrate using the following table, emp:

SELECT * FROM emp;
+-------+--------+-----------+------+------------+---------+---------+--------+
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
+-------+--------+-----------+------+------------+---------+---------+--------+
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 | NULL | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 | NULL | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 | NULL | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 | NULL | 20 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 | NULL | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 | NULL | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 | NULL | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 | NULL | 10 |
+-------+--------+-----------+------+------------+---------+---------+--------+
14 rows in set (0.00 sec)

Suppose we wish to obtain the top 5 employees by salary, how can one do this ?

  1. Using the RANK() function.
    In databases which provide the RANK function, we can obtain the top 5 employees
    by salary using the following query:SELECT empno, ename, sal
    FROM (SELECT empno, ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
    FROM emp)
    WHERE sal_rank <= 5;
    +-------+-------+---------+
    | empno | ename | sal |
    +-------+-------+---------+
    | 7839 | KING | 5000.00 |
    | 7788 | SCOTT | 3000.00 |
    | 7902 | FORD | 3000.00 |
    | 7566 | JONES | 2975.00 |
    | 7698 | BLAKE | 2850.00 |
    +-------+-------+---------+
    5 rows returned in 0.02 seconds
  2. Using an expression that limits the number of rows returned from an ordered SQL result set.

The list of expressions for the databases are shown below:

 

Database Expression
Oracle ROWCOUNT
MySQL/PostgreSQL/Vertica LIMIT
Sybase ROWNUM
MS SQL TOP

ii. Here are the corr. queries for each database:

Vertica/PostgreSQL/MySQL
SELECT empno, ename, sal FROM (SELECT empno, ename, sal ORDER BY sal DESC) a
LIMIT 5;

Oracle
SELECT * FROM (SELECT empno, ename, sal FROM emp ORDER BY sal DESC) A WHERE ROWNUM <= 5;
Sybase
SET ROWCOUNT 10;
SELECT empno, ename, sal FROM emp ORDER BY sal DESC;

MS SQL
SELECT TOP 5 FROM (SELECT empno, ename, sal FROM emp ORDER BY sal DESC);

Linked List Implementation

A linked list looks like this:

linked_list

Note that a linked list consists of one or more nodes. Each node contains some data (in this example, item 1, item 2, etc) and a pointer. For each node other than the last one, the pointer points to the next node in the list. For the last node, the pointer is null (indicated in the example using a diagonal line). To implement linked lists in Java, we define a ListNode class, to be used to represent the individual nodes of the list.

Note that the next field of a ListNode is itself of type ListNode. That works because in Java, every non-primitive type is really a pointer; so a ListNode object is really a pointer that is either null or points to a piece of storage (allocated at runtime) that consists of two fields named data and next.

Linked List Operations

Before thinking about how to implement lists using linked lists, let’s consider some basic operations on linked lists:

  • Adding a node after a given node in the list.
  • Removing a given node from the list.

Adding a node

Assume that we are given:

  1. n, (a pointer to) a node in a list (i.e., n is a ListNode), and
  2. newdat, the data to be stored in a new node

and that the goal is to add a new node containing newdat immediately after n.
To do this we must perform the following steps:

  1. create the new node using the given data
  2.  “link it in”:
    a) make the new node’s next field point to whatever n‘s next field was pointing to
    b) make n‘s next field point to the new node.

addnode

 

And here’s the code:

ListNode tmp = new ListNode(newdat);    // Step 1
tmp.setNext( n.getNext() );             // Step 2(a)
n.setNext( tmp );                       // Step 2(b)

Note that it is vital to first copy the value of n’s next field into tmp’s next field (step 2(a)) before setting n’s next field to point to the new node (step 2(b)). If we set n’s next field first, we would lose our only pointer to the rest of the list after node n!

Also note that, in order to follow the steps shown in the picture above, we needed to use variable tmp to create the new node (in the picture, step 1 shows the new node just “floating” there, but that isn’t possible — we need to have some variable point to it so that we can set its next field, and so that we can set n’s next field to point to it). However, we could in fact accomplish steps 1 and 2 with a single statement that creates the new node, fills in its data and next fields, and sets n’s next field to point to the new node! Here is that amazing statement:

n.setNext(new Listnode(newdat, n.getNext()) ); // steps 1,2(a),2(b)

Now consider the worst-case running time for this add operation. Whether we use the single statement or the list of three statements, we are really doing the same thing:

  1. Using new to allocate space for a new node (start step 1).
  2. Initializing the new node’s data and next fields (finish step 1 + step 2(a)).
  3. Changing the value of n’s next field (step 2(b)).

We will assume that storage allocation via new takes constant time. Setting the values of the three fields also takes constant time, so the whole operation is a constant-time (O(1)) operation. In particular, the time required to add a new node immediately after a given node is independent of the number of nodes already in the list.

Removing a node

To remove a given node n from a linked list, we need to change the next field of the node that comes immediately before n in the list to point to whatever n’s next field was pointing to. Here’s the conceptual picture:

removenode

 

Note that the fact that n’s next field is still pointing to a node in the list doesn’t matter — n has been removed from the list, because it cannot be reached from L. It should be clear that in order to implement the remove operation, we first need to have a pointer to the node before node n (because that node’s next field has to be changed). The only way to get to that node is to start at the beginning of the list. We want to keep moving along the list as long as the current node’s next field is not pointing to node n. Here’s the appropriate code:

ListNode tmp = L;
while (tmp.getNext() != n) 
       tmp = tmp.getNext();  // find the node before n

Note that this kind of code (moving along a list until some condition holds) is very common. For example, similar code would be used to implement a lookup operation on a linked list (an operation that determines whether there is a node in the list that contains a given piece of data).
Note also that there is one case when the code given above will not work. When n is the very first node in the list, the picture is like this:

firstnode

 

In this case, the test (tmp.getNext() n)= will always be false, and eventually we will “fall off the end” of the list (i.e., tmp will become null, and we will get a runtime error when we try to dereference a null pointer). We will take care of that case in a minute; first, assuming that n is not the first node in the list, here’s the code that removes n from the list:

ListNode tmp = L;
while (tmp.getNext() != n) 
       tmp = tmp.getNext();  // find the node before n 
tmp.setNext( n.getNext() );  // remove n from the linked list

How can we test whether n is the first node in the list, and what should we do in that case? If n is the first node, then L will be pointing to it, so we can test whether L == n. The following before and after pictures illustrate removing node n when it is the first node in the list:

removeFirst

Here’s the complete code for removing node n from a linked list, including the special case when n is the first node in the list:

 if (L == n) {
  // special case: n is the first node in the list
    L = n.getNext();
} else {
  // general case: find the node before n, then "unlink" n
    ListNode tmp = L;
    while (tmp.getNext() != n) tmp = tmp.getNext();
    tmp.setNext( n.getNext() );
}

List class method

Let’s discuss these 3 functions:

  1. The version of add that adds to the end of the list.
  2. The version of add that adds to a given position in the list.
  3. The constructor.

add (to end of list)
Recall that the first version of method add adds a given value to the end of the list. We have already discussed how to add a new node to a linked list following a given node. The only question is how best to handle adding a new node at the end of the list. A straightforward approach would be to traverse the list, looking for the last node (i.e., use a variable tmp as was done above in the code that looked for the node before node n). Once the last node is found, the new node can be inserted immediately after it.
The disadvantage of this approach is that it requires O(N) time to add a node to the end of a list with N items. An alternative is to add a lastNode field (often called a tail pointer) to the List class, and to implement the methods that modify the linked list so that lastNode always points to the last node in the linked list (which will be the header node if the list is empty). There is more opportunity for error (since several methods will need to ensure that the lastNode field is kept up to date), but the use of the lastNode field will mean that the worst-case running time for this version of add is always O(1).

add (at a given position)
As discussed above for the “add to the end” method, we already know how to add a node to a linked list after a given node. So to add a node at position pos, we just need to find the previous node in the list.

Here’s a picture of the “ant, bat, cat” list, when the implementation includes a lastNode pointer:

lastnode

 

The List constructor
The List constructor needs to initialize the three List fields:

  1. Listnode items (the pointer to the header node)
  2. Listnode lastNode (the pointer to the last node in the list)
  3. int numItems

so that the list is empty. An empty list is one that has just a header node, pointed to by both items and lastNode.

Linked List Variations

Here we discuss 2 variations of a linked list:

  1. doubly linked lists
  2. circular linked lists

Doubly linked lists

Recall that, given (only) a pointer to a node n in a linked list with N nodes, removing node n takes time O(N) in the worst case, because it is necessary to traverse the list looking for the node just before n. One way to fix this problem is to require two pointers: a pointer the the node to be removed, and also a pointer to the node just before that one. Another way to fix the problem is to use a doubly linked list.

Here’s the conceptual picture:

doubly_linked

Each node in a doubly linked list contains three fields: the data, and two pointers. One pointer points to the previous node in the list, and the other pointer points to the next node in the list. The previous pointer of the first node, and the next pointer of the last node are both null. Here’s the Java class definition for a doubly linked list node: DoubleListNode

To remove a given node n from a doubly linked list, we need to change the prev field of the node to its right, and we need to change the next field of the node to its left, as illustrated below.

removenode_dbl

Here’s the code for removing node n:

 

// Step 1: change the prev field of the node after n
  DoubleListNode tmp = n.getNext();
  tmp.setPrev( n.getPrev() );

// Step 2: change the next field of the node before n
  tmp = n.getPrev();
  tmp.setNext( n.getNext() );

Unfortunately, this code doesn’t work (causes an attempt to dereference a null pointer) if n is either the first or the last node in the list. We can add code to test for these special cases, or we can use a circular, doubly linked list, as discussed below.

Circular linked lists

Both singly and doubly linked lists can be made circular. Here are the conceptual pictures:

circular

The class definitions are the same as for the non-circular versions. The difference is that, instead of being null, the next field of the last node points to the first node, and (for doubly linked circular lists) the prev field of the first node points to the last node.

The code given above for removing node n from a doubly linked list will work correctly except when node n is the first node in the list. In that case, the variable L that points to the first node in the list needs to be updated, so special-case code will always be needed unless the list includes a header node.

Another issue that you must address if you use a circular linked list is that if you’re not careful, you may end up going round and round in circles! For example, what happens if you try to search for a particular value val using code like this:

ListNode tmp = L;
while (tmp != null && !tmp.getData().equals(val)) tmp = tmp.getNext();

and the value is not in the list? You will have an infinite loop!

 

Comparison of Linked List Variations

The major disadvantage of doubly linked lists (over singly linked lists) is that they require more space (every node has two pointer fields instead of one). Also, the code to manipulate doubly linked lists needs to maintain the prev fields as well as the next fields; the more fields that have to be maintained, the more chance there is for errors. The major advantage of doubly linked lists is that they make some operations (like the removal of a given node, or a right-to-left traversal of the list) more efficient.

The major advantage of circular lists (over non-circular lists) is that they eliminate some special-case code for some operations. Also, some applications lead naturally to circular list representations. For example, a computer network might best be modeled using a circular list.

 

Complete Linked List Implementation

The complete linked list implementation may be found here: ListImpl
References
http://pages.cs.wisc.edu/~vernon/cs367/notes/4.LINKED-LIST.html