Top N Queries in SQL

One query that is one often encounters is to answer the question, given data in a database table, how do we obtain the TOP N based on some column of the table ?
There are a few ways to do this, but I will present the most popular 2 methods.

I will illustrate using the following table, emp:

SELECT * FROM emp;
+-------+--------+-----------+------+------------+---------+---------+--------+
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
+-------+--------+-----------+------+------------+---------+---------+--------+
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 | NULL | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 | NULL | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 | NULL | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 | NULL | 20 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 | NULL | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 | NULL | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 | NULL | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 | NULL | 10 |
+-------+--------+-----------+------+------------+---------+---------+--------+
14 rows in set (0.00 sec)

Suppose we wish to obtain the top 5 employees by salary, how can one do this ?

  1. Using the RANK() function.
    In databases which provide the RANK function, we can obtain the top 5 employees
    by salary using the following query:SELECT empno, ename, sal
    FROM (SELECT empno, ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
    FROM emp)
    WHERE sal_rank <= 5;
    +-------+-------+---------+
    | empno | ename | sal |
    +-------+-------+---------+
    | 7839 | KING | 5000.00 |
    | 7788 | SCOTT | 3000.00 |
    | 7902 | FORD | 3000.00 |
    | 7566 | JONES | 2975.00 |
    | 7698 | BLAKE | 2850.00 |
    +-------+-------+---------+
    5 rows returned in 0.02 seconds
  2. Using an expression that limits the number of rows returned from an ordered SQL result set.

The list of expressions for the databases are shown below:

 

Database Expression
Oracle ROWCOUNT
MySQL/PostgreSQL/Vertica LIMIT
Sybase ROWNUM
MS SQL TOP

ii. Here are the corr. queries for each database:

Vertica/PostgreSQL/MySQL
SELECT empno, ename, sal FROM (SELECT empno, ename, sal ORDER BY sal DESC) a
LIMIT 5;

Oracle
SELECT * FROM (SELECT empno, ename, sal FROM emp ORDER BY sal DESC) A WHERE ROWNUM <= 5;
Sybase
SET ROWCOUNT 10;
SELECT empno, ename, sal FROM emp ORDER BY sal DESC;

MS SQL
SELECT TOP 5 FROM (SELECT empno, ename, sal FROM emp ORDER BY sal DESC);

Clustered Indexes

In this article I explain what is meant by a clustered index and compare it with a non-clustered index.

Definition

A clustered index on a table determines the order in which the rows of the table are stored on disk. If the table has a clustered index, the rows of the table are stored in the same order as that of the clustered index. As an illustration, suppose we have a Customer table that contains the following columns:

  • customerID
  • firstName
  • lastName

where customerID is the primary key on the table. If we define the clustered index on customerID, then the rows of the table will be stored in sorted order according to customerID. This means that rows with customerID=1000, customerID=1001, etc will be adjacent to each other on disk.

Advantages and Disadvantages

The advantages of having a clustered index include the following:

  1. Range queries involving the clustered index will be faster since once the row with 1st key value is located the remaining rows physically stored next to each other and no more searching is needed.
  2. The leaf nodes of the B-tree that make up the clustered index contain the actual data pages as opposed to a non-clustered index where the leaf nodes contain pointers to rows on data pages. Hence there is 1 less level of indirection for a clustered index and this improves performance.

The disadvantages of a clustered index include:

  1. Updates involving columns used in the clustered index result in a performance hit since the rows may have to be re-arranged to keep the table in sorted order in line with the clustered index. In light of this, it is recommended that a clustered index is created on a primary or foreign key, since this would be less prone to updates.

Comparison of clustered vs non-clustered indexes

  • Returning to our Customer table example, if we have a clustered index on Customer id, then the leaf node of the clustered index will contain the actual row data (data pages) for a particular customerID while for a non-clustered index the value of customerID and a pointer to the actual row is what is stored at the leaf node.
  • There can only be 1 clustered index per table, but multiple non-clustered indexes (up to 249 in the case of Sybase). This is because the rows in the table are physically ordered according to the clustered index and there is only 1 way of doing so.
  • A clustered index determines the order in which the rows of the table can be stored on disk, but this is not the case for a non-clustered index
  • A clustered index can be a performance hit in the case of updates to an indexed column since the rows may have to be re-ordered in order to maintain order. Range queries and queries involving indexed foreign keys tend to show better performance for a clustered vs non-clustered index.

References

For a more in depth look at clustered indexes, see the following very good article by Michelle Ufford : 
Effective Clustered Indexes

Removing duplicates from a table with no primary keys

Consider the following table:

mysql> SELECT * FROM Philosopher;
+---------------+-------------+-----------+
| philosopherID | firstName | lastName |
+---------------+-------------+-----------+
| 1234 | John | Locke |
| 1234 | John | Locke |
| 2345 | Rene | Descartes |
| 2347 | John Stuart | Mill |
| 1562 | Emmanuel | Kant |
| 1562 | Emmanuel | Kant |
| 1671 | Baruch | Spinoza |
| 1562 | Emmanuel | Kant |
| 1761 | Jean-Paul | Sartre |
+---------------+-------------+-----------+
9 rows in set (0.00 sec)

Come up with a strategy to first identify and then remove the duplicate rows.

What SQL queries would you use ?

Solution

1. Identify duplicate rows:

SELECT * FROM Philosopher
 GROUP BY philosopherID, firstName, lastName HAVING COUNT(*) > 1;

2. Remove duplicate rows

i. Create temporary table and copy data over:

CREATE TEMPORARY TABLE PhilosopherDups
(
 rowId INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
 philosopherID INT,
 firstName VARCHAR(50),
 lastName varchar(50)
)

SELECT philosopherID,firstName, lastName FROM Philosopher;

ii. Truncate original table and do SELECT DISTINCT:

TRUNCATE TABLE Philosopher;
INSERT INTO Philosopher(philosopherID, firstName, lastName)
SELECT DISTINCT philosopherID, firstName, lastName 
FROM PhilosopherDups;

iii. Another option to using DISTINCT would be to use the rowId in the temptable to delete duplicate rows, truncate the original table and copy data back:

DELETE FROM PhilosopherDups
WHERE rowID IN 
SELECT MAX(rowId) FROM PhilosopherDups  
GROUP BY philosopherID, firstName, lastName 
HAVING COUNT(*) > 1;

TRUNCATE TABLE Philosopher;

INSERT INTO Philosopher(philosopherID, firstName, lastName)
SELECT philosopherID, firstName, lastName 
FROM PhilosopherDups;

Future of 21st Century Databases meetup

Some takeaways from Future of 21st Century Databases meetup hosted by AppNexus in NYC and a roundtable featuring NoSQL db heavyweights :

  • Eliot Horowitz, CTO and Co-Founder, 10gen / MongoDB
  • Barry Morris, Founder and CEO, NuoDB
  • Bob Wiederhold, President and CEO, Couchbase

NuoDB
SQL interface but NOSQL underneath
Doesn’t have a document model as yet
The How to elastically scale SQL problem has been solved
NuoDB isn’t open source yet.
Largest known installation: 400 nodes w/ sharding

CouchBase
Many travel sites use CouchDB – e.g. Orbitz migrated its cache from Oracle Coherence
to Couchbase. Full presentation here.
Couchbase doesn’t consider NuoDB a competitor since NuoDB isn’t really document database.
Couchbase DevDays – ways to build up skills
Largest known installation: 80 nodes w/ no application sharding
SQL databases will still be around – but growth area is NOSQL

MongoDB
Mongo has document features more than Couchbase
Largest known installation: 100 nodes w/ no application sharding