If you’re using a commercial database or PostgreSQL / Firebird / CUBRID, you will be able to take advantage of the full power of window functions. We’ve blogged about window functions’ awesomeness a couple of times, in particular about ROW_NUMBER(), RANK(), DENSE_RANK().
Today, we’re going to look into some awesome window functions that produce values of other rows that are positioned before or after the current row.
Setting up the test data
We’re going to do some interesting statistics today using publicly available data from the World Bank. To keep things simple, we’ll only do analyses for the G8 countries:
Canada (CA)
France (FR)
Germany (DE)
Italy (IT)
Japan (JP)
Russian Federation (RU)
United Kingdom (GB)
United States (US)
And for those countries, let’s consider the following data points for the years 2009-2012:
GDP per capita (current US$)
Central government debt, total (% of GDP)
Let’s put all that data into a fact table like so (PostgreSQL syntax):
Start the querying fun
People who are used to SQL-92 syntax will be able to quickly find the highest GDP per capita or the highest debt from the table. It’s an easy query like this one:
Which will return:
But that’s not interesting. We don’t even know what countries and what years these values are associated with.
A standard SQL-92 (and also a standard relational) query to return all of these values would look something like this:
In essence, we select those rows for which there doesn’t exist any other row with a higher value for either gdp_per_capita (first subselect) or govt_debt (second subselect).
Trick! Use quantified comparison predicates!
If your database supports quantified comparison predicates, then you can write this a bit more concisely like this:
Which is essentially the same as…
The output would be:
That’s a lot of SQL for only little analysis capability, and somehow, it just doesn’t feel entirely right to query the same table four times with all these subselects!
FIRST_VALUE() and LAST_VALUE()
This is where window functions come into play, and in this particular case, FIRST_VALUE() or LAST_VALUE(). For now, let’s focus on calculating the maximum GDP per capita from the data set:
Notice how we make use of the SQL standard WINDOW clause, which is only currently supported by PostgreSQL and Sybase SQL Anywhere.
If you’re using Oracle or any other commercial database, you can simply substitute the window reference w_gdp into the various OVER() clauses to achieve equivalent behaviour – or you can use jOOQ’s WINDOW clause support and let jOOQ do the same for you.
The above query will not produce any aggregates, but it will add the values for the country / year with the highest GDP per capita to every row in the table:
This is extremely interesting because the data is not yet aggregated – the original data set remains unchanged, enriched with new computed columns.
You can then further process things, e.g. compare each country / year with the highest GDP per capita and with the highest debt per GDP of that country / year:
Notice how I’ve added PARTITION BY to the window definitions of the WINDOW clause. I’ve done this because I want to partition the data set by year, in order to find the highest GDP / debt values for each year, not for the whole data set.
The outcome of the above query can then be seen here:
We could say that among the G8 countries, Canada has really improved the most in the last years, decreasing their debt compared to the GDP on a global comparison, while at the same time increasing their GDP per capita on a global comparison.
Instead of partitioning the data set by year, we could also partition it by country, and find the best / worst year for each country over the years:
The result would now look quite different:
As you can see, most countries have now generally performed better in terms of GDP per capita over the years, and also most countries have almost strictly increased their own debt per GDP (except for Germany, France and Italy), except for the (United Kingdom). Russia and Canada have seen the most growth.
In the above examples, we’ve been mainly using FIRST_VALUE(). LAST_VALUE() is almost the opposite function with respect to ordering, much like MAX() is the opposite function of MIN(). I’m saying almost because there is a caveat when using LAST_VALUE() with ORDER BY, because a window definition that uses ORDER BY is implicitly equivalent to a window definition that uses ORDER BY with a so-called “frame clause”:
LEAD() and LAG()
The previous functions were about comparing values with the maximum / minimum (FIRST_VALUE() and LAST_VALUE()) within a data set. But using window functions, you can also compare things with the next / previous value. Or with the second next / second previous, etc. The functions used for this are called LEAD() (for the next value) and LAG() (for the previous value).
This is best explained by example:
The result is now simply:
LEAD() and LAG() are really the best window functions to help understand the whole concept of window functions. For each year, you can see immediately how the previous and next year in the same window and frame can be generated using very simple function calls.
This could be used, for instance, to find the “neighboring” countries in terms of GDP per capita for every country / year:
Which returns:
If you want to do more fancy analyses, you could now compare percentages between leaders and runner-ups, etc. Another great use-case for LEAD() and LAG() can be seen in this article.
Conclusion
Window functions are an incredibly powerful feature that is available from all major commercial databases, and also from a couple of Open Source databases like PostgreSQL, Firebird, and CUBRID. There has essentially been SQL before window functions, and SQL after window functions.
With jOOQ, you can leverage window functions on a type safe level like anything else related to SQL. The last query we’ve seen can be written simply like this:
The above program will output
No matter whether you’re using jOOQ for your database integration, or just plain SQL – start using window functions today.
Liked this article?
Read more about how ROW_NUMBER(), RANK(), and DENSE_RANK() work.
Filed under: sql Tagged: FIRST_VALUE, java, jooq, LAG(), LAST_VALUE, LEAD(), sql, Window Functions