SQL for Data Analytics: Window Functions

Structured Query Language, or simply SQL, is popular for a simple reason: it is possibly the most widely used way to work with large datasets spread across several tables in a relational database. Among its many features, window functions are especially useful for data analytics. They let you analyse and aggregate data in more sophisticated ways by performing calculations over a set of rows that are related in some way to the current row. Knowing window functions is a significant advantage for any professional who works heavily with data.

Which Types of Tasks Are Best Suited to Window Functions?

A window function computes over a set of rows that are related to the current row. In contrast to ordinary aggregate functions, where each group of rows collapses into a single result row, a window function leaves the individual rows intact and attaches to each one the result of a calculation over a “window”, that is, a set of related rows. This allows more elaborate analysis while preserving the original level of detail.
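The contrast with ordinary aggregation can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes Python's built-in `sqlite3` module linked against SQLite 3.25 or later (the first release with window functions), and the `sales` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical sales table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "north", 100), (2, "north", 50), (3, "south", 70)],
)

# GROUP BY collapses each region into a single row.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# The window version keeps every row and places the regional total beside it.
windowed = conn.execute(
    """
    SELECT id, region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY id
    """
).fetchall()

print(grouped)
print(windowed)
```

The grouped query returns one row per region, while the windowed query returns all three original rows, each carrying its region's total.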

The idea of a window function is rooted in the definition of “windows”, or groups of rows, within a result set. For instance, when calculating a moving average of stock prices over the last 7 days, the “window” for a given row covers that row’s day and the six days before it. The moving average is then calculated for every row from the rows inside its own window.
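The moving-average idea might look as follows. This sketch assumes Python's built-in `sqlite3` module with SQLite 3.25 or later, a hypothetical `prices` table, and exactly one row per day with no gaps: the `ROWS` frame used here counts rows, not calendar days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day TEXT, close REAL)")

# Eight made-up closing prices on consecutive days (illustrative data only).
closes = [10, 12, 14, 16, 18, 20, 22, 24]
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(f"2024-01-{i + 1:02d}", c) for i, c in enumerate(closes)],
)

# For each row, average the current row and up to six preceding rows.
moving_avg = conn.execute(
    """
    SELECT day, close,
           AVG(close) OVER (
               ORDER BY day
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS avg_7d
    FROM prices
    ORDER BY day
    """
).fetchall()

for row in moving_avg:
    print(row)
```

The first six rows average over fewer than seven values because their windows are cut off at the start of the data; whether such partial windows should be reported or filtered out is a design choice for the analysis.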

The Basic Elements of Window Functions

Three components are the basic building blocks of window functions:

1. The Function:

This is the particular computation to perform. Window functions may be aggregates such as `COUNT()`, `SUM()`, or `AVG()`, but they also include ranking functions such as `NTILE()` or `ROW_NUMBER()`.

2. The OVER Clause:

This clause defines the set of rows to which the function is applied. The window can be ordered, partitioned, or unbounded, depending on the analysis being performed. With an empty `OVER ()`, the window is the entire result set.

3. The Partition By and Order By Clauses:

These optional clauses go inside `OVER` and refine the window further. The `PARTITION BY` clause splits the dataset into smaller groups (partitions) before the window function is applied, while the `ORDER BY` clause defines the order in which the rows within each partition are processed.
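The three building blocks above can be seen together in a single query. The sketch below assumes Python's built-in `sqlite3` module with SQLite 3.25 or later; the `orders` table, its columns, and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", "2024-01-01", 20),
        ("ann", "2024-01-03", 30),
        ("bob", "2024-01-02", 10),
        ("ann", "2024-01-05", 50),
        ("bob", "2024-01-04", 40),
    ],
)

running = conn.execute(
    """
    SELECT customer, order_day, amount,
           SUM(amount)                -- 1. the function
           OVER (                     -- 2. the OVER clause
               PARTITION BY customer  -- 3a. one window per customer
               ORDER BY order_day     -- 3b. row order inside each window
           ) AS running_total
    FROM orders
    ORDER BY customer, order_day
    """
).fetchall()

for row in running:
    print(row)
```

Because the window is ordered, `SUM()` here produces a running total per customer: by default the frame runs from the start of the partition up to the current row.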

Types of Window Functions

Several kinds of window functions support data analysis, each designed for a specific purpose.

1. Ranking Functions

Ranking functions assign a rank to every row within a partition. A good example is `ROW_NUMBER()`, which gives every row a unique sequential number, starting from 1 at the first row of each partition. When ties exist, `RANK()` and `DENSE_RANK()` give tied rows the same rank: `RANK()` leaves gaps in the numbering after a tie, while `DENSE_RANK()` does not.
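How the three ranking functions treat a tie is easiest to see side by side. This is a small sketch assuming Python's built-in `sqlite3` module with SQLite 3.25 or later; the `scores` table is hypothetical, and the whole table forms a single partition for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 90), ("b", 90), ("c", 80), ("d", 70)],  # a and b are tied
)

ranked = conn.execute(
    """
    SELECT name, score,
           -- name is added as a tie-breaker so the numbering is deterministic
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()

for row in ranked:
    print(row)
```

Row `c` gets rank 3 from `RANK()` (the tie consumed ranks 1 and 2) but rank 2 from `DENSE_RANK()`.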

2. Aggregate Functions

These functions operate on the rows within a given window and return a result for every row, rather than condensing each group into a single row as ordinary aggregates with `GROUP BY` do. Used as window functions, `SUM()`, `AVG()`, `MIN()`, and `MAX()` provide the total, average, and extreme values for each partition.
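A compact way to apply several aggregates over the same partition is a named window. The sketch below assumes Python's built-in `sqlite3` module with SQLite 3.25 or later; the `employees` table and the window name `w` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("ann", "eng", 100), ("bob", "eng", 200), ("cat", "ops", 50)],
)

# The WINDOW clause defines the partition once; all four aggregates reuse it.
per_dept = conn.execute(
    """
    SELECT name, dept, salary,
           SUM(salary) OVER w AS dept_total,
           AVG(salary) OVER w AS dept_avg,
           MIN(salary) OVER w AS dept_min,
           MAX(salary) OVER w AS dept_max
    FROM employees
    WINDOW w AS (PARTITION BY dept)
    ORDER BY name
    """
).fetchall()

for row in per_dept:
    print(row)
```

Every employee row is preserved, and each one carries the statistics of its own department.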

3. Analytic Functions

Analytic functions are for cases where you need values that involve no aggregation but still depend on a “window” of rows. Examples are `LEAD()` and `LAG()`, which give access to following or preceding rows of the result set, and `FIRST_VALUE()` and `LAST_VALUE()`, which return the first and last value in a window, respectively.
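The four functions named above can be tried on a tiny series. This sketch assumes Python's built-in `sqlite3` module with SQLite 3.25 or later and a hypothetical `readings` table. One detail worth noting: with an `ORDER BY` in the window, the default frame ends at the current row, so `LAST_VALUE()` needs an explicit frame reaching to the end of the partition to return the true last value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day INTEGER, value INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [(1, 5), (2, 8), (3, 6)])

neighbours = conn.execute(
    """
    SELECT day, value,
           LAG(value)         OVER (ORDER BY day) AS prev_val,
           LEAD(value)        OVER (ORDER BY day) AS next_val,
           FIRST_VALUE(value) OVER (ORDER BY day) AS first_val,
           LAST_VALUE(value)  OVER (
               ORDER BY day
               -- widen the frame to the whole partition
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_val
    FROM readings
    ORDER BY day
    """
).fetchall()

for row in neighbours:
    print(row)
```

`LAG()` on the first row and `LEAD()` on the last row have no neighbour to read, so they return NULL (`None` in Python) unless a default is supplied.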

4. NTILE()

The `NTILE()` function splits the rows into a given number of buckets, or tiles, whose sizes are as close to equal as possible. It is helpful for generating quartiles or other percentile groups, which are commonly used in data analytics to examine the distribution of data.
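As an illustration of uneven bucket sizes, the sketch below splits ten rows into quartiles. It assumes Python's built-in `sqlite3` module with SQLite 3.25 or later; the `payments` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (amount INTEGER)")
# Ten made-up amounts: 10, 20, ..., 100.
conn.executemany(
    "INSERT INTO payments VALUES (?)", [(a,) for a in range(10, 101, 10)]
)

tiles = conn.execute(
    """
    SELECT amount,
           NTILE(4) OVER (ORDER BY amount) AS quartile
    FROM payments
    ORDER BY amount
    """
).fetchall()

for row in tiles:
    print(row)
```

Ten rows do not divide evenly into four buckets, so the first two buckets receive three rows each and the last two receive two rows each.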

Why Are Window Functions Important for Data Analytics?

Analysts often reach for window functions because they do not require collapsing the data set into aggregates. The ability to compute, for example, moving averages or ranks of rows based on certain conditions without aggregating the detail away is very powerful in terms of the insight it gives.

1. More Flexible Analysis

Instead of losing context about a particular dimension, analysts can perform multiple calculations against the same dimension with window functions. For example, you can add a column that contains total sales on every row without losing the sales amount of each individual transaction.
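One way to picture several calculations living side by side on each transaction row is a share-of-total report. This is a sketch under stated assumptions: Python's built-in `sqlite3` module with SQLite 3.25 or later, and a hypothetical `sales` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "pen", 20), (2, "pen", 30), (3, "ink", 50)],
)

shares = conn.execute(
    """
    SELECT id, product, amount,
           SUM(amount) OVER () AS grand_total,
           -- 100.0 forces floating-point division
           ROUND(100.0 * amount / SUM(amount) OVER (), 1) AS pct_of_all,
           ROUND(100.0 * amount / SUM(amount) OVER (PARTITION BY product), 1)
               AS pct_of_product
    FROM sales
    ORDER BY id
    """
).fetchall()

for row in shares:
    print(row)
```

Each transaction keeps its own amount while also showing the grand total and its share of both the overall and the per-product sales.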

2. Efficient Calculations

Window functions remove the need to compute aggregate summaries separately and join them back to the detail rows; the calculation is performed directly over the result set. This is often more efficient than the equivalent self-joins or correlated subqueries, especially when large data sets are involved.

3. Improved Reporting

Window functions are extremely useful when reporting is the focus. Whether you are calculating total sales to date, ranking employees, or performing a similar analysis, building the report tends to be simpler and more efficient with window functions.

4. Data Analytics Enhancement

Calculating percentiles, deriving moving averages, and identifying patterns in time series data all become easier with window functions. They let data analysts extract more insight from the data without writing numerous subqueries or creating additional tables.

Scenarios for the Window Function

There are numerous scenarios where window functions can be applied in data analytics. For instance, in a sales analysis you may want a moving average of each item’s sales over a time frame. Window functions in SQL make this possible: the calculation is performed for each row while all the other information in that row is retained.

Another popular scenario is ranking. If you want to rank employees within different departments based on some metric, you can apply ranking window functions to the same table to assign each employee a rank within their department while still allowing comparison across departments.
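The ranking scenario might be sketched as follows, assuming Python's built-in `sqlite3` module with SQLite 3.25 or later. The `staff` table, its `metric` column, and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, metric INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("ann", "eng", 90), ("bob", "eng", 70), ("cat", "ops", 60), ("dan", "ops", 80)],
)

# Rank within each department and across the whole company in one pass.
ranks = conn.execute(
    """
    SELECT name, dept, metric,
           RANK() OVER (PARTITION BY dept ORDER BY metric DESC) AS dept_rank,
           RANK() OVER (ORDER BY metric DESC)                   AS overall_rank
    FROM staff
    ORDER BY dept, dept_rank
    """
).fetchall()

# Window results cannot be filtered in the same WHERE clause,
# so the top performer per department is selected from a subquery.
top_per_dept = conn.execute(
    """
    SELECT dept, name, metric
    FROM (
        SELECT dept, name, metric,
               RANK() OVER (PARTITION BY dept ORDER BY metric DESC) AS dept_rank
        FROM staff
    ) AS ranked
    WHERE dept_rank = 1
    ORDER BY dept
    """
).fetchall()

print(ranks)
print(top_per_dept)
```

Window functions are evaluated after `WHERE`, which is why the filter on `dept_rank` has to sit in an outer query.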

Window functions also come in handy when analysing time series data. For example, while keeping the monthly rows intact, you can sum the months’ sales into a yearly total, or you can calculate the day-on-day change in sales with the `LAG()` or `LEAD()` functions.
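Both time series ideas can be combined in one query. The sketch below assumes Python's built-in `sqlite3` module with SQLite 3.25 or later and a hypothetical `monthly_sales` table keyed by `YYYY-MM` strings; it computes a period-on-period change per month, and the same `LAG()` pattern applies to daily rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2023-11", 100), ("2023-12", 120), ("2024-01", 90), ("2024-02", 110)],
)

trend = conn.execute(
    """
    SELECT month, sales,
           -- yearly total: partition on the first four characters (the year)
           SUM(sales) OVER (PARTITION BY substr(month, 1, 4)) AS year_total,
           -- change versus the previous row in chronological order
           sales - LAG(sales) OVER (ORDER BY month) AS change_vs_prev
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

for row in trend:
    print(row)
```

The first row has no predecessor, so its change is NULL (`None` in Python); every monthly row remains in the output alongside its year's total.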