Table of contents
Why window functions may come in handy
Let us dive directly into a real world example. You’ve got a table with salaries of different job functions and you are asked to find out what the n-tiles (quantiles, mean etc.) are of each function within a company. So this means you’ve got many people that hold a position which you want to compare with others. You want to compare salaries of marketing professionals to their peers, but you don’t want to compare said salaries to marketing directors because they have their own salary bands. So that means you need to frame (also called partition) job titles with their own with which you calculate the n-tiles. Without window functions, this could be a quite fussy task.
In order to walk through the problem, create a new database, I’m going to name mine FoehnWind [^1]. The following script creates a new table and inserts some sample data of 20 rows into it.
Tutorial
USE [FoehnWind];
GO
CREATE TABLE dbo.Employee
(ID TINYINT IDENTITY(1, 1) NOT NULL,
FirstName VARCHAR(150) NOT NULL,
LastName VARCHAR(150) NULL,
JobTitle VARCHAR(100) NOT NULL,
AnnualSalary INT NOT NULL,
CONSTRAINT [PK_dbo.Employee] PRIMARY KEY CLUSTERED(ID ASC)
)
ON [PRIMARY];
GO
INSERT INTO [dbo].[Employee]
VALUES
( 'Ken','Sanchez','Chief Executive Officer', 530000)
,( 'Jo','Berry','Marketing Assistant', 71875 )
,( 'Lori','Penor','Marketing Assistant', 73398 )
,( 'Pat','Coleman','Marketing Assistant', 51874 )
,( 'Stuart','Macrae','Marketing Assistant', 59108 )
,( 'Mary','Dempsey','Marketing Assistant', 57641 )
,( 'Wanida','Benshoof','Marketing Assistant', 60779 )
,( 'David','Bradley','Marketing Manager', 130000 )
,( 'Jill','Williams','Marketing Specialist', 108117 )
,( 'John','Wood','Marketing Specialist', 99407 )
,( 'Sariya','Pit','Marketing Specialist', 95048 )
,( 'Terry','Eminhizer','Marketing Specialist', 106316 )
,( 'Michael','Sullivan','Marketing Specialist', 87047 )
,( 'Ovidiu','Cracium','Marketing Specialist', 107883 )
,( 'Rob','Walters','Marketing Specialist', 109369 )
,( 'Garrett','Young','Senior Marketing Specialist', 118051 )
,( 'George','Li','Senior Marketing Specialist', 117253 )
,( 'Russell','King','Senior Marketing Specialist', 123362 )
,( 'Sandra','Alayo','Senior Marketing Specialist', 118185 )
,( 'Denise','Smith','Vice President Marketing', 250000 );
GO
Firstly, lets use a window function which calculates the averages and percentiles of salaries for each job title:
SELECT distinct e.JobTitle
,AVG(e.AnnualSalary) OVER (PARTITION BY e.JobTitle) AS AverageSalary
,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY e.AnnualSalary)
OVER (PARTITION BY e.JobTitle) as MedianContinous
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY e.AnnualSalary)
OVER (PARTITION BY e.JobTitle) as MedianDisContinous
FROM dbo.Employee e
order by e.JobTitle
You perhaps now wondered how the functions PERCENTILE_CONT()
and PERCENTILE_DISC()
differ. The discontinous variant returns an actual value that exists in the calculated set and pointing out that this values is the closest one being to the calculative median. The continous percentile in contrast returns a value that is actually the calculated median. For example, Marketing Assistant Stuart Macrae has an Annual Salary of 59’108. This amount is also the median in this job role, but the calculated median is 59’943.5 so Stuart is slightly below the median but his salary is the closest one to the median.
JobTitle | EmployeeName | AnnualSalary | AverageSalary | MedianContinous | MedianDisContinous |
Chief Executive Officer | Ken Sanchez | 530000 | 530000 | 530000 | 530000 |
Marketing Assistant | Pat Coleman | 51874 | 62445 | 59943.5 | 59108 |
Marketing Assistant | Mary Dempsey | 57641 | 62445 | 59943.5 | 59108 |
Marketing Assistant | Stuart Macrae | 59108 | 62445 | 59943.5 | 59108 |
Marketing Assistant | Wanida Benshoof | 60779 | 62445 | 59943.5 | 59108 |
Marketing Assistant | Jo Berry | 71875 | 62445 | 59943.5 | 59108 |
Marketing Assistant | Lori Penor | 73398 | 62445 | 59943.5 | 59108 |
Marketing Manager | David Bradley | 130000 | 130000 | 130000 | 130000 |
Marketing Specialist | Michael Sullivan | 87047 | 101883 | 106316 | 106316 |
Marketing Specialist | Sariya Pit | 95048 | 101883 | 106316 | 106316 |
Marketing Specialist | John Wood | 99407 | 101883 | 106316 | 106316 |
Marketing Specialist | Terry Eminhizer | 106316 | 101883 | 106316 | 106316 |
Marketing Specialist | Ovidiu Cracium | 107883 | 101883 | 106316 | 106316 |
Marketing Specialist | Jill Williams | 108117 | 101883 | 106316 | 106316 |
Marketing Specialist | Rob Walters | 109369 | 101883 | 106316 | 106316 |
Senior Marketing Specialist | George Li | 117253 | 119212 | 118118 | 118051 |
Senior Marketing Specialist | Garrett Young | 118051 | 119212 | 118118 | 118051 |
Senior Marketing Specialist | Sandra Alayo | 118185 | 119212 | 118118 | 118051 |
Senior Marketing Specialist | Russell King | 123362 | 119212 | 118118 | 118051 |
Vice President Marketing | Denise Smith | 250000 | 250000 | 250000 | 250000 |
Further reading
This was just an example of what window functions are handy for. More window functions are documented here. Ranked functions are used quite often to find out what happened over time, e.g. if a status of a product changed. If you want to dig even deeper into this topic, I recommend (no affiliation) the book T-SQL Window Functions: For data analysis and beyond (2nd Edition) (Developer Reference) by Itzik Ben-Gan. He's authored some excellent books with great knowledge.
[^1]: Fohen wind is a fairly common wind in the European Alps, which is a reference to the popular sample database Nordwind.
Cover Photo Credit: msandersmusic, on Pixabay