Over the years, I’ve written a lot of SQL, and in that time I’ve learned some tricks for creating better SQL: SQL that is easier to maintain, and faster.
With all the SQL I had to write, I made an attempt to learn how to do it better. I tried to find articles on the subject. However, a lot of those dealt with the technical nature of the server itself, and could be summarized by saying “Turn on statistics and regenerate them often.” Their examples also dealt with simple queries. I found very few that just taught you how to write efficient SQL for complicated queries.
So, I said the old cliché: “I could do this better” and here we are.
I will preface this article with the warning that my background does not come from knowing the innards of the SQL servers. Most of my knowledge comes from trying different techniques and seeing what happens. And when you see a small change cut the running time to a fraction of what it was, you tend to think that was a good idea. I have accepted the MS SQL optimizer into my life.
A lot of the examples here are for very specific circumstances. In actual cases like them you probably wouldn’t care about efficiency because you don’t intend to run the query multiple times. The difference between a query taking half a second to run versus 30 seconds is minimal when you only run it once. But these contrived examples should give you useful information for queries that you actually will run lots of times.
I have never been comfortable with SQL that is generated by a tool. Maybe it is good for small queries, but as soon as things become complicated, the SQL that is created is unintuitive to follow. If someone has to modify it, or debug it, they may spend a lot of time just trying to figure out what the original intent of the SQL was.
Spend the extra time to create the SQL by hand. You will have a chance to put in some comments, or at least name things appropriately. The next person to maintain it (which will probably be you) will thank you for it.
GROUP BY When Dealing With Duplicates
When duplicate rows show up in a query's results, there is a temptation to add a GROUP BY to cause those duplicates to be merged together. The problem with that approach is that it is hard to maintain, and it is not always clear what is intended.
For example, assume we have a retail store. We have discovered that the granite rocks we’ve been selling are actually weapons-grade plutonium. We now need to find out everyone who has ever purchased our granite rock so we can send the nice people from the military to explain the mistake to them.
SELECT tInvoices.Name
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
When we run that query, we discover that Emmett Brown has purchased the granite rocks on two separate occasions. He’s going to appear twice in the list. Since we don’t want to waste the military’s time, we might do something like this:
SELECT tInvoices.Name
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
GROUP BY tInvoices.Name
It works, but what happens if the military would also like the addresses of all of these potential troublemakers? We add the additional fields to the SELECT of the query, but now we also need to add each of those fields to the GROUP BY clause.
Really, we don’t care about the grouping, and it is disingenuous of us to pretend we do. Instead, we should use the DISTINCT keyword in the query.
SELECT DISTINCT tInvoices.Name, tInvoices.Street, tInvoices.City, tInvoices.State, tInvoices.MailingCode
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
That seems painfully straightforward, but it still needed saying. It makes clear what we are trying to do, and it will also work better with aggregate functions.
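To see the difference concretely, here is a minimal sketch of the scenario above, run through SQLite from Python so it is self-contained. The table names and data are illustrative assumptions, not the article's actual database.

```python
import sqlite3

# Hypothetical mini-schema mirroring the article's tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tProducts (ProductId INTEGER, ProductName TEXT);
CREATE TABLE tInvoices (InvoiceId INTEGER, Name TEXT);
CREATE TABLE tInvoiceItems (InvoiceId INTEGER, ProductId INTEGER);
INSERT INTO tProducts VALUES (1, 'granite rock');
-- Emmett Brown bought granite rocks on two separate invoices.
INSERT INTO tInvoices VALUES (10, 'Emmett Brown'), (11, 'Emmett Brown'), (12, 'Marty McFly');
INSERT INTO tInvoiceItems VALUES (10, 1), (11, 1), (12, 1);
""")

joins = """
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
"""

plain = con.execute("SELECT tInvoices.Name " + joins).fetchall()
distinct = con.execute("SELECT DISTINCT tInvoices.Name " + joins).fetchall()

print(len(plain))     # 3 rows: Emmett Brown appears twice
print(len(distinct))  # 2 rows: one per customer
```

The same two-row result could be produced with GROUP BY tInvoices.Name, but DISTINCT states the intent directly.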
There is a tendency to try and write your SQL queries with one SELECT and as many JOINs as you need to finish the job. While this may work, it is very easy for errors to creep in.
In the previous example, we had a VERY simple database. If it was more realistic, we would have several more tables between the granite rock and the customer. This is more likely what would have happened:
SELECT DISTINCT tCustomer.LastName, tCustomer.FirstName, tAddress.Street, tAddress.City, tAddress.State, tAddress.MailingCode
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
LEFT JOIN tCustomer ON (tCustomer.CustomerId = tInvoices.CustomerId)
LEFT JOIN tAddress ON (tAddress.AddressId = tCustomer.AddressId)
WHERE tProducts.ProductName = 'granite rock'
This works, but it could be better. The DISTINCT has a lot of unnecessary work to do; it is still comparing all the fields. Maybe the optimizer will take care of that for us, but the optimizer doesn’t help the human who is trying to figure out what is going on. The customer table can easily do some of the processing better.
Instead, we should have the invoices return a list of DISTINCT customer ids, which we will refer to as tBadCustomerIds. We can then JOIN that to the actual customer table and get the results nicely, clearly, and in a way that is easier to follow:
SELECT tCustomer.LastName, tCustomer.FirstName, tAddress.Street, tAddress.City, tAddress.State, tAddress.MailingCode
FROM (
    SELECT DISTINCT tInvoices.CustomerId
    FROM tProducts
    JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
    JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
    WHERE tProducts.ProductName = 'granite rock'
) tBadCustomerIds
LEFT JOIN tCustomer ON (tCustomer.CustomerId = tBadCustomerIds.CustomerId)
LEFT JOIN tAddress ON (tAddress.AddressId = tCustomer.AddressId)
That shows a way to make the SQL more efficient, and encapsulate different thoughts into different sections, but it is also a good way to prevent errors.
WITH clause makes subqueries clearer
Note: Unfortunately, as of this writing, MySQL does not support the WITH clause, which is a shame.
The WITH clause is a great tool to better organize your thoughts. It is commonly used to reuse the same SELECT multiple times, but I’ve found it useful even when it is only used once.
The WITH clause lets you create a subquery outside of the SELECT in which it will be used. In our previous example, we could have written our query like this:

WITH tBadCustomerIds (CustomerId) AS (
    SELECT DISTINCT tInvoices.CustomerId
    FROM tProducts
    JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
    JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
    WHERE tProducts.ProductName = 'granite rock'
)
SELECT tCustomer.LastName, tCustomer.FirstName, tAddress.Street, tAddress.City, tAddress.State, tAddress.MailingCode
FROM tBadCustomerIds
LEFT JOIN tCustomer ON (tCustomer.CustomerId = tBadCustomerIds.CustomerId)
LEFT JOIN tAddress ON (tAddress.AddressId = tCustomer.AddressId)

It does the exact same thing, but now it is clear what is actually happening. We have named our subquery tBadCustomerIds, which, in and of itself, better documents the process. This makes the code easier to maintain in the future.
LEFT JOIN by default

When I write a SELECT statement that JOINs with a number of tables, I always default to using LEFT JOIN. It is the easiest JOIN to wrap your head around. I can summarize it as this: keep everything you’ve written so far, and add in this new table as well.
It also appears to be the most efficient. An INNER JOIN will filter out parts on the left, and may have to go back through tables it has already gone through, whereas a LEFT JOIN basically says don't worry about it and just find the things on the right that we are interested in. I've seen the processing time for SQL drop from 90 seconds to five seconds with a change of a few JOINs to LEFT JOINs.
I do not know if this efficiency difference remains when the JOIN in question is sufficiently defined with foreign keys. In that case, the two JOINs may perform identically. I have not spent time to check, because the LEFT JOIN is working well for me.
JOIN may cause rows to disappear

An INNER JOIN (i.e. the standard JOIN) can make entire rows disappear, especially in complicated queries. If you are JOINing on a field that happens to be null in a particular row, that guarantees that the row will disappear. In a complicated query, you may not even be sure which JOIN is causing that behaviour. I've most often seen this happen when using an auto-generated query that may not have had all of the foreign keys set up perfectly. It is better to use a LEFT JOIN to begin with, and then pare down to the appropriate INNER JOIN once you are sure that the field in question is a non-nullable foreign key.
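Here is a small, self-contained illustration of that disappearing-row behaviour, using SQLite from Python. The table names and data are hypothetical; the point is the NULL join key.

```python
import sqlite3

# One customer has a NULL AddressId, which is exactly the situation
# where an INNER JOIN silently drops the row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tCustomer (CustomerId INTEGER, Name TEXT, AddressId INTEGER);
CREATE TABLE tAddress (AddressId INTEGER, City TEXT);
INSERT INTO tCustomer VALUES (1, 'Emmett Brown', 100), (2, 'Marty McFly', NULL);
INSERT INTO tAddress VALUES (100, 'Hill Valley');
""")

inner = con.execute("""
    SELECT c.Name, a.City
    FROM tCustomer c
    JOIN tAddress a ON (a.AddressId = c.AddressId)
""").fetchall()

left = con.execute("""
    SELECT c.Name, a.City
    FROM tCustomer c
    LEFT JOIN tAddress a ON (a.AddressId = c.AddressId)
""").fetchall()

print(inner)  # only Emmett Brown: Marty McFly has vanished
print(left)   # both rows survive; Marty McFly's City is None
```

In a two-table query the missing row is easy to spot; buried in a ten-table query, it is not.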
JOINs don’t play well with aggregate functions

When you are using an aggregate function, be sure that you are only joining the tables you explicitly need for that aggregate. It is an easy temptation to keep JOINing tables as you need them, without giving a care to an aggregate function you may use later. In those cases, it is better to use a subquery so that you only JOIN with the proper tables; the subquery also makes it more clear what you are trying to do.
Note: If you are using the OVER clause, this may not be relevant; but if you don’t regularly use the OVER clause, it is too easy to forget how to use it.
If you GROUP BY a number of fields, the aggregate function will work within each combination of those fields. For example, if you have a complicated SELECT where you are trying to group a number of entries into a report, you might have something like this:
SELECT h.CreatedBy, h.EmployeeId, h.CreatedDate, SUM(h.Amount)
FROM tChequeHistory h
GROUP BY h.CreatedBy, h.EmployeeId, h.CreatedDate
This will list all of the cheque history entries by the person who created them, then by the employee, and then by creation date, and give a sum. So you will be able to see the amount each administrative staff member paid to each employee on each date, totalling any multiple entries for a single date.
But what if we also want to know the number of employees that each staff member interacted with? We can’t easily add in a COUNT(DISTINCT h.EmployeeId) because the CreatedDate in the GROUP BY will cause it to return the count for a particular date. In this case, we need to separate out the parts, and a WITH clause will help here.
WITH tRelevantHistory (CreatedBy, EmployeeId, CreatedDate, Amount) AS (
    SELECT h.CreatedBy, h.EmployeeId, h.CreatedDate, SUM(h.Amount)
    FROM tChequeHistory h
    GROUP BY h.CreatedBy, h.EmployeeId, h.CreatedDate
),
tCreatedCount (CreatedBy, EmployeeCount) AS (
    SELECT h.CreatedBy, COUNT(DISTINCT h.EmployeeId)
    FROM tRelevantHistory h
    GROUP BY h.CreatedBy
)
SELECT *
FROM tRelevantHistory h
LEFT JOIN tCreatedCount cnt ON (cnt.CreatedBy = h.CreatedBy)
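The two-step pattern above can be run end to end against SQLite from Python with made-up data; the schema and values here are hypothetical, but the CTE structure is the same.

```python
import sqlite3

# Made-up cheque history: two entries for the same employee on the same date,
# so we can see the inner GROUP BY sum them before the count runs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tChequeHistory (CreatedBy TEXT, EmployeeId INTEGER, CreatedDate TEXT, Amount REAL);
INSERT INTO tChequeHistory VALUES
  ('alice', 1, '2016-07-01', 100.0),
  ('alice', 1, '2016-07-01', 50.0),
  ('alice', 2, '2016-07-02', 75.0),
  ('bob',   1, '2016-07-01', 20.0);
""")

rows = con.execute("""
WITH tRelevantHistory (CreatedBy, EmployeeId, CreatedDate, Amount) AS (
    SELECT CreatedBy, EmployeeId, CreatedDate, SUM(Amount)
    FROM tChequeHistory
    GROUP BY CreatedBy, EmployeeId, CreatedDate
),
tCreatedCount (CreatedBy, EmployeeCount) AS (
    SELECT CreatedBy, COUNT(DISTINCT EmployeeId)
    FROM tRelevantHistory
    GROUP BY CreatedBy
)
SELECT h.CreatedBy, h.EmployeeId, h.CreatedDate, h.Amount, cnt.EmployeeCount
FROM tRelevantHistory h
LEFT JOIN tCreatedCount cnt ON (cnt.CreatedBy = h.CreatedBy)
ORDER BY h.CreatedBy, h.EmployeeId
""").fetchall()

for row in rows:
    print(row)
# alice's first row shows Amount 150.0 (two same-date entries summed)
# and EmployeeCount 2; bob's row shows EmployeeCount 1.
```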
If you have a function that returns a value for a single parameter, and you apply it to every row in a table, it slows down a lot. You are essentially calling the function from scratch for every row in the table.
Let’s say you have a FUNCTION that runs a semi-complicated algorithm on a single employee to get a status about that person on a particular date. For example, it returns the id for a table indicating whether they are sick, on vacation, working, dead, or any of a variety of states they could be in.
You would probably want to use that function to get the state of every employee on a particular date. However, that will be very inefficient. It will call that function for every employee, with no chance to take advantage of efficiencies that might be possible over the larger set.
In these cases, you may need to write two functions:
CREATE FUNCTION EmployeeStatusIdOnDate (@EmployeeId int, @OnDate Date) RETURNS int
CREATE FUNCTION EmployeesStatusIdOnDate (@OnDate Date)
RETURNS @tEmployeeStatus TABLE (EmployeeId int, StatusId int)
The first will get the status for one employee, the second for all employees.
Unfortunately, it often isn’t easy to get all of the statuses efficiently. When you are dealing with a single member, you can use the aggregate functions to get information about one employee quickly. However, those same aggregate functions become effectively useless if they are used over all employees. And the TOP clause becomes laughably inaccurate.
To demonstrate, here are two separate functions for doing the same thing, one for one employee, another for all employees. The calculation is for getting the status of an employee on a particular date. Complicating things is that we have to go through journals that can be cancelled, and they are sorted on two fields, EffectiveDate and TransactionDate.
We will start with the function optimized for one employee.
CREATE FUNCTION EmployeeStatusIdOnDate (@EmployeeId int, @OnDate Date)
RETURNS int
AS
BEGIN
    DECLARE @Result int;

    SELECT TOP 1 @Result = NewStatusId
    FROM tEmployeeStatusJournal j
    WHERE j.EmployeeId = @EmployeeId
      AND j.NewStatusId is not null
      AND j.EffectiveDate <= @OnDate
      AND j.Cancelled = 0
    ORDER BY j.EffectiveDate DESC, j.TransactionDate DESC

    RETURN COALESCE(@Result,
        (SELECT StatusId FROM tEmployeeStatuses WHERE StatusValue = 'Good'));
END
GO
This is the function we can use when we want to run the same operation on all employees. It is not pretty because we have two fields we are sorting on, so we have to do the complicated SQL dance to get the entries in the proper order.
CREATE FUNCTION EmployeesStatusIdOnDate (@OnDate Date)
RETURNS @tEmployeeStatus TABLE (EmployeeId int, StatusId int)
AS
BEGIN
    WITH StatusJournals (EmployeeId, NewStatusId, EffectiveDate, TransactionDate) AS (
        SELECT j.EmployeeId, j.NewStatusId, j.EffectiveDate, j.TransactionDate
        FROM tEmployeeStatusJournal j
        WHERE j.NewStatusId is not null
          AND j.EffectiveDate <= @OnDate
          AND j.Cancelled = 0
    ),
    MaxEffectiveDateStatus AS (
        SELECT j.EmployeeId, j.NewStatusId
        FROM StatusJournals j
        JOIN (
            -- Get the maximum transaction date for the maximum effective date
            SELECT two.EmployeeId, two.EffectiveDate,
                   MAX(two.TransactionDate) AS MaxTransactionDate
            FROM StatusJournals two
            JOIN
                -- Join with a table of employees and maximum effective dates
                (SELECT EmployeeId, MAX(EffectiveDate) AS MaxEffectiveDate
                 FROM StatusJournals
                 GROUP BY EmployeeId) one
              ON (one.EmployeeId = two.EmployeeId
                  AND one.MaxEffectiveDate = two.EffectiveDate)
            GROUP BY two.EmployeeId, two.EffectiveDate
        ) latestStatusEntry
          ON (j.EmployeeId = latestStatusEntry.EmployeeId
              AND latestStatusEntry.EffectiveDate = j.EffectiveDate
              AND latestStatusEntry.MaxTransactionDate = j.TransactionDate)
    )
    INSERT @tEmployeeStatus
    SELECT emp.EmployeeId,
           COALESCE(maxStat.NewStatusId,
               (SELECT StatusId FROM tEmployeeStatuses WHERE StatusValue = 'Good')) AS StatusId
    FROM tEmployee emp
    LEFT JOIN MaxEffectiveDateStatus maxStat ON (maxStat.EmployeeId = emp.EmployeeId)

    RETURN
END
But, now that we have the complicated table function, we can use it when we need to make a report of every employee's status on a particular day.
SELECT emp.Name, s.StatusValue AS CurrentStatus
FROM tEmployee emp
LEFT JOIN dbo.EmployeesStatusIdOnDate(GETDATE()) stats ON (stats.EmployeeId = emp.EmployeeId)
LEFT JOIN tEmployeeStatuses s ON (s.StatusId = stats.StatusId)
Cursors are inherently slower than the set-based approach that SQL encourages. However, there are some cases where it is actually faster to use a cursor. In those cases, I would encourage you to use a DO-WHILE style loop instead of a WHILE @@FETCH_STATUS = 0 loop.
The way a typical CURSOR is implemented is similar to this:
DECLARE myCursor CURSOR FOR
    SELECT ...
OPEN myCursor

FETCH NEXT FROM myCursor INTO @var1, @var2, @var3
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Do Stuff
    FETCH NEXT FROM myCursor INTO @var1, @var2, @var3
END

CLOSE myCursor;
DEALLOCATE myCursor;
While that works, it has a fundamental problem: you have code replication instead of code reuse. The FETCH NEXT FROM myCursor INTO @var1, @var2, @var3 line appears twice. If you change the SELECT statement used by the cursor, you will have to update both instances of FETCH NEXT with the new variable list. We are human; it is too easy to make a mistake and forget to correct both places. This is my preferred method of writing a cursor loop:
DECLARE myCursor CURSOR FOR
    SELECT ...
OPEN myCursor

WHILE (1=1)
BEGIN
    FETCH NEXT FROM myCursor INTO @var1, @var2, @var3
    IF (@@FETCH_STATUS <> 0) BREAK;
    -- Do Stuff
END

CLOSE myCursor;
DEALLOCATE myCursor;
A UNION and a UNION ALL are remarkably different.
The former will, secretly, drop duplicates. The latter will keep the duplicates. This has bitten me in the past, so I leave this as a warning to others.
If you are totalling up the hours worked by employees in two different tables, a UNION can make identical entries disappear and undercount the total. Be aware of the side-effects of the version you are using.
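The undercounting is easy to reproduce; here is a sketch using SQLite from Python, with two hypothetical hours tables where one row happens to be identical in both.

```python
import sqlite3

# Two made-up hours tables; the row (1, 8.0) appears in both.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tHoursA (EmployeeId INTEGER, Hours REAL);
CREATE TABLE tHoursB (EmployeeId INTEGER, Hours REAL);
INSERT INTO tHoursA VALUES (1, 8.0), (2, 7.5);
INSERT INTO tHoursB VALUES (1, 8.0), (3, 6.0);
""")

union_total = con.execute("""
    SELECT SUM(Hours) FROM (
        SELECT EmployeeId, Hours FROM tHoursA
        UNION
        SELECT EmployeeId, Hours FROM tHoursB
    ) t
""").fetchone()[0]

union_all_total = con.execute("""
    SELECT SUM(Hours) FROM (
        SELECT EmployeeId, Hours FROM tHoursA
        UNION ALL
        SELECT EmployeeId, Hours FROM tHoursB
    ) t
""").fetchone()[0]

print(union_total)      # 21.5: the duplicate (1, 8.0) row was silently dropped
print(union_all_total)  # 29.5: all four rows counted
```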
The syntax highlighting for the SQL here was done initially with tohtml.com
2016-7-4: Initial writing