Over the years, I’ve written a lot of SQL, and in that time I’ve learned some tricks for creating better SQL: SQL that is easier to maintain, and faster.
With all the SQL I had to write, I made an attempt to learn how to do it better. I tried to find articles on the subject. However, a lot of those dealt with the technical nature of the server itself, and could be summarized by saying “Turn on statistics and regenerate them often.” Their examples also dealt with simple queries. I found very few that just taught you how to write efficient SQL for complicated queries.
So, I said the old cliché: “I could do this better” and here we are.
I will preface this article with the warning that my background does not come from knowing the innards of the SQL servers. Most of my knowledge comes from trying different techniques and seeing what happens. And when you see a small change cut the running time to a fraction of what it was, you tend to think that was a good idea. I have accepted the MS SQL optimizer into my life.
A lot of the examples here are for very specific circumstances. In actual cases like them you probably wouldn’t care about efficiency because you don’t intend to run the query multiple times. The difference between a query taking half a second to run versus 30 seconds is minimal when you only run it once. But these contrived examples should give you useful information for queries that you actually will run lots of times.
I have never been comfortable with SQL that is generated by a tool. Maybe it is good for small queries, but as soon as things become complicated, the SQL that is created is unintuitive to follow. If someone has to modify it, or debug it, they may spend a lot of time just trying to figure out what the original intent of the SQL was.
Spend the extra time to create the SQL by hand. You will have a chance to put in some comments, or at least name things appropriately. The next person to maintain it (which will probably be you) will thank you for it.
GROUP BY When Dealing With Duplicates
When duplicate rows show up in a query's results, there is a temptation to add a GROUP BY to cause those duplicates to be merged together. The problem with that approach is that it is hard to maintain, and it is not always clear what is intended.
For example, assume we have a retail store. We have discovered that the granite rocks we’ve been selling are actually weapons-grade plutonium. We now need to find out everyone who has ever purchased our granite rock so we can send the nice people from the military to explain the mistake to them.
SELECT tInvoices.Name
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
When we run that query, we discover that Emmett Brown has purchased the granite rocks on two separate occasions. He’s going to appear twice in the list. Since we don’t want to waste the military’s time, we might do something like this:
SELECT tInvoices.Name
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
GROUP BY tInvoices.Name
It works, but what happens if the military would also like the addresses of all of these potential troublemakers? We add the additional fields to the SELECT of the query, but now we also need to add each of those fields to the GROUP BY clause.
Really, we don’t care about the grouping, and it is disingenuous of us to pretend we do. Instead, we should use the DISTINCT keyword in the query.
SELECT DISTINCT tInvoices.Name, tInvoices.Street, tInvoices.City, tInvoices.State, tInvoices.MailingCode
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
That seems painfully straightforward, but it still needed saying. It makes clear what we are trying to do, and it will also work better with aggregate functions.
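To see the difference concretely, here is a minimal sketch of the scenario above, run through SQLite from Python so it is self-contained. The table names and data are illustrative assumptions, not the article's actual database.

```python
import sqlite3

# Hypothetical mini-schema mirroring the article's tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tProducts (ProductId INTEGER, ProductName TEXT);
CREATE TABLE tInvoices (InvoiceId INTEGER, Name TEXT);
CREATE TABLE tInvoiceItems (InvoiceId INTEGER, ProductId INTEGER);
INSERT INTO tProducts VALUES (1, 'granite rock');
-- Emmett Brown bought granite rocks on two separate invoices.
INSERT INTO tInvoices VALUES (10, 'Emmett Brown'), (11, 'Emmett Brown'), (12, 'Marty McFly');
INSERT INTO tInvoiceItems VALUES (10, 1), (11, 1), (12, 1);
""")

joins = """
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
WHERE tProducts.ProductName = 'granite rock'
"""

plain = con.execute("SELECT tInvoices.Name " + joins).fetchall()
distinct = con.execute("SELECT DISTINCT tInvoices.Name " + joins).fetchall()

print(len(plain))     # 3 rows: Emmett Brown appears twice
print(len(distinct))  # 2 rows: one per customer
```

The same two-row result could be produced with GROUP BY tInvoices.Name, but DISTINCT states the intent directly.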
There is a tendency to try and write your SQL queries with one SELECT and as many JOINs as you need to finish the job. While this may work, it is very easy for errors to creep in.
In the previous example, we had a VERY simple database. If it was more realistic, we would have several more tables between the granite rock and the customer. This is more likely what would have happened:
SELECT DISTINCT tCustomer.LastName, tCustomer.FirstName, tAddress.Street, tAddress.City, tAddress.State, tAddress.MailingCode
FROM tProducts
JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
LEFT JOIN tCustomer ON (tCustomer.CustomerId = tInvoices.CustomerId)
LEFT JOIN tAddress ON (tAddress.AddressId = tCustomer.AddressId)
WHERE tProducts.ProductName = 'granite rock'
This works, but it could be better. The DISTINCT has a lot of unnecessary work to do; it is still comparing all the fields. Maybe the optimizer will take care of that for us, but the optimizer doesn’t help the human who is trying to figure out what is going on. The customer table can easily do some of the processing better.
Instead, we should have the invoices return a list of DISTINCT customer ids, which we will refer to as tBadCustomerIds. We can then JOIN that to the actual customer table and get the results nicely, clearly, and in a way that is easier to follow:
SELECT tCustomer.LastName, tCustomer.FirstName, tAddress.Street, tAddress.City, tAddress.State, tAddress.MailingCode
FROM (
    SELECT DISTINCT tInvoices.CustomerId
    FROM tProducts
    JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
    JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
    WHERE tProducts.ProductName = 'granite rock'
) tBadCustomerIds
LEFT JOIN tCustomer ON (tCustomer.CustomerId = tBadCustomerIds.CustomerId)
LEFT JOIN tAddress ON (tAddress.AddressId = tCustomer.AddressId)
That shows a way to make the SQL more efficient, and encapsulate different thoughts into different sections, but it is also a good way to prevent errors.
WITH clause makes subqueries clearer
Note: Unfortunately, as of this writing, MySQL does not support the WITH clause, which is a shame.
The WITH clause is a great tool to better organize your thoughts. It is commonly used to reuse the same SELECT multiple times, but I’ve found it useful even when it is only used once.
The WITH clause lets you create a subquery outside of the SELECT in which it will be used. In our previous example, we could have written our query like this:

WITH tBadCustomerIds (CustomerId) AS (
    SELECT DISTINCT tInvoices.CustomerId
    FROM tProducts
    JOIN tInvoiceItems ON (tInvoiceItems.ProductId = tProducts.ProductId)
    JOIN tInvoices ON (tInvoices.InvoiceId = tInvoiceItems.InvoiceId)
    WHERE tProducts.ProductName = 'granite rock'
)
SELECT tCustomer.LastName, tCustomer.FirstName, tAddress.Street, tAddress.City, tAddress.State, tAddress.MailingCode
FROM tBadCustomerIds
LEFT JOIN tCustomer ON (tCustomer.CustomerId = tBadCustomerIds.CustomerId)
LEFT JOIN tAddress ON (tAddress.AddressId = tCustomer.AddressId)

It does the exact same thing, but now it is clear what is actually happening. We have named our subquery tBadCustomerIds, which, in and of itself, better documents the process. This makes the code easier to maintain in the future.
LEFT JOIN by default

When I write a SELECT statement that JOINs with a number of tables, I always default to using LEFT JOIN. It is the easiest JOIN to wrap your head around. I can summarize it as this: keep everything you’ve written so far, and add in this new table as well.
It also appears to be the most efficient. An INNER JOIN will filter out parts on the left, and may have to go back through tables it has already gone through, whereas a LEFT JOIN basically says don't worry about it and just find the things on the right that we are interested in. I've seen the processing time for SQL drop from 90 seconds to five seconds with a change of a few JOINs to LEFT JOINs.
I do not know if this efficiency difference remains when the JOIN in question is sufficiently defined with foreign keys. In that case, the two JOINs may perform identically. I have not spent time to check, because the LEFT JOIN is working well for me.
JOIN may cause rows to disappear

An INNER JOIN (i.e. the standard JOIN) can make entire rows disappear, especially in complicated queries. If you are JOINing on a field that happens to be null in a particular row, that guarantees that the row will disappear. In a complicated query, you may not even be sure which JOIN is causing that behaviour. I've most often seen this happen when using an auto-generated query that may not have had all of the foreign keys set up perfectly. It is better to use a LEFT JOIN to begin with, and then pare down to the appropriate INNER JOIN once you are sure that the field in question is a non-nullable foreign key.
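Here is a small, self-contained illustration of that disappearing-row behaviour, using SQLite from Python. The table names and data are hypothetical; the point is the NULL join key.

```python
import sqlite3

# One customer has a NULL AddressId, which is exactly the situation
# where an INNER JOIN silently drops the row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tCustomer (CustomerId INTEGER, Name TEXT, AddressId INTEGER);
CREATE TABLE tAddress (AddressId INTEGER, City TEXT);
INSERT INTO tCustomer VALUES (1, 'Emmett Brown', 100), (2, 'Marty McFly', NULL);
INSERT INTO tAddress VALUES (100, 'Hill Valley');
""")

inner = con.execute("""
    SELECT c.Name, a.City
    FROM tCustomer c
    JOIN tAddress a ON (a.AddressId = c.AddressId)
""").fetchall()

left = con.execute("""
    SELECT c.Name, a.City
    FROM tCustomer c
    LEFT JOIN tAddress a ON (a.AddressId = c.AddressId)
""").fetchall()

print(inner)  # only Emmett Brown: Marty McFly has vanished
print(left)   # both rows survive; Marty McFly's City is None
```

In a two-table query the missing row is easy to spot; buried in a ten-table query, it is not.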
JOINs don’t play well with aggregate functions

When you are using an aggregate function, be sure that you are only joining the tables you explicitly need for that aggregate. It is an easy temptation to keep JOINing tables as you need them, without giving a care to an aggregate function you may use later. In those cases, it is better to use a subquery so that you only JOIN with the proper tables; the subquery also makes it more clear what you are trying to do.
Note: If you are using the OVER clause, this may not be relevant; but if you don’t regularly use the OVER clause, it is too easy to forget how to use it.
If you GROUP BY a number of fields, the aggregate function will work within each combination of those fields. For example, if you have a complicated SELECT where you are trying to group a number of entries into a report, you might have something like this:
SELECT h.CreatedBy, h.EmployeeId, h.CreatedDate, SUM(h.Amount)
FROM tChequeHistory h
GROUP BY h.CreatedBy, h.EmployeeId, h.CreatedDate
This will list all of the cheque history entries by the person who created them, then by the employee, and then by creation date, and give a sum. So you will be able to see the amount each administrative staff member paid to each employee on each date, totalling any multiple entries for a single date.
But what if we also want to know the number of employees that each staff member interacted with? We can’t easily add in a COUNT(DISTINCT h.EmployeeId) because the CreatedDate in the GROUP BY will cause it to return the count for a particular date. In this case, we need to separate out the parts, and a WITH clause will help here.
WITH tRelevantHistory (CreatedBy, EmployeeId, CreatedDate, Amount) AS (
    SELECT h.CreatedBy, h.EmployeeId, h.CreatedDate, SUM(h.Amount)
    FROM tChequeHistory h
    GROUP BY h.CreatedBy, h.EmployeeId, h.CreatedDate
),
tCreatedCount (CreatedBy, EmployeeCount) AS (
    SELECT h.CreatedBy, COUNT(DISTINCT h.EmployeeId)
    FROM tRelevantHistory h
    GROUP BY h.CreatedBy
)
SELECT *
FROM tRelevantHistory h
LEFT JOIN tCreatedCount cnt ON (cnt.CreatedBy = h.CreatedBy)
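The two-step pattern above can be run end to end against SQLite from Python with made-up data; the schema and values here are hypothetical, but the CTE structure is the same.

```python
import sqlite3

# Made-up cheque history: two entries for the same employee on the same date,
# so we can see the inner GROUP BY sum them before the count runs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tChequeHistory (CreatedBy TEXT, EmployeeId INTEGER, CreatedDate TEXT, Amount REAL);
INSERT INTO tChequeHistory VALUES
  ('alice', 1, '2016-07-01', 100.0),
  ('alice', 1, '2016-07-01', 50.0),
  ('alice', 2, '2016-07-02', 75.0),
  ('bob',   1, '2016-07-01', 20.0);
""")

rows = con.execute("""
WITH tRelevantHistory (CreatedBy, EmployeeId, CreatedDate, Amount) AS (
    SELECT CreatedBy, EmployeeId, CreatedDate, SUM(Amount)
    FROM tChequeHistory
    GROUP BY CreatedBy, EmployeeId, CreatedDate
),
tCreatedCount (CreatedBy, EmployeeCount) AS (
    SELECT CreatedBy, COUNT(DISTINCT EmployeeId)
    FROM tRelevantHistory
    GROUP BY CreatedBy
)
SELECT h.CreatedBy, h.EmployeeId, h.CreatedDate, h.Amount, cnt.EmployeeCount
FROM tRelevantHistory h
LEFT JOIN tCreatedCount cnt ON (cnt.CreatedBy = h.CreatedBy)
ORDER BY h.CreatedBy, h.EmployeeId
""").fetchall()

for row in rows:
    print(row)
# alice's first row shows Amount 150.0 (two same-date entries summed)
# and EmployeeCount 2; bob's row shows EmployeeCount 1.
```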
If you have a function that returns a value for a single parameter, and you apply it to every row in a table, it slows down a lot. You are essentially calling the function from scratch for every row in the table.
Let’s say you have a FUNCTION that runs a semi-complicated algorithm on a single employee to get a status about that person on a particular date. For example, it returns the id for a table indicating whether they are sick, on vacation, working, dead, or any of a variety of states they could be in.
You would probably want to use that function to get the state of every employee on a particular date. However, that will be very inefficient. It will call that function for every employee, with no chance to take advantage of efficiencies that might be possible over the larger set.
In these cases, you may need to write two functions:
CREATE FUNCTION EmployeeStatusIdOnDate (@EmployeeId int, @OnDate Date) RETURNS int
CREATE FUNCTION EmployeesStatusIdOnDate (@OnDate Date)
RETURNS @tEmployeeStatus TABLE (EmployeeId int, StatusId int)
The first will get the status for one employee, the second for all employees.
Unfortunately, it often isn’t easy to get all of the statuses efficiently. When you are dealing with a single member, you can use the aggregate functions to get information about one employee quickly. However, those same aggregate functions become effectively useless if they are used over all employees. And the TOP clause becomes laughably inaccurate.
To demonstrate, here are two separate functions for doing the same thing, one for one employee, another for all employees. The calculation is for getting the status of an employee on a particular date. Complicating things is that we have to go through journals that can be cancelled, and they are sorted on two fields, EffectiveDate and TransactionDate.
We will start with the function optimized for one employee.
CREATE FUNCTION EmployeeStatusIdOnDate (@EmployeeId int, @OnDate Date)
RETURNS int
AS
BEGIN
    DECLARE @Result int;

    SELECT TOP 1 @Result = NewStatusId
    FROM tEmployeeStatusJournal j
    WHERE j.EmployeeId = @EmployeeId
      AND j.NewStatusId is not null
      AND j.EffectiveDate <= @OnDate
      AND j.Cancelled = 0
    ORDER BY j.EffectiveDate DESC, j.TransactionDate DESC

    RETURN COALESCE(@Result,
        (SELECT StatusId FROM tEmployeeStatuses WHERE StatusValue = 'Good'));
END
GO
This is the function we can use when we want to run the same operation on all employees. It is not pretty because we have two fields we are sorting on, so we have to do the complicated SQL dance to get the entries in the proper order.
CREATE FUNCTION EmployeesStatusIdOnDate (@OnDate Date)
RETURNS @tEmployeeStatus TABLE (EmployeeId int, StatusId int)
AS
BEGIN
    WITH StatusJournals (EmployeeId, NewStatusId, EffectiveDate, TransactionDate) AS (
        SELECT j.EmployeeId, j.NewStatusId, j.EffectiveDate, j.TransactionDate
        FROM tEmployeeStatusJournal j
        WHERE j.NewStatusId is not null
          AND j.EffectiveDate <= @OnDate
          AND j.Cancelled = 0
    ),
    MaxEffectiveDateStatus AS (
        SELECT j.EmployeeId, j.NewStatusId
        FROM StatusJournals j
        JOIN (
            -- Get the maximum transaction date for the maximum effective date
            SELECT two.EmployeeId, two.EffectiveDate,
                   MAX(two.TransactionDate) AS MaxTransactionDate
            FROM StatusJournals two
            JOIN
                -- Join with a table of employees and maximum effective dates
                (SELECT EmployeeId, MAX(EffectiveDate) AS MaxEffectiveDate
                 FROM StatusJournals
                 GROUP BY EmployeeId) one
              ON (one.EmployeeId = two.EmployeeId
                  AND one.MaxEffectiveDate = two.EffectiveDate)
            GROUP BY two.EmployeeId, two.EffectiveDate
        ) latestStatusEntry
          ON (j.EmployeeId = latestStatusEntry.EmployeeId
              AND latestStatusEntry.EffectiveDate = j.EffectiveDate
              AND latestStatusEntry.MaxTransactionDate = j.TransactionDate)
    )
    INSERT @tEmployeeStatus
    SELECT emp.EmployeeId,
           COALESCE(maxStat.NewStatusId,
               (SELECT StatusId FROM tEmployeeStatuses WHERE StatusValue = 'Good')) AS StatusId
    FROM tEmployee emp
    LEFT JOIN MaxEffectiveDateStatus maxStat ON (maxStat.EmployeeId = emp.EmployeeId)

    RETURN
END
But, now that we have the complicated table function, we can use it when we need to make a report of every employee's status on a particular day.
SELECT emp.Name, s.StatusValue AS CurrentStatus
FROM tEmployee emp
LEFT JOIN dbo.EmployeesStatusIdOnDate(GETDATE()) stats ON (stats.EmployeeId = emp.EmployeeId)
LEFT JOIN tEmployeeStatuses s ON (s.StatusId = stats.StatusId)
Cursors are inherently slower than the set-based approach that SQL encourages. However, there are some cases where it is actually faster to use a cursor. In those cases, I would encourage you to use a DO-WHILE style loop instead of a WHILE @@FETCH_STATUS = 0 loop.
The way a typical CURSOR is implemented is similar to this:
DECLARE myCursor CURSOR FOR
    SELECT ...
OPEN myCursor

FETCH NEXT FROM myCursor INTO @var1, @var2, @var3
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Do Stuff
    FETCH NEXT FROM myCursor INTO @var1, @var2, @var3
END

CLOSE myCursor;
DEALLOCATE myCursor;
While that works, it has a fundamental problem: you have code replication instead of code reuse. The FETCH NEXT FROM myCursor INTO @var1, @var2, @var3 line appears twice. If you change the SELECT statement used by the cursor, you will have to update both instances of FETCH NEXT with the new variable list. We are human; it is too easy to make a mistake and forget to correct both places. This is my preferred method of writing a cursor loop:
DECLARE myCursor CURSOR FOR
    SELECT ...
OPEN myCursor

WHILE (1=1)
BEGIN
    FETCH NEXT FROM myCursor INTO @var1, @var2, @var3
    IF (@@FETCH_STATUS <> 0) BREAK;
    -- Do Stuff
END

CLOSE myCursor;
DEALLOCATE myCursor;
A UNION and a UNION ALL are remarkably different.
The former will, secretly, drop duplicates. The latter will keep the duplicates. This has bitten me in the past, so I leave this as a warning to others.
If you are totalling up the hours worked by employees in two different tables, a UNION can make identical entries disappear and undercount the total. Be aware of the side-effects of the version you are using.
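The undercounting is easy to reproduce; here is a sketch using SQLite from Python, with two hypothetical hours tables where one row happens to be identical in both.

```python
import sqlite3

# Two made-up hours tables; the row (1, 8.0) appears in both.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tHoursA (EmployeeId INTEGER, Hours REAL);
CREATE TABLE tHoursB (EmployeeId INTEGER, Hours REAL);
INSERT INTO tHoursA VALUES (1, 8.0), (2, 7.5);
INSERT INTO tHoursB VALUES (1, 8.0), (3, 6.0);
""")

union_total = con.execute("""
    SELECT SUM(Hours) FROM (
        SELECT EmployeeId, Hours FROM tHoursA
        UNION
        SELECT EmployeeId, Hours FROM tHoursB
    ) t
""").fetchone()[0]

union_all_total = con.execute("""
    SELECT SUM(Hours) FROM (
        SELECT EmployeeId, Hours FROM tHoursA
        UNION ALL
        SELECT EmployeeId, Hours FROM tHoursB
    ) t
""").fetchone()[0]

print(union_total)      # 21.5: the duplicate (1, 8.0) row was silently dropped
print(union_all_total)  # 29.5: all four rows counted
```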
The syntax highlighting for the SQL here was done initially with tohtml.com
2016-7-4: Initial writing