Posted by: Joe | July 24, 2009

The effects of DISTINCT in a SQL query

Earlier today, I had a great Twitter conversation with Tim Mitchell (@Tim_Mitchell), Jorge Segarra (@SQLChicken), and Jack Corbett (@unclebiguns) about the use of DISTINCT in a SQL Server query. By the way, this is yet another example of how Twitter can be used in a good and positive way within the work environment and within the SQL Server Community; Twitter is not just a time sink that squeezes productivity out of your day as some claim.

Many people (not those mentioned previously) resort to using the DISTINCT keyword in a SELECT statement to remove duplicates returned by a query. The fact that the resultset has duplicates is frequently (though not always) the result of a poor database design, an ineffective query, or both. In any case, issuing the query without the DISTINCT keyword yields more rows than expected or needed so the keyword is employed to limit what is returned to the user.

So what’s the problem? Is using DISTINCT that really bad? Why not just type the 9 additional keystrokes that will save us a lot of hard work? (8 letters and a space. You thought I miscounted, didn’t you?)

Running without the DISTINCT keyword

To answer that question, let’s have a look at a quick and simple example. Let’s consider the following query run against the AdventureWorks database.

withoutdistinct

As you can see from the query and its execution plan above, SQL Server resolves this query using a Clustered Index Scan, aka a table scan. Generally speaking, table scans are not something you want to see in your query plans but in this case it makes sense. I’m asking for 4 columns from all rows in the table.

Now let’s look at the cost of the subtree below. We see that the estimated subtree cost for the SELECT statement is 0.438585.

Looking at the execution time information from SET STATISTICS TIME ON we see the following information.

SQL Server Execution Times:
CPU time = 15 ms,  elapsed time = 586 ms.

Running with the DISTINCT keyword

Now, let’s type those 9 additional keystrokes, adding the DISTINCT keyword into the query as shown below. Before we run this, we’ll clear out our buffers and procedure cache to make sure we’re comparing apples to apples.

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

NOTE: Do not run the aforementioned DBCC commands on a production server. If you do, your phone will ring, your pager will go vibrate, your users will have a hard time forgiving you, and performance will slow to a crawl for a little while.

withdistinct

A quick examination of the query plan reveals that a table scan is still being used to retrieve the data from the table. That’s to be expected. However, a new node has been added to the plan. A Sort node. This node is responsible for 75% of the total cost of the new query!

Looking at the node cost for the SELECT statement, we see that the cost shot up to 1.75954 from 0.438585. That’s a significant difference.

And finally comparing the time it took to resolve this query with the prior version, we can see that the total CPU time increased too.

SQL Server Execution Times:
CPU time = 78 ms,  elapsed time = 655 ms.

An exception to every rule

So, generally speaking the addition of DISTINCT to a SELECT statement will increase the workload for SQL Server. But this is not always the case. If a unique index exists that can guarantee that each row will indeed be distinct, the Sort operation can be skipped and the results immediately returned.

For example, consider the following trivial query as an example.

And compare it to the following query that includes the DISTINCT keyword.

Notice that the query plans are identical for these two queries. Why? A unique index exists on the ContactID column which guarantees that the rows will be distinct.

But wait, there’s more

These are really trivial examples of how DISTINCT can make a difference in a query plan and thus the performance of a query. In real life, very few queries are this simple. But I hope that these examples will serve to illustrate that DISTINCT does add an addtional load on the SQL Server. So, if you can do a little work up front by rewriting your query, you can avoid having use DISTINCT in your query. SQL Server will thank you for it.

For more information how a unique index can help performance, have a look at Rob Farley’s blog post entitled “Unique Indexes with GROUP BY“. It’s a good read.

Have you resorted to using DISTINCT to limit the results of a query? I’d love to hear of your experiences. Or if you’ve found an better alternative, please share.

Joe



Responses

  1. Joe, good post. I’ve been a critic of the overuse of DISTINCT for a while, largely because it’s often used to mask underlying query problems. I’ve seen it used in a “just in case we end up with dupes in this query” methodology far too often. I’ll bookmark this post as tangible evidence that it should be used with caution.

  2. Thanks Joe, like Tim said I’ll be referencing this post in the future if I ever get any flak or argument about DISTINCT queries!

  3. The effects of DISTINCT in a SQL query…

    Kudos for a great Sql Server article – Trackback from SqlServerKudos…

  4. [...] The effects of DISTINCT in a SQL query (Joe Webb) [...]

  5. [...] Joe Webb explains The effects of DISTINCT in a SQL query. [...]

  6. Good article but as u written distinct performs one more functionality on resultset. So, definately it should add extra time so i don’t think it is really bad as it is doing one more functionality for us.

  7. Great post Joe, thanks for sharing it with us.

    I also got similar results when conducting a Performance Comparison of (SELECT TOP 1) Verses MAX(), i.e. it depends :-)

    http://tinyurl.com/kojbp2

  8. [...] The Effects of SELECT DISTINCT – Joe Webb explains with scripts and diagrams. [...]

  9. [...] The effects of DISTINCT in a SQL query [...]

  10. [...] Webb of WebbTech exposes the effects of DISTINCT in a SQL query, introducing it thus: “Many people (not those mentioned previously) resort to using the [...]

  11. Nice post! Having spent a lot time tuning SQL queries, I’ve learned to be suspicious of any queries that use DISTINCT. Some times the use is legitimate, but more frequently you can get rid of them with properly SQL manipulation.

  12. Hi Joe! At the end you ask for experiences or alternatives. One alternative that I have found, is putting additional criteria in the join clause, which I wrote about here: http://www.sqlteam.com/article/additional-criteria-in-the-join-clause.

    I recently ran into this again, when debugging a query that joined to a junction table and needed additional criteria in order to return only one row instead of multiple (e.g. LEFT JOIN myJunction as J ON T1.Key1 = J.Key1 AND J.Key2 = ‘KnownLimitingValue’).

    A few months ago I also wrote on this subject, not so much from a performance point of view, but from an accuracy point of view. See http://weblogs.sqlteam.com/markc/archive/2008/11/11/60752.aspx.

  13. thanks….this is very usefull

  14. [...] The effects UNION in a SQL query In a recent blog, I shared some of the natural but sometimes unanticipated consequences of adding DISTINCT to a SELECT statement. Adding DISTINCT ensures that you don’t have duplicated rows in your result set, but that assurance comes at a price. Performance is typically impacted negatively when you add DISTINCT to a query. You can read the blog and see the proof here. [...]

  15. [...] In a recent blog, I shared some of the natural but sometimes unanticipated consequences of adding DISTINCT to a SELECT statement. Adding DISTINCT ensures that you don’t have duplicated rows in your result set, but that assurance comes at a price. Performance is typically impacted negatively when you add DISTINCT to a query. You can read the blog and see the proof here. [...]

  16. Thank you very much for the details of DISTINCT clause! It gave a clear picture

  17. Thanks Suman. I’m glad you’ve found these posts useful.

    Joe


Leave a response

Your response:

Categories