The effects of DISTINCT in a SQL query

Earlier today, I had a great Twitter conversation with Tim Mitchell (@Tim_Mitchell), Jorge Segarra (@SQLChicken), and Jack Corbett (@unclebiguns) about the use of DISTINCT in a SQL Server query. By the way, this is yet another example of how Twitter can be used in a good and positive way within the work environment and within the SQL Server Community; Twitter is not just a time sink that squeezes productivity out of your day as some claim.

Many people (not those mentioned previously) resort to using the DISTINCT keyword in a SELECT statement to remove duplicates returned by a query. The fact that the resultset has duplicates is frequently (though not always) the result of a poor database design, an ineffective query, or both. In any case, issuing the query without the DISTINCT keyword yields more rows than expected or needed so the keyword is employed to limit what is returned to the user.

So what’s the problem? Is using DISTINCT that really bad? Why not just type the 9 additional keystrokes that will save us a lot of hard work? (8 letters and a space. You thought I miscounted, didn’t you?)

Running without the DISTINCT keyword

To answer that question, let’s have a look at a quick and simple example. Let’s consider the following query run against the AdventureWorks database.

withoutdistinct

As you can see from the query and its execution plan above, SQL Server resolves this query using a Clustered Index Scan, aka a table scan. Generally speaking, table scans are not something you want to see in your query plans but in this case it makes sense. I’m asking for 4 columns from all rows in the table.

Now let’s look at the cost of the subtree below. We see that the estimated subtree cost for the SELECT statement is 0.438585.

Looking at the execution time information from SET STATISTICS TIME ON we see the following information.

SQL Server Execution Times:
CPU time = 15 ms,  elapsed time = 586 ms.

Running with the DISTINCT keyword

Now, let’s type those 9 additional keystrokes, adding the DISTINCT keyword into the query as shown below. Before we run this, we’ll clear out our buffers and procedure cache to make sure we’re comparing apples to apples.

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

NOTE: Do not run the aforementioned DBCC commands on a production server. If you do, your phone will ring, your pager will go vibrate, your users will have a hard time forgiving you, and performance will slow to a crawl for a little while.

withdistinct

A quick examination of the query plan reveals that a table scan is still being used to retrieve the data from the table. That’s to be expected. However, a new node has been added to the plan. A Sort node. This node is responsible for 75% of the total cost of the new query!

Looking at the node cost for the SELECT statement, we see that the cost shot up to 1.75954 from 0.438585. That’s a significant difference.

And finally comparing the time it took to resolve this query with the prior version, we can see that the total CPU time increased too.

SQL Server Execution Times:
CPU time = 78 ms,  elapsed time = 655 ms.

An exception to every rule

So, generally speaking the addition of DISTINCT to a SELECT statement will increase the workload for SQL Server. But this is not always the case. If a unique index exists that can guarantee that each row will indeed be distinct, the Sort operation can be skipped and the results immediately returned.

For example, consider the following trivial query as an example.

And compare it to the following query that includes the DISTINCT keyword.

Notice that the query plans are identical for these two queries. Why? A unique index exists on the ContactID column which guarantees that the rows will be distinct.

But wait, there’s more

These are really trivial examples of how DISTINCT can make a difference in a query plan and thus the performance of a query. In real life, very few queries are this simple. But I hope that these examples will serve to illustrate that DISTINCT does add an addtional load on the SQL Server. So, if you can do a little work up front by rewriting your query, you can avoid having use DISTINCT in your query. SQL Server will thank you for it.

For more information how a unique index can help performance, have a look at Rob Farley’s blog post entitled “Unique Indexes with GROUP BY“. It’s a good read.

Have you resorted to using DISTINCT to limit the results of a query? I’d love to hear of your experiences. Or if you’ve found an better alternative, please share.

Joe

About these ads

25 Responses to The effects of DISTINCT in a SQL query

  1. Tim Mitchell says:

    Joe, good post. I’ve been a critic of the overuse of DISTINCT for a while, largely because it’s often used to mask underlying query problems. I’ve seen it used in a “just in case we end up with dupes in this query” methodology far too often. I’ll bookmark this post as tangible evidence that it should be used with caution.

  2. Thanks Joe, like Tim said I’ll be referencing this post in the future if I ever get any flak or argument about DISTINCT queries!

  3. Pingback: SqlServerKudos

  4. Pingback: Dew Drop – Weekend Edition – July 25-26, 2009 | Alvin Ashcraft's Morning Dew

  5. Pingback: Weekly Link Post 103 « Rhonda Tipton’s WebLog

  6. Rajkumar says:

    Good article but as u written distinct performs one more functionality on resultset. So, definately it should add extra time so i don’t think it is really bad as it is doing one more functionality for us.

  7. John Sansom says:

    Great post Joe, thanks for sharing it with us.

    I also got similar results when conducting a Performance Comparison of (SELECT TOP 1) Verses MAX(), i.e. it depends :-)

    http://tinyurl.com/kojbp2

  8. Pingback: Weekly Links Recap for July 31 | Brent Ozar - SQL Server DBA

  9. Pingback: Something for the weekend 31/07/09 | John Sansom - SQL Server DBA in the UK

  10. Pingback: Log Buffer #156: a Carnival of the Vanities for DBAs | Pythian Group Blog

  11. Andre Araujo says:

    Nice post! Having spent a lot time tuning SQL queries, I’ve learned to be suspicious of any queries that use DISTINCT. Some times the use is legitimate, but more frequently you can get rid of them with properly SQL manipulation.

  12. AjarnMark says:

    Hi Joe! At the end you ask for experiences or alternatives. One alternative that I have found, is putting additional criteria in the join clause, which I wrote about here: http://www.sqlteam.com/article/additional-criteria-in-the-join-clause.

    I recently ran into this again, when debugging a query that joined to a junction table and needed additional criteria in order to return only one row instead of multiple (e.g. LEFT JOIN myJunction as J ON T1.Key1 = J.Key1 AND J.Key2 = ‘KnownLimitingValue’).

    A few months ago I also wrote on this subject, not so much from a performance point of view, but from an accuracy point of view. See http://weblogs.sqlteam.com/markc/archive/2008/11/11/60752.aspx.

  13. wardany says:

    thanks….this is very usefull

  14. Pingback: The effects UNION in a SQL query « WebbTech Solutions

  15. Pingback: The effects UNION in a SQL query | ButtonForums

  16. Suman says:

    Thank you very much for the details of DISTINCT clause! It gave a clear picture

  17. Joe Webb says:

    Thanks Suman. I’m glad you’ve found these posts useful.

    Joe

  18. Pingback: My Most Popular Posts From 2010 « WebbTech Solutions

  19. Nathan says:

    Thanks, I was having trouble at work trying to figure out why a DISTINCT keyword was increasing the time of the query in a MySQL db. This makes it much more clear why :)

  20. Sadia Aziz says:

    nice post! easy to understand with illustrations u gave :)

  21. Pingback: New Offer from Sony BDP-S590 3D Blu-ray Disc Player, watch video demo inside

  22. Good article. I definitely appreciate this website.
    Keep writing!

  23. Sagar says:

    hi,joe
    Awesome article about DISTINCT…
    Done best practical with Load testing and all…
    Really this article help me a lot…
    and I really appreciate your style of writing solution with this way..

    Thanks a Lot keep posting

  24. Thanks designed for sharing such a nice thinking, post is good,
    thats why i have read it entirely

  25. hard reset says:

    Very good article! We will be linking to this particularly great article on our
    site. Keep up the great writing.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 32 other followers