Gert-Jan's SQL Performance - example 3

It is a common scenario to find table rows based on a free text search.

The typical example is a table (Test) with a clustered index on its Primary Key and a nonclustered index on the varchar column that we are interested in. Next, there is a query that searches all rows for a specific text.

CREATE TABLE Test
(id     int          IDENTITY  CONSTRAINT PK_Test PRIMARY KEY CLUSTERED
,t      varchar(700) NOT NULL
,filler char(90)         NULL
)

CREATE INDEX IX_Test_t ON Test(t)

SELECT *
FROM   Test
WHERE  t LIKE '%course%'

For this example, table Test was filled with 2,241 rows, with an average of 71 characters in column t.
The specified query will return just 2 rows.

SQL Server 2000

Since we are not talking about Full Text Indexing, there are only two strategies for this query. The optimizer can either choose to scan the clustered index, or to scan the nonclustered index and look up all matching rows in the clustered index.

The strategy to scan the nonclustered index only makes sense if the query returns just a few rows, because the nonclustered index is relatively large (about 50% of the size of the clustered index)

So this is what SQL Server 2000 does, regardless of the text we are looking for.

The clustered index is scanned, and its associated cost is 52 logical reads.

This is the safe query plan, and the best plan if the query returns more than a few rows. However, for this particular query - which returns only 2 rows - it would have been better to scan the nonclustered index.

SQL Server 2005

SQL Server 2005 introduces String Summary statistics. There are some specific requirements that determine whether the engine maintains this information, but basically they are created automatically when they could be useful.

So let's check if String Summary information is available in our example.

DBCC SHOW_STATISTICS(Test,IX_Test_t)

The (trimmed) first part of the output is shown below, and the last column shows that String Summary statistcs are available.

Name       Updated              Rows   Rows Sampled  Steps  Density     Average key length  String Index 
---------- -------------------- ------ ------------- ------ ----------- ------------------- ------------ 
IX_Test_t  Jul 20 2008  4:41PM  2241   2241          193    0.99199599  74.941544           YES

The following query plan is used for our example:

We see that the nonclustered index is scanned, and only the matching rows are looked up in the clustered index. The associated cost is 38 logical reads, which is 27% less than scanning the clustered index

If you know more...

This example shows how SQL Server 2005 will do better out-of-the-box then SQL server 2000. However, if you know that the query will only return a few rows, or if you are only interested in the first few rows, then there is a trick to help the optimizer: Simply make it a TOP (x) ... ORDER BY query!

SELECT TOP 100 *
FROM   Test
WHERE  t LIKE '%course%'
ORDER BY t

Now, it will execute very well on SQL Server 2000 too.

Now, the nonclustered index is scanned too, and the top 100 matches are looked up in the clustered index. The associated cost for this query: 41 logical reads.

LIKE '%foo%'

SQL Server 2000

SQL Server 2005

If you know more...