Author: Kalen Delaney et al.
Publisher: Microsoft Press
Audience: DBAs and SQL Developers
Reviewer: Ian Stirk
Chapter 7 Indexes: internals and management
Indexes are perhaps the primary tool for significantly improving performance. While indexes can speed data retrieval, they can be detrimental to data modifications (i.e. inserts, updates, and deletes), so any index changes should be thoroughly tested.
The chapter starts by comparing the pages of a book and its index to a clustered index and non-clustered indexes. Most indexes have a B-tree structure, consisting of a root, various intermediate levels, and a leaf level. The idea behind the B-tree is that the inverted tree structure is balanced, so there is an equal number of steps from the root node to any leaf-level data. Examples are provided of an index with a large key column and another with a very narrow key column, illustrating the advantage of a narrow key when looking for data.
Two DMVs are discussed as tools for analysing indexes. The first is sys.dm_db_index_physical_stats, which details the structure and health of an index; example SQL is provided showing how to use it. The second is sys.dm_db_database_page_allocations, which shows very detailed output for every page used by, or allocated to, the databases, objects, or indexes specified.
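To give a flavour of the first DMV, the sketch below checks the depth and fragmentation of a table's indexes. The table name dbo.Orders is an assumption for illustration; the DMV itself and its parameters are standard.

```sql
-- Hypothetical example: inspect depth, size, and fragmentation of the
-- indexes on dbo.Orders (table name is an assumption) in the current database.
SELECT i.name AS index_name,
       ips.index_type_desc,
       ips.index_depth,
       ips.page_count,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'DETAILED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;
```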
The chapter continues with a discussion of the B-tree structure, first showing that the leaf level of a clustered index is the actual table data. The clustered index key is always unique: either the key itself is unique (for example, a primary key), or a unique internal identifier is added. All columns of the clustered index key are included in any non-clustered indexes, hence the desire to keep the clustered key narrow where possible. Non-clustered indexes are then discussed; in essence they contain duplicate data, and they can fulfil a query directly (a covered query) or may require access to the underlying table (a bookmark lookup). By default, a clustered index is created when you create a primary key. This section is very much descriptive.
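The default behaviour described above can be sketched in a few lines of SQL (table and index names here are illustrative, not from the book):

```sql
-- A primary key creates a clustered index by default.
CREATE TABLE dbo.Customers          -- table name is an assumption
(
    CustomerID int NOT NULL
        CONSTRAINT PK_Customers PRIMARY KEY,   -- clustered by default
    Email      varchar(200) NOT NULL
);

-- Every row of this non-clustered index also carries the clustered key
-- (CustomerID), which is why a narrow clustered key matters.
CREATE NONCLUSTERED INDEX IX_Customers_Email ON dbo.Customers (Email);
```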
The chapter again becomes very detailed in discussing the physical index structures for B-trees. The index row format, for both clustered and non-clustered indexes, is examined using the DMVs sys.dm_db_index_physical_stats and sys.dm_db_database_page_allocations, together with the DBCC PAGE command. Plenty of sample code is given and the output explained.
Most of the chapter up to now has related to traditional row-based B-tree indexes. The emphasis now changes to columnstore indexes. Columnstore indexes are column-based instead of row-based, and because similar data is stored together, compression and in-memory technology can provide large performance improvements. In SQL Server 2012, columnstore indexes are read-only, and especially suitable for data warehouses (note that in SQL Server 2014, columnstore indexes will be updateable). A step-by-step example of creating a columnstore index is given, and examined with various system views.
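A minimal sketch of what such a creation looks like, assuming a hypothetical fact table (the names below are not from the book):

```sql
-- Non-clustered columnstore index on an assumed fact table; in
-- SQL Server 2012 the table becomes read-only once this index exists.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, ProductKey, SalesAmount);

-- The resulting column segments can be inspected via system views:
SELECT * FROM sys.column_store_segments;
```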
Overall a very interesting chapter, this is what I expected an internals book to be about. The chapter describes how data is physically stored in the various index and table structures. This chapter is very deep, with several sections requiring more than one reading (for me at least). Some very sensible advice is given about there being no absolutes in choosing a clustered key, just some best practices that work for most tables.
I had expected something about the importance of statistics in this chapter on indexes, and while this is given later in Chapter 11 (The Query Optimizer), no cross-reference is made. While foreign keys were discussed, I would have thought it worthwhile pointing out that by default foreign keys don't have indexes on them.
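That last point is easy to demonstrate: creating a foreign key constraint does not create a supporting index, so one usually has to be added by hand (all names below are illustrative):

```sql
-- The constraint alone creates no index on Orders.CustomerID.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerID) REFERENCES dbo.Customers (CustomerID);

-- A supporting index for joins and cascading actions must be created explicitly.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);
```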
Chapter 8 Special storage
The previous chapters have looked at the storage of typical data that fits into the 8KB page. This chapter discusses all the other types of data, i.e. row-overflow, LOB, FILESTREAM, and FileTable data. Additionally, various storage features are also examined here, including sparse columns, compression, and partitioning.
The 8KB page-size limit can be exceeded with variable-length columns, e.g. varchar. Example code is provided to examine the data using system views, including the ROW_OVERFLOW_DATA allocation unit; again, the DBCC PAGE command can be used to examine detailed data at the page level.
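A small sketch of the effect, assuming a throwaway table (not the book's example): two 5,000-byte values cannot share one 8KB page, so one column is pushed off-row.

```sql
-- Two varchar(5000) values exceed the ~8,060-byte in-row limit together.
CREATE TABLE dbo.Overflow (c1 varchar(5000), c2 varchar(5000));
INSERT INTO dbo.Overflow VALUES (REPLICATE('a', 5000), REPLICATE('b', 5000));

-- The system views now show a ROW_OVERFLOW_DATA allocation unit with pages.
SELECT au.type_desc, au.total_pages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p ON p.hobt_id = au.container_id
WHERE p.object_id = OBJECT_ID('dbo.Overflow');
```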
A more admin-oriented, how-to approach to FILESTREAM follows, covering its enabling, creation, modification, and logging – in essence a step-by-step approach. The metadata used to examine FILESTREAM is also described. Similarly, FileTable is described as a way of extending the FILESTREAM capabilities. The section ends with a review of the performance of FILESTREAM data.
Sparse columns are discussed next; these conserve space when most of a column's values are null. A null takes no space; however, non-null values are both more expensive to process and require more storage. Sparse columns work best when around 90% of a column's values are null. Examples are provided to examine sparse columns using system views.
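The declaration itself is a single keyword; the sketch below (table and column names are assumptions) shows it alongside the system view that exposes the property:

```sql
-- A sparse column must be nullable; NULLs then occupy no storage,
-- at the cost of extra overhead for non-NULL values.
CREATE TABLE dbo.Products
(
    ProductID int NOT NULL PRIMARY KEY,
    Colour    varchar(20) SPARSE NULL
);

-- sys.columns shows which columns are sparse:
SELECT name, is_sparse
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.Products');
```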
The chapter proceeds with data compression. Compression means more data can be stored on a page, so you typically need less I/O to get your data; this is especially useful for data warehouses, or indeed any system that scans large amounts of data.
Row compression is described as compression that fits data into smaller data types, whereas page compression removes repeated patterns within a page. Rows are examined with both row and page compression, again in plenty of detail. System tables are used to examine the limited amount of compression metadata available.
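Enabling either form is a one-line rebuild; the sketch below (table name is an assumption) also shows the main piece of compression metadata, which lives in sys.partitions:

```sql
-- Row compression stores values in the smallest form that fits;
-- page compression additionally removes repeated patterns within a page.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = ROW);
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

-- The compression setting per partition:
SELECT partition_number, data_compression_desc
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.FactSales');
```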
The chapter ends with a discussion of table and index partitioning. Tables and indexes that are not explicitly partitioned have a single partition. Partitioning typically improves data management, but may also improve performance. The section begins with a how-to, and proceeds to examine the metadata for partitioning. The advantages of the sliding-window technique for partitions are illustrated.
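For orientation, a minimal partitioning setup looks like the following; the boundary values, names, and single-filegroup mapping are all assumptions for illustration:

```sql
-- Partition function: splits rows into ranges by date.
CREATE PARTITION FUNCTION pfYear (date)
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2012-01-01');

-- Partition scheme: maps every partition to a filegroup.
CREATE PARTITION SCHEME psYear
    AS PARTITION pfYear ALL TO ([PRIMARY]);

-- Table created on the scheme, partitioned by SaleDate.
CREATE TABLE dbo.Sales (SaleDate date NOT NULL, Amount money)
    ON psYear (SaleDate);

-- A sliding window is then implemented with ALTER TABLE ... SWITCH
-- plus ALTER PARTITION FUNCTION ... SPLIT/MERGE RANGE.
```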
This was a mixed chapter, containing both an admin how-to approach and page-level physical detail. It was also mixed because it contained details of both the storage of non-standard rows and of partitioning and compression – it could be argued the latter belong in their own separate chapter.
Chapter 9 Special Indexes
A previous chapter looked at the more typical row-based (B-tree) and column-based (columnstore) indexes. This chapter looks at other types of index, specifically XML indexes, spatial indexes, full-text indexes, and semantic indexes.
The chapter discusses the creation of XML indexes, both a primary index (a clustered index) and secondary indexes (of types PATH, PROPERTY, and VALUE). Example code is provided to supplement the discussion. The use of XML indexes in query plans is examined and explained, as is the use of XQuery to navigate and obtain data.
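In outline, the creation looks like this sketch (table, column, and index names are assumptions; note a primary XML index requires a clustered primary key on the table):

```sql
CREATE TABLE dbo.Docs (DocID int PRIMARY KEY, Body xml);

-- The primary XML index must exist before any secondary XML index.
CREATE PRIMARY XML INDEX PXML_Docs ON dbo.Docs (Body);

-- Secondary index of type PATH; PROPERTY and VALUE are analogous.
CREATE XML INDEX SXML_Docs_Path ON dbo.Docs (Body)
    USING XML INDEX PXML_Docs FOR PATH;
```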
The chapter continues with spatial indexes, which are built on the spatial GEOMETRY and GEOGRAPHY data types; unlike B-tree index keys, spatial data has at least two dimensions. The use of spatial indexes in query plans is examined and explained.
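A minimal sketch of creating one (names and bounding-box extents are assumptions; GEOMETRY indexes require an explicit BOUNDING_BOX, GEOGRAPHY indexes do not):

```sql
CREATE TABLE dbo.Places (PlaceID int PRIMARY KEY, Shape geometry);

CREATE SPATIAL INDEX SIdx_Places_Shape ON dbo.Places (Shape)
    WITH (BOUNDING_BOX = (0, 0, 500, 500));   -- extents are illustrative
```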
The section on full-text indexes describes their components and internal tables. This is followed by details of the metadata used to view the indexes. Example code is provided to create a full-text index.
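In the same spirit as the book's example, a bare-bones creation looks like this (catalog, table, and key-index names are assumptions):

```sql
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

-- KEY INDEX must name a unique, single-column, non-nullable index
-- on the table (here an assumed primary-key index name).
CREATE FULLTEXT INDEX ON dbo.Docs (Body)
    KEY INDEX PK_Docs
    ON ftCatalog;
```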
Lastly, semantic indexes are covered; these are new in SQL Server 2012, and build on full-text indexes to enable searching for key phrases, and for similar documents, in unstructured data.
In summary, this chapter describes the special indexes, provides SQL to create them, explains how they appear in query plans, and gives the metadata used to examine them.
In the preceding chapters the emphasis was on the different types of storage, how they're physically stored, and how you can use metadata to view them. The next few chapters describe the processes of query execution, query optimization, and plan caching.
Chapter 10 Query execution
The query optimizer is responsible for creating good execution plans, which are then executed by the execution engine. This engine retrieves and updates data in tables and indexes, and implements joins etc. By examining the query plans it is possible to understand what the engine is trying to do, and perhaps influence its behaviour.
The chapter starts with a look at the basic component of query processing, the operator (also called an iterator). These perform discrete functions such as scanning a table, filtering data, or updating data. Being discrete, they are easily reused. Iterators can contain children, giving a tree-like appearance in execution plans. There are often many ways of fulfilling a query, so various combinations of operators can potentially be used to create an execution plan.
Iterators are typically divided into blocking and non-blocking. Blocking iterators process all their input rows before producing any output rows (e.g. sort), whereas non-blocking iterators produce output rows as they consume input rows (e.g. a scalar computation). Blocking iterators often consume memory, and sometimes cause ‘interesting’ memory-spill problems.
The chapter proceeds with a section on reading query plans. There are three formats in which to read these: graphical (easiest on the eye), text (very useful when the query plan is large), and XML (easiest to process programmatically). The differences and advantages of each of the three formats are explained, as well as how to read each of them (there is a lot of useful information if you know where to look). The difference between estimated and actual plans is discussed, and various SHOWPLAN and statistics commands are explained (e.g. SET STATISTICS XML ON).
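As a quick sketch of those commands (the query itself is an assumption), an actual plan in XML form is captured like so:

```sql
-- Actual plan: the query runs, and an XML plan with runtime
-- row counts is returned alongside the results.
SET STATISTICS XML ON;
SELECT COUNT(*) FROM dbo.Orders WHERE CustomerID = 42;
SET STATISTICS XML OFF;

-- For an estimated plan (query not executed), use
-- SET SHOWPLAN_XML ON instead.
```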
In analysing plans, only the more common operators are explained; however, from these you should be able to deduce how the more complex operators work. Scans, seeks, bookmark lookups, and joins are very common operations and are examined in detail. The section then dives deeper with a look at aggregations, unions, and advanced index operations (i.e. dynamic index seeks, index unions, and index intersections). Subqueries, parallelism, and data modification are also examined in detail.
The chapter ends with a look at columnstore indexes, which use the newly introduced batch-mode processing (as opposed to row-by-row processing); example code is provided and the associated query plan discussed.
This chapter contained a mix of discussion, practical example code, and deep dives into what the engine is doing. It explains the basics of query processing, describes the basic iterators, and explores how to examine the query plan in each of the three formats (graphical, text, and XML). It explains that most of the time the optimizer produces good plans, but sometimes it gets them wrong.
I did wonder if Chapter 11 (The Query Optimizer) should come before this chapter on query execution; that said, there is much in common between the two. The example code provided is based on a downloadable Northwind database, making it easy to follow. Acknowledging there is no single best join type, there is an excellent summary grid that compares the properties of the three join types and when each is perhaps best used.
There are several good incidental tips including:
if the row counts on actual and estimated query plans differ significantly, it’s likely the statistics are stale
the ‘cost threshold for parallelism’ is rarely used, but can be very useful
using an INCLUDE can reduce bookmark lookups
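The last tip is worth a sketch: an INCLUDE column is stored only at the leaf level of the index, making it covering for matching queries (table, column, and index names below are assumptions):

```sql
-- Without the INCLUDE, fetching TotalDue would need a bookmark lookup
-- back to the base table for every matching row.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    INCLUDE (TotalDue);

-- A query such as
--   SELECT OrderDate, TotalDue FROM dbo.Orders WHERE OrderDate = '20120601';
-- can now be answered from the index alone (a covered query).
```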