Page 1 of 7
Author: Dmitri Korotkevitch
Audience: DBAs, Database developers
Reviewer: Ian Stirk
Another SQL Server internals book hits the market, is it needed? Let's find out...
Understanding the fundamentals of SQL and SQL Server is relatively easy, however it soon becomes necessary to gain a deeper understanding in order to provide better solutions and improved performance.
With the recent publication of the notable Microsoft SQL Server 2012 Internals, see my review, I did wonder, like the author himself, if we needed another internals book. However, there is room for a more practical book, useful for both database developers and administrators.
The book covers various versions of SQL Server, from 2005 to 2014, with version specific information given where appropriate. There are 35 chapters divided into eight sections.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Data Storage Internals
The chapter starts with a look at the file components of a SQL Server database, i.e. a log file, primary data file (.mdf) and zero or more secondary data files (.ndf). The importance of filegroups for improved administration is explained, as is the improved performance associated with having multiple disks. Database auto-growth, instant file initialization (IFI) and autoshrink are examined and best practices suggested.
Korotkevitch then moves onto explaining the physical structure of database pages (header bits, data/free space, and offset array), and the structure of individual rows (fixed and variable length datatypes and Large Object Store). Mixed and uniform extents are examined, together with allocation maps that keep track of extents and page use. GAM, SGAM and IAM are all explained.
The steps involved when data is modified are detailed, covering the physical disks, memory and the buffer pool. The update requires a synchronous write to the transaction log, while the data files are updated later, asynchronous, via a checkpoint process.
This chapter provides a detailed look of the database’s physical components and their structure. Throughout the chapter (and book), practical example code is given to support the assertions made. There are good links to other related chapters, and plenty of useful incidental tips along the way (again, throughout the whole book).
I particularly liked the code used to determine if Instant File Initialization (IFI) is enabled (uses a trace flag). There’s some useful code to examine a physical page via DBCC IND and DBCC PAGE, again all very practical. It might have been useful to mention compression here, but that’s a minor point.
Chapter 2 Tables and Indexes Internal Structure and Access Methods
The chapter starts by examining the structure of heaps, these are tables without a clustered index. They tend to be liberal with page space, and subsequent updates can lead to forward pointers, both can lead to increased I/O. Rebuilds will remove the forward pointers. Heaps are good when you want to import data quickly, however, if subsequent processing is required, a clustered index often gives better performance.
Clustered indexes ensure the data is sorted (note, the data on a given page may be unsorted, but the page’s offset array is sorted. The leaf level contains the actual data, the pages are linked via a double-linked list. A clustered index is particularly efficient with queries that require scans.
Non-clustered indexes define a separate sort order for the columns in a separate index. The example of a book is given to show the difference between a clustered index (sequential pages) and a non-clustered index (alphabetic index at back of book). The non-clustered index contains pointers to the underlying clustered index, showing the importance of a small clustered index. Non-clustered indexes are efficient when retrieving a small number of identifiable rows. Lookups to the underlying clustered index can be costly when a large number of rows are concerned.
This chapter contains a good description of the various types of indexes, their purpose and structure, with good supporting diagrams. It contains the best general description of low-level reads/write and problems of heaps I’ve seen. There is some great detail e.g. merry-go-round scan allows multiple tasks to share same index scan. There’s a good description of SARGable predicates, what they are and how they’re used.
Chapter 3 Statistics
SQL Server uses a cost based optimizer to create a ‘good enough’ execution plan. The optimizer uses statistics to obtain cardinality estimates, this affects the join type, if an index is used, how the index is used and memory required. Indexes automatically have statistics, these can be supplemented by user-created statistics.
The state of the statistics is examined using DBCC show_statistics, and the three grids output explained (e.g. last updated, sampling, and densities). Various conditions cause the statistics to get updated, however, stale statistics can lead to inefficient execution plans and thus slow queries, so it’s important to keep the statistics up to date. Various methods of manually updating the statistics are described. The DMV sys.dm_db_stats_properties can be used to monitor statistics changes.
The chapter describes the new cardinality estimator in SQL Server 2014, this produces better plans for non-correlated predicates or values that lie outside the statistics histogram. It is noted that the legacy cardinality estimator is better for correlated predicates, and a trace flag can be used to swap between the cardinality estimators used.
The chapter gives a good overview of the importance of statistics. Code examples illustrate the effect of stale statistics, and how the estimated and actual row counts in the execution plan are a good indicator of state statistics. Outdated statistics are a very common cause of slow query performance. There’s a very good example that illustrates the impact of state statistics on memory grants, leading to spilling to tempdb.
Chapter 4 Special Indexing and Storage Features
This chapter contains a miscellany of index and table related matter. The chapter opens with a look at include columns, these are non-key columns in non-clustered index that reduce lookups, thus improving performance. Filtered indexed are described, being an index with a WHERE clause, in essence producing an index on a subset of data, again potentially improving performance. Corresponding filtered statistics are also examined. The use of calculated columns to improve performance is illustrated in a series of practical tests.
The chapter then discusses data compression. Both row and page compressions are described in detail. Typically, compression results in an increase in CPU usage, however this is usually offset by improved I/O (more data per read), leading to an improvement in performance. The routine sp_estimate_data_compression_savings is used to determine if table or index compression is worthwhile. Lastly sparse columns are described, being optimized for NULL values, potentially giving significant space savings.
This chapter consists of a mix of topics, loosely framed around tables and indexes. There’s a useful discussion of when to use included columns (when have lookups), and filtered indexes. There’s useful code for showing the importance of compression. Table 4-3 seems to have its headers juxtaposed.