Page 1 of 3
Author: Robert Pearl
Audience: DBAs and developers
Reviewer: Ian Stirk
This book aims to ensure your SQL Server databases are healthy, how does it fare?
Robert Pearl sets out to provide you with the knowledge to ensure your SQL Server databases are healthy, this includes aspects of performance, maintenance, high availability and disaster recovery.
His book is targeted at database administrators and developers, additionally, system administrators might also find the content helpful.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Introduction to Healthy SQL
The chapter opens with a look at the role of the DBA, being involved with backups and restores, creating database objects, and resolving performance problems. The importance of having a database that performs well, is auditable, and regularly maintained (i.e. is healthy), is discussed in terms of a SQL health check.
The chapter expands on the components of a SQL health check, including:
- Performance – primary reason for a SQL health check. Identify why queries are slow then improve them. Spot issues before they’re escalated. Proactive approach
- Security – the need to protect data
- Stability – apply service packs and relevant cumulative updates, in a timely manner
- Audits – prove the instance is healthy, at any time
- Backups – often first line of defence when have problems
- Business continuity – ensure systems can continue after a disaster
The chapter continues with a look at when to perform a heath check (now), highlighting the importance of creating a baseline – against which future changes can be compared. Helpful website links are provided (as they are throughout the book).
This chapter provides a helpful overview of what a SQL health check is, why it should be performed, some potential problems if it’s not performed, and some of the tools used. It also outlined what to expect from the rest of the book.
I found parts of the chapter chatty and informal (this occurs throughout the book), which may be suitable for some readers. The Surface Area Configuration tool is referenced as SAG, this should be SAC. There’s a nonsensical sentence that reads “I suggest that you the entire the book, at least once...” These problems should have been caught by the editors/reviewers.
Chapter 2 Creating a Road Map
This chapter begins with the need for a checklist, this will detail what you need to do and when it should be done. It can act as both a motivator and a schedule. It continues by defining various performance related terms that you’ll need to understand, these include: baseline, delta, deviation, trending, metadata, peak, and nonpeak.
Before the checklist can be completed, an inventory of the current system should be created. Useful SQL code is provided that examines the current system (e.g. SQL Server version, clustering). The Performance Tuning Lifecycle is briefly enumerated as: collect data, baseline, compare/delta, identify any health issues, establish best practices, determine changes/implement, repeat.
The purpose of the checklist is to gather data and document. There can be many different checklists e.g. pre-install list, service list etc. Such lists tend to evolve, and can be found on the web. The chapter then looks briefly at what information to collect, often CPU, memory, I/O, locking and blocking are important pieces of data. Such information can be collected from various internal structures such as static Catalog Views and Dynamic Management Views (DMVs).
The chapter then moves on to best practices – what they are and who says they are best practices. In many cases context is needed, else a best practice can become a bad practice. Some common erroneous best practices are discussed (e.g. a high CXPACKET wait state is always bad). Trusted sources of best practices include TechNet, MSDN, and MVPs – all have vigorous peer reviews.
Next, the runbook is discussed. This is a collection of documents, in various formats, that detail what to do when problems occurs. It contains: inventory, checklists, performance data, procedural and operational guides. It can help troubleshoot issues, ensure business continuity, quick recovery from a disaster, and help pass an audit. The chapter ends with some useful best practices to consider.
This chapter aims to show the need for various checklists, and outlines their content, and some possible methods of collecting the data. The outline content of the runbook should prove invaluable.
For me, the chapter felt a little unstructured, maybe because there is a lot to say on the subject, and the topics overlap and intertwine – but it does contain a lot of useful information. I expected various checklists to be provided, with SQL code, together with what are acceptable checklist data values, and what to do with unacceptable values – but instead you’re told to use the checklists on the web.
Chapter 3 Waits and Queues
“Waits are queues” is a well established technique for identifying problems with your SQL Server databases. When SQL code cannot run, the reason why it is waiting is recorded internally as “wait stats” (e.g. blocking), examining this data will identify your instance’s main problems, allowing you to make targeted improvements.
The chapter examines the DMV sys.dm_os_wait_stats to identify the most common causes of waiting. These waits stats can be correlated with Performance Monitor (perfMon) counters to corroborate the cause of the problem.
Microsoft engineers say the waits and queues technique provides the biggest bang for the buck in identifying problems. A link to Tom Davidson’s excellent waits and queues document is provided, it will help in determining why your system is slow and how to improve it. Additionally, there’s a link to a more recent waits resource from Microsoft’s CSS SQL Server engineers.
Some useful SQL code is provided to identify the top wait stats. A useful supermarket/customers analogy is provided, this is expanded upon when looking at the Execution Model.
The chapter then looks at CPU pressure, how this can be identified from the wait stats (i.e. wait type of SOS_SCHEDULER_YIELDS), and a relatively high value for signal waits. High CPU usage can have many causes including: many compiles/recompiles, missing indexes, and unnecessary sorting. Some recommended perfMon counters are provided, that can help corroborate your findings. Some useful code is given that shows how much CPU has been used, sampled each minute, over the previous 256 minutes, by SQL Server, other processes, and idle processing.
IO pressure is examined next. Night-time IO problems may be due to backups, maintenance, or a virus checker, all of which can be IO intensive. The impact of fragmentation on performance is briefly discussed. SQL code is provided to identify the disks with the worst latency, and most IO stalls. The section ends with a look at some common IO-related perfMon counters, and related wait stats.
Next memory pressure is examined. Sometimes lack of memory can show itself as an IO problem, since there may be insufficient memory to cache the data. The execution plan reserves an amount of memory, but it might not be immediately available, hence causes a wait. The section ends with a look at some common memory-related perfMon counters, and DMVs.
The chapter ends with a look at locks and blocking. Some useful code is provided to identify currently blocked requests.
This chapter provided useful detail on the “waits and queues” technique, a common way of determining why, at the instance level, SQL code is not running. CPU, IO, and memory problems are discussed, together with their related perfMon counters and waits. Useful web links are provided.
The diagram of the calculation of the Total Query Response Time is incorrect, since the wait time includes the signal wait time. The SQL code for the query that calculates "Total Query Response Time" is incorrect (it actually calculates the "Average Query Response Time"). The code uses “ORDER BY 2”, using the column name would make the code more readable. The section on memory is listed under the IO section, I suspect it should be a separate section.
The output is difficult to read (columns are not aligned), sadly this occurs throughout the book. Some wait types are given in upper case, and others in mixed case, consistency would be advisable. The summary says “...you will move on to the tools of the trade...”, however the next chapter is about indexes, chapter 5 is about tools. The text uses the word “Elided”, it’s an uncommon word, it means “Merged” – why use an unusual word when you can use a quotidian one?