Troubleshooting SQL Server - A Guide for the Accidental DBA

Article Index
Troubleshooting SQL Server - A Guide for the Accidental DBA
Chapters 2-4
Chapters 5-7
Chapters 8-9

Page 2 of 4

Chapter 2 Disk I/O Configuration

I/O is often the primary reason why database systems are slow. If you can improve your I/O performance, you’ll automatically improve the performance of most of your queries. The chapter starts by explaining the difference between capacity and throughput (e.g. 100 10GB disks v 1 TB disk). The various common RAID types are explained (0 – stripping, 1- mirroring, 5 [and 6] – striping with parity, 10 – stripping and mirroring). If your system is write-heavy then RAID 1 or 10 is preferred, for a read-heavy system level 5 or 6 is adequate. That said, you can split your files, so the transaction logs and tempdb are often better off on RAID 1 and 10 respectively. Some I/O tools are introduced, including SQLIO, IOmeter, and SQLIOSim. Perhaps some time could have been spent explaining the output produced (but there are links to related articles).

Direct Attached Storage (DAS) and Storage Area Networks (SAN) are discussed and compared. Owing to the complexity of the SAN, troubleshooting DBAs will need to develop a relationship with the SAN guys – there is often a conflict of interest because the SAN guys focus on efficient storage use, whereas SQL guys emphasize performance. Average physical disk access values over 20ms should be investigated. Solid states disks (SSDs) are only briefly mentioned, even though they are becoming increasingly common.

Common causes of disk I/O problems are given (and solutions proposed). These include:

Sizing for capacity instead of I/O performance
Incorrect workload isolation – don’t want to share too much with other systems
Incorrect partition alignment
Incorrect bandwidth using SAN configurations

A good chapter, obviously it can’t cover everything, but it covers the salient points.

(Click on book cover for free pdf download)

Chapter 3 High CPU Utilization

CPU problems are easy to spot but difficult to diagnose, for example missing indexes can result in excessive CPU usage. The real underlying problem can be discovered by using PerfMon, DMVs and SQL Trace. The use of PerfMon is briefly explained, together with the relevant CPU-related counters to monitor. SQL trace is discussed, with reference to investigating queries with high CPU usage. DMVs are similarly briefly mentioned, they accumulate data since the last reboot, they can be examined to retrieve the most intensive CPU related queries, at the individual SQL statement level, and additionally they are relatively unobtrusive. The common CPU-related wait stats are explained (SOS_SCHEDULER_YIELD, CXPACKET, and CMEMTHREAD)

Common causes of high CPU usage are given (and solutions proposed). These include:

Missing indexes – resulting in more data being read
Stale statistics – can result in an inappropriate query plan
Non-SARGable predicates – can prevent an index being used
Implicit conversions – can prevent an index being used
Parameter sniffing – inappropriate query plan is based on atypical parameters
Ad-hoc non-parameterized queries
Inappropriate parallelism – parallelism is often less useful for OLTP systems
Windows Server power saving option – the default is for lower power usage

This is another wide-ranging and detailed chapter, mostly tied together well, but sometimes a bit confusing. There are plenty of useful resources given at the end of this chapter if you want to go deeper.

Chapter 4 Memory Management

The chapter opens with the observation that forums often state SQL Server must have a memory leak since it doesn't release memory, however this is by design, since having data cached in memory makes subsequent access quicker (disk access is much slower than memory access). Memory usage is explained, as is 32-bit and 64-bit architectures. Typically memory problems are diagnosed using PerfMon and DMVs. The common memory related PerfMon counters are explained (buffer cache hit ratio, page life expectancy, free pages etc), but the memory related DMVs are only mentioned in passing.

Common causes of memory problems are given (and solutions proposed). These include:

Memory leak – this is not true, SQL Server is designed to hold onto its memory
Paging problems – large amounts of data are paged out
OS unstable – often because ‘lock pages in memory‘ not set, or unlimited server memory
App Domain marked for unload
Over-provisioned virtual server
Inappropriate memory settings for multiple instances

This chapter is wide-ranging, but less detailed than other chapters, and only really provides guidelines. At times the chapter seems muddled, owing to the tying together the disparate areas. There is less code in this chapter than other chapters.

Having examined the problems in terms of the common resources of disk I/O, CPU, and memory, the book now switches to other problem areas, namely missing indexes, blocking and deadlocks.

<< Prev - Next >>

Last Updated ( Thursday, 23 May 2013 )