SQL Server Advanced Troubleshooting and Performance Tuning (O'Reilly)

Article Index
SQL Server Advanced Troubleshooting and Performance Tuning (O'Reilly)
Chapters 3 - 6
Chapters 7 - 9
Chapters 10 -13
Chapter 14 on; Conclusion

Page 5 of 5

Chapter 14. Database Schema and Index Analysis

The previous chapters have looked largely at changes outside tables that may improve performance. This chapter now looks deeper at changes to tables, heaps, and indexes that might improve performance.

The chapter opens with a look at database schema analysis. Various catalog views (e.g. sys.tables) can be used to explore the schema, a useful diagram of these is provided together with their relationships. Examining the schema can identify the cause of various database issues including:

Heap tables - tables generally perform better if they have a clustered index
Indexes on uniqueidentifier columns – these often lead to page splits, using NEWSEQUENTIALID() can help
Wide and nonunique clustered indexes – ideally a clustered index should have a narrow, static, unique value
Nonindexed foreign keys – by default SQL Server doesn’t index foreign keys (FKs). These are typically used to join a child to a parent table and for referential integrity. Indexing FKs usually helps improve performance
Redundant indexes – sometimes systems have the same or near-identical indexes, these can increase transaction duration and blocking

In all cases, useful illustrative code is provided to support the discussions and assertions.

Next, there’s a useful section on index analysis. Indexes can help with performance, but they can increase transaction costs, and blocking. Two useful DMVs for analyzing index usage are discussed, namely:

sys.dm_db_index_usage_stats - query level, how many times queries used an index, data grouped by access method (e.g. seeks), and how often updated
sys.dm_db_index_operational_stats - row level, low level stats on access method, locking, latching, IO, etc. great for troubleshooting index performance, locking, latching problems

The author discusses his own utility sp_Index_Analysis, this joins various views together, showing: size of index on disk and in buffer pool, index usage stats, index operational stats, stats info and more.

Perhaps this is the place where the author should make reference to the Database Engine Tuning Advisor, which aims to look at your existing data structures, and the existing query workload (e.g the cache plans or Query Store or a Trace), and determines the optimal data structures for that workload.

Reading this chapter in particular, I could see many of the utilities I’ve created myself, and reading the author was like examining the thoughts I’ve had over the years (e.g. Data Type Mismatch utility). This is another very useful chapter, perhaps it should be merged with the other chapter containing index information. I’m not sure if this is the best position for this chapter within the book.

Chapter 15. SQL Server in Virtualized Environments

The use of virtualization is now relatively common, it adds a layer of complexity but provides various advantages (e.g sharing resources). The author acknowledges this chapter is not comprehensive but should provide enough information to talk with the virtualization infrastructure team about concerns.

The chapter opens with a look at the pros and cons of using virtualization. The advantages include: reduced infrastructure costs, simplified maintenance, resize easily, and can give another layer of HA. The disadvantages include: performance overhead (can be 10%+ on larger systems), a more complex system.

Next, there’s a look at configuration, including:

Capacity planning – Admins often want as many Virtual Machines (VMs) as possible on a host. While it may be acceptable to overprovision resources on non-production systems, on production systems resources should be more considered
CPU configuration – adding CPUs might degrade performance, having fewer CPUs can outperform having more CPUs if the available CPUs can cater for the workloads
Network – it’s suggested there is a separate vNet for the heartbeat check, and infrastructure should prioritize this

The chapter ends with a look at how to troubleshoot in the VMs environment. Both the guest VM and the host load need to be examined – here it’s important to get the infrastructure team involved. Common problems briefly examined, including: insufficient CPU bandwidth, memory pressure, and IO performance.

This chapter provides a useful introduction to virtualization, it’s importance, advantages and disadvantages. A very good piece of advice is to get the infrastructure team involved.

Like the author, I thought overprovisioned meant something like you provision 110% of available resources, but it seems virtualization experts take overprovision to mean when the amount of resource allocated is 3 times greater than the actual resources!

Chapter 16. SQL Server in the Cloud

The cloud is very popular, businesses realize there are advantages in using the cloud over on-premise SQL Servers. Luckily, the same base SQL Server engine is used on both on-premise and the cloud, so many of the tools and techniques for troubleshooting are familiar.

The chapter opens with a high-level look at the cloud platforms. There’s a useful reminder that there isn’t a cloud, it’s just someone else’s data center. All major cloud vendors can provide routine database maintenance and HA.

The cloud providers offer a simple, cheap lift-and-shift migration approach via VMs, however, the author suggests this initial low cost might be misleading, he suggests later when the system needs significant changes there may be expensive re-architecture costs.

There’s a list of some typical differences between on-premise and cloud, including: reliability (outages will happen), and throttling (and be aware of latency).

There’s a useful section on connectivity errors. I must admit it is quite common to see these on new systems or with new users, so this section gives good advice on common problems and their solutions.

Next, we look at SQL Server in cloud VMs. In essence these are similar to virtualized on-premise SQL Servers but hosted by the cloud provided. The capabilities of SQL Server are the same as on-premise, where you have full control over the environment. These are often used in a lift-and-shift migration. Some differences do exist and are discussed (e.g. cloud storage can handle bursts of IO activity).

Following on, we look at Managed Microsoft Azure SQL Services. This is broken into 2 areas, the Azure Managed Instance and the Azure SQL Database. Azure SQL Database provides a database only, which is patched, updated, backed-up, and has HA, all provided automatically, note you are abstracted from the OS, you cannot see the file system etc. Azure Managed Instance provides a halfway house between Azure SQL VM and Azure SQL Database, you can run SQL Agent, and access the OS.

Approaches to troubleshooting are discussed, and typically involve using the Azure portal, XEs, and/or catalog views and DMVs. There is a facility to automatically add required indexes. There’s a brief look at Amazon SQL Server RDS, and Google Cloud SQL.

This topic can be a big area when just considering the Azure SQL Server offering, add in the offerings from other venders and it can quickly get complex and confusing. Perhaps it may have been better to concentrate on the Microsoft Azure cloud offerings.

Appendix A. Wait Types

The appendix provides a central area to review the waits. Details are provided on the typical cause of the wait, together with some possible solutions, finally a reference to the relevant book chapter is included.

Conclusion

This book aims to improve the performance of your SQL Servers, and certainly succeeds. The book is well written with a good flow between the topics, having useful discussions, diagrams, code, links to other chapters, and web-links for further information. Each chapter ends with a short summary, and a VERY useful and instructive troubleshooting checklist.

The author has written several books on SQL Server performance, all are detailed with excellent content. Here, the author aims to update the content to cover the later versions of SQL Server, while taking a practical problem/solution approach. Although the book covers internals, it’s more concise and practical than other offerings.

The author takes a holistic approach to the problem solving, providing cross-references to other methods where possible. The author is not dogmatic, but rather knows things depend on system circumstances. There are some general rules, but you’re advised not to follow these blindly because there are often exceptions too. As always, it’s best to test things out on your own systems.

The book is aimed at various database professionals, but primarily DBAs and database developers. You might need a few years of SQL Server experience to get the most out of this book.

Although the book is largely concerned with SQL Server 2022, 2019, and the cloud, much is applicable to older versions, from SQL Server 2005 onwards. The book’s companion website contains the book’s code and more (e.g. some code for earlier versions of SQL Server).

The book contains plenty of sage advice and practical code, won from years of investigating and fixing SQL Server performance problems.

This is now my favorite SQL Server performance book. If you have SQL Server performance problems, you need this book.

For recommendations of related books see Pick Of The Shelf - SQL Server in our Programmer's Bookshelf section.

To keep up with our coverage of books for programmers, follow @bookwatchiprog on Twitter or subscribe to I Programmer's Books RSS feed for each day's new addition to Book Watch and for new reviews.

The Rust Programming Language, 2nd Ed

Author: Steve Klabnik and Carol Nichols
Publisher: No Starch Press
Date: June 2023
Pages: 560
ISBN: 978-1718503106
Print: 1718503105
Kindle: B0B7QTX8LL
Audience: Systems programmers
Rating: 4.8
Reviewer: Mike James

There's a new edition of what has become the standard text on Rust. Has it matured along with [ ... ]

+ Full Review

Grokking Machine Learning

Author: Luis G. Serrano
Publisher: Manning
Date: December 2021
Pages: 512
ISBN: 978-1617295911
Print: 1617295914
Kindle: B09LK7KBSL
Audience: Python developers interested in machine learning
Rating: 5
Reviewer: Mike James
Another book on machine learning - surely we have enough by now?

+ Full Review

More Reviews

Chapter 2. SQL Server Execution Model and Wait Statistics

When SQL Server executes SQL code, there’s a very useful side-effect that can help determine the main issues with the instance. Code is typically executing on the CPU else it is waiting (e.g. waiting for IO to complete), SQL Server records the types of wait and their duration. Analyzing these waits (called wait stats) to understand the main concerns with the instance is a well-known troubleshooting method.

The chapter explains how SQL queries are executed, and how the various types of waiting can be used to investigate problems with the instance. The chapter opens with a look at the high-level architecture of SQL Server’s major components, including: protocol layer, query processor, storage engine, and the in-memory engine. There’s a useful diagram showing how the components interact to execute a query.

There’s a helpful discussion on how the SQL Operating System (SQLOS) is involved in scheduling, monitoring, exceptions, resource usage etc. The various statuses of the execution of the SQL code are described, these can include:

pending – waiting for worker, often quick, else doing too much

running – executing on scheduler

runnable – waiting for scheduler

suspended – waiting on external event or resource (waiting for page from disk)

spinloop – processing a spinlock, often very quick

The Dynamic Management View (DMV) sys.dm_os_wait_stats can be queried to determine the waits on the instance. The author notes these should be cross-checked with other tools (e.g. Performance Monitor, Extended Events [XEs]) where possible – remember the initial symptoms may mask the real problem. Some useful code is provided to decipher these waits (and filter out innocuous wait types).

Next, there’s a look at some other DMVs associated with executing queries, useful code is provided for each. These include:

sys.dm_os_wait_stats - the waits for the instance

sys.dm_exec_session_wait_stats - from SQL Server 2016, similar to sys.dm_os_wait_stats but for a given session

sys.dm_os_waiting_tasks - shows tasks waiting on suspended queue, especially useful when system overloaded or has blocking

sys.dm_exec_requests - show’s what SQL code is running now

The chapter ends with a look at the Resource Governor. This feature allows different users or groups to have their resources limited e.g. allow application users to use more CPU than report users. These resources include CPU, memory, IO and MAXDOP.

This chapter might initially seem to be a little removed from performance tuning, however, an understanding of the underlying execution model clarifies the wait stats troubleshooting method. Some very useful code is provided to investigate the underlying concerns of the instance, and what’s currently running on the instance.

Chapter 3. Disk Subsystem Performance

The previous two chapters have set the background, this and the next few chapters look at troubleshooting from a resource problem perspective. Traditionally, many SQL Server systems have been IO bound, you would expect a modern system to have average disk times of 1-2ms for log files, and 3-5ms for data files.

The chapter opens with a look at the various components of an IO subsystems, including the buffer pool cache, data disks, and the log file. Data is read from the disks into the buffer pool, where the application interacts with it, any changes are written to the buffer pool before being written to the log, and eventually to the data files. Latches are used to protect data in memory. A useful diagram describes this processing.

The author encourages a holistic approach to IO problems. It is common for the Network team and the DBAs to argue about where the IO problems lie. Various tools can be used to simulate a SQL Server workload, and DiskSpd is discussed. The DMV sys.dm_io_virtual_file_stats is often the starting place to identify the database and data/log file having the biggest problem. It is possible to obtain the queries using the most IO, and these should be investigated for improvement (e.g. missing indexes).

DMV metrics are typically accumulated since the last restart or reboot. It is possible to create a DMV snapshot of values for a given time period. Here, a snapshot of the current DMVs is taken, then the query you want to investigate is run, then a second DMV snapshot is taken, and finally the delta between the 2 snapshots is taken. Code for this is provided.

Additionally, Performance Monitor counters and OS metrics can be investigated. A list of the more important counters is provided, here these mostly relate to the physical disk.

Where possible the DBA should talk with network team, various checks can be undertaken, including:

the VM host is not overprovisioned

HBA queue depth is optimal

vendor’s storage optimization recommendations

The chapter ends with a look at some of the more common IO related waits, what they mean, and how they might be fixed.

This chapter provides a very useful look at how to identify IO related problems, together with their potential solutions. Use code is provided throughout.

Chapter 4. Inefficient Queries

Many of the chapters look at problems relating to components, like CPUs, memory, and IO. In each case, the underlying problem might be inefficient SQL code, this could be due to missing indexes, stale statistics, etc. Inefficient queries can increase IO, memory usage, CPU usage, and blocking.

It may be possible to improve query performance by increasing the amount of hardware, however, while this may be an answer in the short-term, this tends not to be scalable and often the problem can return. The author highlights that performance tuning may not give the best Return on Investment (RoI).

Code is provided to find the top 50 queries using the most CPU, and another for finding the most IO. However, these queries refer to average CPU and average IO usage, I do wonder if these queries should be sorted by total CPU and total IO, rather than the average. If a query takes 20mins of CPU to run, and runs twice, its average duration is 20mins, if another query take 2 mins of CPU to run, and is run 1000 times, its average duration is 2 mins, but if you can improve the latter query by 50% you save 500 mins, whereas improving the first query by 50% you save 20mins, so I think we should focus on the total value rather than the average.

There’s some useful code for looking at related queries, those which are stored in the cache separately, but which really could be the same i.e. they have the same hash. Often you must enable forced parameterization to reuse the plans (which saves space and can improve performance significantly).

There’s a useful reminder that getting performance data can be expensive, so it makes sense to run these queries with the OPTION (MAXDOP 1) hint. DMVs do not record all queries since some are not cached (e.g. those with RECOMPILE). That said, using DMVs is often a useful fast and ‘good enough’ initial method for investigation of performance problems.

For more exact methods, you might want to look at Extended Events (XEs), SQL Traces, and Query Store. these are examined next. The use of SQL Trace has been superseded by XEs, so only XEs are discussed here. Contrary to most information sources, which regards XEs are lightweight, the author suggests XEs can sometimes be demanding on resources, so should be used carefully. Useful code is provided to capture and process queries with high CPU and IO usage.

DMVs that use cached plans can omit queries, and XEs and SQL Traces can be expensive to run. The author suggests using Query Store is the solution. Query Store does have some overhead, especially if you capture details of all the queries (it may be preferable to ignore queries with little resource usage). Brief details are provided on the 6 Query Store reports.

The data in the Query Store can be accessed via system views, these are discussed together with a useful diagram. Again, useful code to obtain the 50 most IO intensive queries is provided (this can be altered for CPU usage etc). There’s a very useful tip about using DBCC CLONE DATABASE, this copies a database’s structure, without its data, but includes its Query Store, allowing analysis of data away from a production database.

This chapter provided a useful overview of various built-in tools for investigating performance, with helpful code provided throughout. I’m not sure why this chapter was placed here, I would have expected chapters on CPU, memory, IO, network, to be adjacent to one another.

Chapter 5. Data Storage and Query Tuning

Understanding how data is stored, should help you understand how data is retrieved and modified, enabling you to identify areas for performance improvements.

First, there’s a look at the 3 types of data storage: rows (traditional), columns (typically optimal for OLAP), and in-memory (typically optimal for OLTP). Most of the chapter focuses on the traditional row-based storage (i.e. heaps, clustered indexes, non-clustered indexes). Some common problems are identified, including: excessive lookups, mismatched data types, and non-SARGable predicates (the latter typically means the index cannot be used). Some tips relating to composite and non-clustered indexes are provided. There’s some useful diagrams and discussions relating to table/index navigation.

Next, there’s a look at index fragmentation, which typically means it takes longer to retrieve the data. Its causes and possible solutions are outlined (e.g. FILLFACTOR of 85% can often reduce page splits). It’s recommended to convert heaps to clustered indexes. The author encourages you to investigate and use Ola Hallengren’s SQL Server Maintenance Solution, which includes index maintenance scripts/jobs.

Database statistics record the density and distribution of column values, typically of indexes. It’s important to have up to date statistics, since they are used by the optimizer to estimate the efficiency of data access, this includes whether and index is used, and how it will be used (i.e. seek, scan, lookup). Sometimes the default statistics update method is suboptimal, and a proactive regular update stats job may be preferable.

From SQL Server 2014 and onwards, there’s a new Cardinality Estimator, this uses the stats and other inputs to estimate the number of rows retrieved. This new Cardinality Estimator typically improves performance, but some queries can regress, these can be changed to use the legacy Cardinality Estimator. SQL Server 2022 can use Cardinality Estimator feedback to dynamically determine the correct Cardinality Estimator to use.

There’s a useful section on how to analyze Execution Plans, showing both row mode and batch mode execution. There’s a selective examination of some of the more useful operators. The author suggests SolarWinds Plan Explorer is a must have free tool for examining execution plans.

Discussing the storage background, leads to a very useful section on common problems and inefficiencies. These include inefficient code, incorrect join type, and excessive key lookups. In each case the problem is discussed, and some potential solutions offered.

The chapter ends with a reiteration that indexes and workloads should be looked at holistically. When I read this, I was expecting it to lead to a discussion of the Database Engine Tuning Advisor (DTA), which can help determine your ‘optimal’ data structures based on a workload – however it was not discussed.

This was a useful chapter, showing how tables, heaps, clustered and non-clustered indexes, can impact performance. Common problems are identified, and some very useful tips and solutions are provided.

Chapter 6. CPU Load

In the past, I’ve found that most SQL Server systems were limited by their IO subsystem, but in recent years, I’m finding that IO subsystems are improving, and the bottleneck is now often related to CPU and/or memory.

A high number of logical reads (i.e. data read from the memory) can use a lot of CPU, and these reads can be due to nonoptimized queries. Similarly, row by row cursor processing can also impact CPU usage. Query Store and the DMV sys.dm_exec_procedure_stats can be used to identify the most CPU intensive queries, these should be examined for improvements (e.g. missing indexes, stale stats).

Some useful scripts for troubleshooting high CPU load are provided (e.g. last 256 mins of server CPU usage). Advice is provided on some common nonoptimized query patterns to look for, including queries that scan millions of rows, sorting, and aggregation.

Next, query compilation and plan caching are discussed – both can impact CPU load. The issue of parameter sniffing is examined, with useful example code, various solutions to address it are examined (e.g. disabled database level parameter sniffing).

Following on, compilation and parameterization is discussed. The plan cache can contain many plans for ad-hoc queries that are really the same. Enabling Forced Parameterization can fix this problem, and reduce compilation/recompilation, and thus CPU load. You should examine the type of plans in the plan cache, to determine if forcing parameterization could be advantageous (code is provided for this).

The chapter ends with a look at Parallelism. Queries can often run faster if they are run across several CPUs. However, there is a cost associated with splitting and then merging/aggregating the results back together. On OLTP systems, this overhead of parallelism can be a problem. Two setting are involved with parallelism:

max degree of parallelism (MAXDOP) - specified the number of CPUs that can be used when a query executes in parallel

cost threshold for parallelism – the optimizer produces a cost for the query, if this is greater than the value of cost threshold for parallelism then the query can run in parallel

On modern systems, the default value for cost threshold for parallelism is 5 (seconds), this is too low, the author suggests a value of 50 should be used. Some DBAs set MAXDOP to 1 to get rid of the parallelism waits, however this only hides the problem, and limits the use of CPUs.

There’s a very useful suggestion of enabled Forced Parameterization and disabling Parameter Sniffing (with the latter being a basic problem of cached plans).

I found this a very useful chapter, especially with the growing number of systems having CPU related concerns.

Chapter 7. Memory Issues

Probably the first thing to note about SQL Server is it likes to use lots of memory – this reduces both IO and recompilations.

The chapter opens with a look at memory usage and configuration. Memory usages can be seen via various Performance Monitor counters, DMVs, and the system_health XE, details of each are provided. There’s a very useful calculation for ‘max server memory (MB)’, but be aware of other software on the box that will also need some memory (e.g. SSRS). The author notes that adding more memory is often the fastest and cheapest way to improve performance, since memory is relatively cheap. Defragmentation and compression can also improve memory usage.

Next, there’s a very useful section on memory clerks, these are very useful for analyzing the different types of memory usage. For example:

memoryclerk_sqlbufferpool – buffer pool memory, usually large

cachestore_phdr – internal objects used during query compilation

cachestore_objcp – stores compiled execution plans for stored procedures, functions etc

cachestore_sqlcp - – stores compiled execution plans for ad-hoc queries, often CPU intensive

Code is provided to calculate how much space each memory clerk has. The value of cachestore_sqlcp should be small compared with memoryclerk_sqlbufferpool, else it suggests there are many plans for ad-hoc queries, and Forced Parameterization can help reduce the memory usage and improve performance.

There’s a useful section on query execution and memory grants. Sometimes the optimizer calculates a bad value for memory to use for the query, this can result in spilling to tempdb with an associated decrease in performance. Later versions of SQL Server have a feedback mechanism to help correct this.

The chapter ends with a look at in-memory OLTP memory usage. It’s noted that in-memory will go readonly if there isn’t enough memory. Ideally, in-memory will keep only the latest data, and historical data can be archived off onto disk, and both in-memory and disk data can be joined via views – details of this are provided in the author’s other book concerning in-memory OLTP.

This chapter contains plenty of code to support the author’s discussions, and as always, there are plenty of useful incidental tips.

Chapter 8. Locking, Blocking, and Concurrency

Locks allow data consistency. This chapter focusses on disk-based b-tree tables.

The chapter opens with a look at the different types of lock, and the major ones are discussed (e.g. X, S, U, IS, IU, IX). There’s a helpful look at lock compatibility (e.g. exclusive locks are incompatible with other locks), before moving on to how transaction isolation levels impact locking behavior. There’s a useful recommendation about using READ COMMITTED SNAPSHOT instead of READ UNCOMMITTED (or NOLOCK), however this uses tempdb, so you need to monitor tempdb usage.

Next, the chapter moves on to blocking issues, where multiple sessions want the same resource and the locks are not compatible. Often the problem is due to inefficient queries acquiring locks unnecessarily – the author gives some useful examples of this, together with some solutions. There’s a very good point about the statement that you see causing the blocking may be incorrect, since the lock may have been acquired earlier. There’s a good overview of the Blocked Process Report, and the author extends this with his own Event Notifications and Blocking Monitoring Framework.

Deadlocks are examined next. These are special cases of blocking, where SQL Server terminates one of the sessions. Perhaps the author could have added that deadlocks are blocking where time will not solve the problem. There’s an excellent look at troubleshooting deadlocks, together with suggestions on how to resolve them. Often, adding a non-clustered index can fix deadlocks. It might have been useful to include code that decodes the deadlock.

The chapter continues with a look at optimistic isolation levels, here old versions of the data that is being modified is stored in tempdb, and other sessions can read this, avoiding blocking. Again, there is a need to monitor tempdb usage. The optimistic isolation levels READ COMMITTED SNAPSHOT and SNAPSHOT are discussed, highlighting their differences.

The chapter ends with a look at some of the more common lock wait types, together with suggestions for the waiting - often it is due to suboptimal queries.

This chapter looked at a very common area of concern. Useful code was provided to illustrate blocking and deadlocks, together with some very helpful solutions. For additional deeper information, it’s suggested you read the author’s book on concerning transactions and locking.

Chapter 9. tempdb Usage and Performance

The tempdb database is typically used by all the databases on the instance, it’s also used by SQL Server itself. If there are problems with tempdb, it can affect all the other database on the instance.

The chapter opens with a discussion of temporary objects, namely temporary tables, and table variables. Best practices for the objects are highlighted (e.g. take advantage of indexing on temporary tables – but be sure to test they are effective). There’s a useful point about using temp tables to break down complex queries, with the benefit that temp tables have automatic statistics created.

Next, there’s a look at some internal tempdb consumers, including version store and spills. Tempdb can be used to store previous versions of rows that are currently being updated, this allows SELECTs to run without blocking modifications. Sometimes, the calculation for the size of memory required by a query is incorrect, this can lead to spilling data to tempdb at runtime, solutions to this are examined.

Some common tempdb issues are discussed, with system page contention being a typical problem, solutions to this are given (e.g. enable trace flag T118). Another issue can be the running out of space, so monitoring is needed.

The chapter ends with some useful tempdb configuration recommendations. These include putting tempdb on a separate fast drive, and create multiple data files (of the same initial size). A rule of thumb is provided to calculate the recommended number of tempdb data files.

Since tempdb is shared by all the databases on the instance, and used by SQL Server itself, it’s important that it is optimized. This chapter provides a look at the common problems together with useful solutions.

Chapter 10. Latches

Latches are similar to locks, locks provide transaction data consistency, whereas latches prevent corruption of data structures in memory. Latches are typically short-lived, but they can be a problem as the load increases.

The chapter opens with a look at what latches are and the different types (i.e. keep, shared, update, exclusive, destroy). When a latch cannot be obtained on an object, a latch wait is generated, examining these waits can give an indication of the problem. The latch-related waits examined are:

PAGEIOLATCH – waiting for data to be read into buffer from disk, often due to poor IO subsystem and/or poor SQL code.

PAGELATCH – waiting because need to update buffer pool and allocation map pages – often due to tempdb system catalog contention

LATCH – all other latch waits not related to buffer pool

Some useful code is provided to illustrate latches that occur due to hotspots on tables where many sessions are making updates. Various solutions to this problem are given (e.g. use in-memory OLTP which doesn’t have latches).

The author states that documentation of latches is relatively sparse, and then provides a useful but brief explanation of other wait latch types.

This chapter provided a useful look at problems that occur relatively infrequently, but when they do occur, they often need to be solved.

Chapter 11. Transaction Log

Each database has a transaction log, this can perform point in time recovery, restore from a crash, and rollback of the current transactions.

The chapter opens with a look at the internals of the transaction log (e.g. Log Sequence Number, Write Ahead Log). There’s a useful walkthrough on data modifications and how this is reflected in the transaction log. Some common log waits are examined, and solutions given (e.g. WRITELOG). The use of implicit and explicit transactions is explored, with their impact on performance and log size. Delayed durability can improve performance but at the cost of some data loss, generally the author prefers to avoid it, and instead focusses on other log improvements.

Next, common problems related to log truncation are discussed. The column log_reuse_wait_desc in the system view sys.databases contains the underlying reason why a log is not truncated. Some of the common values are discussed including: LOG_BACKUP, ACTIVE_TRANSACTION, AVAILABILITY_REPLICA, and ACTIVE_BACKUP_OR_RESTORE.

A relatively new feature Accelerated Database Recovery is discussed. This can significantly speed up recovery time, however it does require more space in the database. The author generally doesn’t enable it, unless it’s specifically needed (e.g. slow database recovery due to long active transactions).

The chapter ends with a look at transaction log throughput. In essence if the log throughput can be improved, then typically query performance is improved too – however, as is often, the author recommends any problems with the underlying query should be fix too. It’s suggested you investigate the log-based waits, the general health of the IO subsystem, the physical location of the files, the size of the transaction writes (often having smaller batches can be better) and hardware. Some useful transaction-based Performance Monitor counters are examined.

This chapter provides plenty suggestions of areas for research that might improve log performance, together with useful code, and the author’s own insights and considered opinions.

Chapter 12. AlwaysOn Availability Groups

AlwaysOn Availability Groups (AOAG) is the most common High Availability (HA) technology. It replicates databases, removing it as a single point of failure. Additionally, scalability can be improved by using secondaries for reading data (this also takes load of the primary).

The chapter opens with an overview of the HA technologies that came before AOAG (e.g. mirroring). There are some useful diagrams showing a 3-Node availability group (AG), where the databases in the AG can failover as a group. There’s a helpful reminder that instance-level objects are not replicated (e.g. SQL Agent jobs), so these must be copied separately.

The major component of AOAG are the various queues, namely the send queue on the primary and the redo queue on the secondary, there are separate queues for each database in the AG. Again, some useful diagrams help the discussions. To improve performance and reduce latency, the queues should be as small as possible. Some very useful code is provided for monitoring the health/state/latency of the AG, together with useful information at the queues. Monitoring is very important and can be used to raise alerts if thresholds are breached.

Certain issues occur when there is synchronous replication between the primary and secondary. Here, transactions on the primary are committed only after they have been committed on the secondary. A synchronous link gives no data loss but it can increase latency. While the primary waits for this confirmation from the secondary, it issues a HADR_SYNC_COMMIT wait. This can be a common wait on OLTP systems, but it might be ‘normal’. However, this waiting on the primary can cause blocking. The author suggests this wait may be reduced by improving network performance, IO on the secondary, and better CPU bandwidth.

There’s a very useful tips about upgrading from SQL Server 2012 or 2014 to at least 2016, since it contains many improvements. The authors research (and included code) suggests upgrading will reduce the HADR_SYNC_COMMIT wait by 67% and CPU load by 33%, without any other changes.

Next, we look at Asynchronous replication, here the primary doesn’t wait for the secondary to commit data before the primary then commits the data. This has the advantage of reduced latency, but at the cost of some potential data loss. The usefulness of using the secondaries for reading data is discussed, together with some of the issues this may cause (e.g. deferred version store cleanup).

The chapter ends with a look at troubleshooting failovers. Various checks relating to IsAlive, IsHealthy and the LeaseTimeout are examined, together with how these might be changed to stop unnecessary failovers. There’s a very useful section on where to look for help with the cause of failovers, this includes AlwaysOn_health XE, system_health XE, SQLDIAG files, cluster log, SQL Server error log, and the windows system logs.

This chapter provided a very practical look at the common AOAG issues, and how you should go about investigating them. Useful code is provided.

Chapter 13. Other Notable Wait Types

We’ve seen from an earlier chapter that wait stats are important, they can be used to identify the main issues with an instance. Throughout the book, in the relevant chapters, the author has explained the common waits that relate to a given topic (e.g. CPU load). Here the author covers other miscellaneous but often common waits, explaining what they mean, and how they might be reduced.

The waits covered include:

ASYNC_NETWORK_IO – where the client can’t consume the data fast enough. This might relate to a slow network, or an incorrect pattern for using data in a dataset

THREADPOOL – typically means there’s not enough worker threads to assign tasks, giving client timeouts. Improving the SQL code can typically reduce this (e.g. remove blocking)

BACKUPIO & BACKUPBUFFER – backups can’t be written fast enough. Solutions include using backup compression, and optimizing BUFFERCOUNT and MAXTRANSFERSIZE values

HTBUILD and Other HT* Waits - may indicate poorly maintained ColumnStore indexes

The chapter ends with a useful overview of where to find more information about waits, their meaning, and potential solution. There’s a valuable reference to Paul Randal’s SQLSkills Wait Type Library (you will quickly get familiar with this site as you investigate waits).

This chapter provided an excellent look at some of the more common wait types that haven’t been covered already in earlier chapters. As always, some very useful discussions and accompanying code is provided.

Chapter 14. Database Schema and Index Analysis

Heap tables - tables generally perform better if they have a clustered index

Indexes on uniqueidentifier columns – these often lead to page splits, using NEWSEQUENTIALID() can help

Wide and nonunique clustered indexes – ideally a clustered index should have a narrow, static, unique value

Nonindexed foreign keys – by default SQL Server doesn’t index foreign keys (FKs). These are typically used to join a child to a parent table and for referential integrity. Indexing FKs usually helps improve performance

Redundant indexes – sometimes systems have the same or near-identical indexes, these can increase transaction duration and blocking

In all cases, useful illustrative code is provided to support the discussions and assertions.

sys.dm_db_index_usage_stats - query level, how many times queries used an index, data grouped by access method (e.g. seeks), and how often updated

sys.dm_db_index_operational_stats - row level, low level stats on access method, locking, latching, IO, etc. great for troubleshooting index performance, locking, latching problems

Chapter 15. SQL Server in Virtualized Environments

Next, there’s a look at configuration, including:

Capacity planning – Admins often want as many Virtual Machines (VMs) as possible on a host. While it may be acceptable to overprovision resources on non-production systems, on production systems resources should be more considered

CPU configuration – adding CPUs might degrade performance, having fewer CPUs can outperform having more CPUs if the available CPUs can cater for the workloads

Network – it’s suggested there is a separate vNet for the heartbeat check, and infrastructure should prioritize this