SQL Server Advanced Troubleshooting and Performance Tuning (O'Reilly)
Article Index
SQL Server Advanced Troubleshooting and Performance Tuning (O'Reilly)
Chapters 3 - 6
Chapters 7 - 9
Chapters 10 -13
Chapter 14 on; Conclusion

Chapter 10. Latches

Latches are similar to locks, locks provide transaction data consistency, whereas latches prevent corruption of data structures in memory. Latches are typically short-lived, but they can be a problem as the load increases. 

The chapter opens with a look at what latches are and the different types (i.e. keep, shared, update, exclusive, destroy). When a latch cannot be obtained on an object, a latch wait is generated, examining these waits can give an indication of the problem. The latch-related waits examined are:

  • PAGEIOLATCH – waiting for data to be read into buffer from disk, often due to poor IO subsystem and/or poor SQL code.
  • PAGELATCH – waiting because need to update buffer pool and allocation map pages – often due to tempdb system catalog contention
  • LATCH – all other latch waits not related to buffer pool

Some useful code is provided to illustrate latches that occur due to hotspots on tables where many sessions are making updates. Various solutions to this problem are given (e.g. use in-memory OLTP which doesn’t have latches).

The author states that documentation of latches is relatively sparse, and then provides a useful but brief explanation of other wait latch types.

This chapter provided a useful look at problems that occur relatively infrequently, but when they do occur, they often need to be solved.

Banner

Chapter 11. Transaction Log

Each database has a transaction log, this can perform point in time recovery, restore from a crash, and rollback of the current transactions.

The chapter opens with a look at the internals of the transaction log (e.g. Log Sequence Number, Write Ahead Log). There’s a useful walkthrough on data modifications and how this is reflected in the transaction log. Some common log waits are examined, and solutions given (e.g. WRITELOG). The use of implicit and explicit transactions is explored, with their impact on performance and log size. Delayed durability can improve performance but at the cost of some data loss, generally the author prefers to avoid it, and instead focusses on other log improvements.

Next, common problems related to log truncation are discussed. The column log_reuse_wait_desc in the system view sys.databases contains the underlying reason why a log is not truncated. Some of the common values are discussed including: LOG_BACKUP, ACTIVE_TRANSACTION, AVAILABILITY_REPLICA, and ACTIVE_BACKUP_OR_RESTORE.

A relatively new feature Accelerated Database Recovery is discussed. This can significantly speed up recovery time, however it does require more space in the database. The author generally doesn’t enable it, unless it’s specifically needed (e.g. slow database recovery due to long active transactions).

The chapter ends with a look at transaction log throughput. In essence if the log throughput can be improved, then typically query performance is improved too – however, as is often, the author recommends any problems with the underlying query should be fix too. It’s suggested you investigate the log-based waits, the general health of the IO subsystem, the physical location of the files, the size of the transaction writes (often having smaller batches can be better) and hardware. Some useful transaction-based Performance Monitor counters are examined. 

This chapter provides plenty suggestions of areas for research that might improve log performance, together with useful code, and the author’s own insights and considered opinions.

Chapter 12. AlwaysOn Availability Groups

AlwaysOn Availability Groups (AOAG) is the most common High Availability (HA) technology. It replicates databases, removing it as a single point of failure. Additionally, scalability can be improved by using secondaries for reading data (this also takes load of the primary).

The chapter opens with an overview of the HA technologies that came before AOAG (e.g. mirroring). There are some useful diagrams showing a 3-Node availability group (AG), where the databases in the AG can failover as a group. There’s a helpful reminder that instance-level objects are not replicated (e.g. SQL Agent jobs), so these must be copied separately.

The major component of AOAG are the various queues, namely the send queue on the primary and the redo queue on the secondary, there are separate queues for each database in the AG. Again, some useful diagrams help the discussions. To improve performance and reduce latency, the queues should be as small as possible. Some very useful code is provided for monitoring the health/state/latency of the AG, together with useful information at the queues. Monitoring is very important and can be used to raise alerts if thresholds are breached.

Certain issues occur when there is synchronous replication between the primary and secondary. Here, transactions on the primary are committed only after they have been committed on the secondary. A synchronous link gives no data loss but it can increase latency. While the primary waits for this confirmation from the secondary, it issues a HADR_SYNC_COMMIT wait. This can be a common wait on OLTP systems, but it might be ‘normal’. However, this waiting on the primary can cause blocking. The author suggests this wait may be reduced by improving network performance, IO on the secondary, and better CPU bandwidth. 

There’s a very useful tips about upgrading from SQL Server 2012 or 2014 to at least 2016, since it contains many improvements. The authors research (and included code) suggests upgrading will reduce the HADR_SYNC_COMMIT wait by 67% and CPU load by 33%, without any other changes. 

Next, we look at Asynchronous replication, here the primary doesn’t wait for the secondary to commit data before the primary then commits the data. This has the advantage of reduced latency, but at the cost of some potential data loss. The usefulness of using the secondaries for reading data is discussed, together with some of the issues this may cause (e.g. deferred version store cleanup).

The chapter ends with a look at troubleshooting failovers. Various checks relating to IsAlive, IsHealthy and the LeaseTimeout are examined, together with how these might be changed to stop unnecessary failovers. There’s a very useful section on where to look for help with the cause of failovers, this includes AlwaysOn_health XE, system_health XE, SQLDIAG files, cluster log, SQL Server error log, and the windows system logs. 

This chapter provided a very practical look at the common AOAG issues, and how you should go about investigating them. Useful code is provided. 

Chapter 13. Other Notable Wait Types

We’ve seen from an earlier chapter that wait stats are important, they can be used to identify the main issues with an instance. Throughout the book, in the relevant chapters, the author has explained the common waits that relate to a given topic (e.g. CPU load). Here the author covers other miscellaneous but often common waits, explaining what they mean, and how they might be reduced.

The waits covered include: 

  • ASYNC_NETWORK_IO – where the client can’t consume the data fast enough. This might relate to a slow network, or an incorrect pattern for using data in a dataset
  • THREADPOOL – typically means there’s not enough worker threads to assign tasks, giving client timeouts. Improving the SQL code can tpically reduce this (e.g. remove blocking)
  • BACKUPIO & BACKUPBUFFER – backups can’t be written fast enough. Solutions include using backup compression, and optimizing BUFFERCOUNT and MAXTRANSFERSIZE values
  • HTBUILD and Other HT* Waits - may indicate poorly maintained ColumnStore indexes

The chapter ends with a useful overview of where to find more information about waits, their meaning, and potential solution. There’s a valuable reference to Paul Randal’s SQLSkills Wait Type Library (you will quickly get familiar with this site as you investigate waits).

This chapter provided an excellent look at some of the more common wait types that haven’t been covered already in earlier chapters. As always, some very useful discussions and accompanying code is provided. 



Last Updated ( Wednesday, 24 August 2022 )