SQL Server Internals: In-Memory OLTP

Article Index
SQL Server Internals: In-Memory OLTP
Chapters 3 - 6
Chapters 7 - 8, Conclusion

Page 2 of 3

Chapter 3 Row Structure and Multi-Version Concurrency

This chapter describes how the rows are linked together through their indexes. Rows are never updated, instead the current row is marked as invalid and a new version of the row created. Potentially the same row can be updated many times during a transaction, creating many versions of the row. SQL Server knows the correct row to display based on the timestamp in the row's header, and the transaction start time.

Next the row structure itself is described in detail. The row header contains: Begin timestamp, End timestamp, statement Id, index count, padding, and one or more index pointers. The row payload contains the row's data.

The chapter then describes various scenarios to illustrate how different row versions can be produced with the MVCC model, and how the row's timestamps are used together with the transaction start time to determine which rows are valid for the transaction. When a row is first created it has a begin timestamp that reflects the starting transaction, when the row is committed the begin timestamp contains the commit timestamp. When a row is 'deleted', the end timestamp is populated (remember there are no updates, an update is represented as a delete and insert of a given row, as another row version). The older invalid row versions are subsequently removed via the garbage collector.

The various transaction processing phases are discussed, namely:

Processing (various examples provided, read/write and write/write)
Validation (after 'commit', check there are no violations of isolation level properties)
Post-processing (writes commit timestamp to row header)

This chapter describes the structure of rows (header and payload), and its use in the optimistic multi-version concurrency model to obtain the correct row, based on the row's timestamps and the transaction's start time. Examples are provided of read/write and write/write scenarios and how these are resolved.

Chapter 4 Hash and Range Index

The row header contains a set of index pointers, one for each index, which point to the next row in that index, the indexes connect rows together, thus giving structure. The chapter reiterates some of the properties of In-Memory indexes: there must be at least 1 index, there is a maximum of 8 indexes including the primary key, indexes cannot be altered – instead the table must be deleted and recreated, also during recovery all the indexes are recreated. There is no fragmentation since all elements in memory are equally accessible.

The chapter now dives deep into the structure of hash indexes, these are stored in a hash table (array of pointers), where each element is a hash bucket that points to a data row. A hash function is applied to the index column data in each relevant row. Ideally you should have only 1 entry per hash bucket. A detailed discussion of the structure of 2 hash indexes is given, illustrated with useful diagrams. Hash indexes are pre-allocated based on the number of hash buckets specified.

It is suggested that hash indexes should be used when you have targeted equality searches, with high cardinality (few duplicates), the primary key often provides this. The hash index is typically less useful when there are many duplicates (unless you want all the values). It should be noted that the hash is based on all of the index's columns – so partial index use is not possible. It is important to estimate a good value for the number of hash buckets (that hold the hash values), since too few will lead to buckets holding duplicates, and too many will lead to wasted storage - the new DMV sys.dm_db_xtp_hash_index_stats can help with this.

The chapter continues with the deep dive into the structure of range indexes, these use Bw-tree structures which are similar to the traditional B-tree index structures (except the pages are not fixed and can't be changed once built). Range indexes are typically used when you don't know the number of buckets or cardinality. Index pages are never updated, instead they are replaced with new pages, and remapped, and a delta record for the change is produced. The structure of the index page is discussed (e.g. Header area for Pid, page type, right pid, height, page stats, max key – and extra things at leaf and internal pages level). Range indexes are not pre-allocated, but increase in size as more rows are added.

Range index Bw-tree maintenance is discussed next, and involves:

Consolidating delta records (long chains of delta records can degrade performance, the page is rebuilt when the chain is over 16 elements. Old pages are marked for garbage collection)
Splitting a full index page(a page can grow until it reaches 8K, it then splits)
Merging adjacent index pages (when deletes results in a page that is less than 10% full, it is merged with adjacent pages. 3 steps: create new delta page for delete, create new non-leaf page with correct index entries, merge pages and remove deltas)

This chapter provides an in-depth look at the structure of both hash and range indexes, with supported examples and diagrams, additionally, recommendations of when to use each index type are given. Various limitations of the indexes are detailed (e.g. can't alter index).

Chapter 5 Transaction Processing

The chapter opens with a discussion on the importance of preserving the ACID (Atomic, Consistent, Isolated, Durable) properties of transactions. The isolation level shows the interaction of different transactions. Transactions are discussed with reference to autocommit (implicit, wraps around single statements) and explicit (wraps around groups of statements, starts with BEGIN TRAN).

Transaction isolation levels are examined, with reference to changes made by other concurrent transactions. The isolation levels associated with disk based tables are explained together with the common problems they have. Next a similar exercise is undertaken with In-Memory tables. With disk-based tables, data integrity is maintained by the use of locks, but on In-Memory tables there are no lock, instead there are row versions.

Rules are given for cross-container transactions, for transaction consistency. In this case the transaction can be viewed as having 2 parts, one relating to disk-based tables and the other to In-Memory tables. There's a useful grid of the isolation levels that can be used together in cross-container transactions.

The chapter progresses onto choosing the correct isolation level for In-Memory tables. The simplest and commonest for MVCC is snapshot isolation. A new database property (MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT) can be set to automatically upgrade the isolation to snapshot, otherwise a table hint can be supplied. Monitoring of active transactions using the DMV sys.dm_db_xtp_transactions is briefly discussed.

Various transaction processing examples are provided, discussing:

Write-write conflicts (if another transaction tries to update a row that you've updated but yet to commit, the other transaction fails immediately with an update conflict error)
Read-write conflicts (here most other transaction errors are caught after validation phase)

The validation phase occurs when the SQL issues a commit, a check is made to ensure there are no violations of isolation level properties, if there are no problems, the final commit is applied. The post-processing (i.e. commit) phase then updates the timestamps for rows inserted or deleted by the transaction (deletes set the end timestamp, and inserts set the begin timestamp). The unlinking and deletion of old row versions is done by the garbage collector, typically some time later.

The chapter ends with a look at the work of the garbage collector, which removes invalid rows (delete, updates, aborted inserts all create invalid row versions). The invalid rows can degrade scans and take up unnecessarily memory. SQL server uses the oldest running transaction to determine if row versions are still needed.

This chapter discusses transaction management for In-Memory tables, in relation to the supported isolation levels. Sample code is provided and used to discuss the underlying processing. Scenarios that produce potential validation errors are discussed together with how they can be resolved. The validation and post-processing phases, and the work of the garbage collector are explained in detail.

One thing that struck me while reading this chapter, the term ACID isn't defined anywhere in the book, and there isn't an index entry for ACID either.

Chapter 6 Logging, Checkpoint, and Recovery

Durability is critical for recoverability. In-Memory OLTP uses the checkpoint process (where data files have inserted data, delta files have deleted data for the transaction), and the transaction log (contains committed data) to provide recoverability. AlwaysOn availability groups provide additional support.

The chapter looks at the transaction log entries generated via In-Memory OLTP. The log use is potentially much more scalable since multiple concurrent log streams can write to multiple logs, however this is currently limited by SQL Server to 1 log file. Performance is improved since the log processing is more efficient, it only logs table changes not index changes, it combines multiple changes into a single log record, and only writes log rows for committed transactions. Useful example code is provided that compares the logs produced by disk-based and In-Memory table changes.

In-Memory checkpoints aim to reduce the amount of recovery time. If the size of the transaction log increases by more than 512 MB, since the last In-Memory checkpoint, another In-Memory checkpoint is run (even if there are no In-Memory table changes in the log!).

The chapter proceeds with an in-depth look at the following checkpoint processes:
Anatomy of the checkpoint files (checkpoint streams are stored in SQL Server Filestream files, they occur in pairs, the data file contains inserts and the delta file has the deletions)
Continuous checkpointing and events (the checkpoint files undergo various processing e.g. under construction, active, merge target. The DMV sys.dm_db_xtp_checkpoint_files can be used to monitor this)
Merging checkpoint files (adjacent files can be merged when the active content drops below a threshold, thus improving performance. Examples of manual and automatic merges given)
Garbage collection of checkpoint files (when checkpoint files have merged, and have state of merged source, they are no longer needed and can be garbage collected)

The chapter ends with a look at the recovery process. In essence the most recent checkpoint inventory is obtained from the scan of the tail of the transaction log, and the location of this is passed to the OLTP engine which performs parallel recovery. After the checkpoint file completes, the tail of the log is replayed from the timestamp of the last checkpoint, to bring the database back to the state that existed before the crash.

This chapter discusses the importance of logging and checkpoints, their components and structures, and how they can be used in recovery. As always, there is plenty of detail supported by useful example code and diagrams.

<< Prev - Next >>

Last Updated ( Thursday, 01 January 2015 )