Author: Adam Aspin
Audience: Developers needing to bring data into SQL Server
Reviewer: Kay Ewbank
Data integration is quite a narrow topic in the overall area of databases, but it’s one that you have to get right if your database project is going to be successful.
This book gives detailed step by step instructions for taking data to and from a wide variety of data sources, and for how work with the data once you’ve got it into SQL Server. This might seem an easy topic - why do you need to read a heavyweight book to learn how to import and clean up data? The aim here is to provide as much automation as possible, though, to save on time and errors, so this isn’t just about how to carry out a task, but how to include it as part of a T-SQL or VB.NET script.
The first seven chapters of the book look at how to get data into and out of SQL Server, with examples showing how to connect to different data sources, and what problems you’re likely to encounter. Aspin then moves on to the wider aspects of ETL, looking at metadata analysis, data transformation, profiling source data, and ways you can optimize the data load.
The examples are mainly written in T-SQL, with a smattering of MDX where Aspin is showing how to handle Analysis Services data. Where he’s writing SSIS scripts he mainly uses VB.NET, though there are a few examples that use C#. The book covers pretty much everything about using SSIS (SQL Server Integration Services), and also covers other aspects of ETL.
The early chapters each take a single data source, and give step by step instructions on importing and exporting data to and from it, so you’re shown how to script the import and export data from Microsoft Office (Excel and Access), flat files, XML, SQL, SQL Server, and a round-up of miscellaneous sources. The details go further than just import and export; Aspin looks at modifying the data while loading and choosing the subset of data to load. He also shows how to deal with specific problems such as converting a complex Access database to SQL Server, using BULK INSERT, and flattening XML files prior to insert. The chapter on loading SQL files covers Oracle, DB2, MySQL, Sybase, Teradata and Postgre. The miscellaneous options are SQL Server Analysis Services, OLAP, images and documents, Visual FoxPro, dBASE, ODBC, and Windows Management Instrumentation.
Getting the data into SQL Server is the easy bit, though, and Aspin then moves on to making it usable. There’s a good chapter explaining metadata and how to analyze and query metadata from a number of SQL databases, and a thorough coverage of data transformation, including pivoting data, normalizing, consolidating multiple sources, and dealing with the various different types of slowly changing dimensions. Aspin shows how to use T-SQL and alternatively SSIS in each case. Data profiling to gain a high level understanding of what you’ve got or not got in particular columns of data is the next topic, with in-depth coverage of using the SSIS Data Profiling Task alongside alternative methods.
Once the data is all loaded and worked over, you’ll probably need to reload those data values that have changed, a topic Aspin calls Delta Data Management. He looks at the options of working out what data has changed externally and alternatively identifying it during the SQL Server ETL Load, with analysis of what the advantages and disadvantages are. The related topic of change tracking gets its own chapter, as does ways to make ETL faster. The topic of organizing and optimizing data loads is covered particularly well, and could save you a lot of time if you follow the advice. The chapter on Logging and Auditing and using the logging framework is also very good, and brings the main book to a close.
Most database developers would find some elements of this book useful. You’re unlikely ever to need all of it - if you’re working with DB2 data, the section on working with Access will be wasted, for instance, and in general you’ll probably only use one small part of each chapter - the bit relevant to your particular circumstances. However, the scripts and T-SQL are clearly written and explained, and while you may only use a small percentage of the overall material, that’s down to the multiple options rather than any weakness of the book.