Page 2 of 3
Chapter 6 Containers
Containers help SSIS provide structure to Control Flow tasks. It is possible to loop through all the tasks within a container, typically performing the tasks a given number of times with variable input. There are 3 types of container:
- Sequence (logically groups tasks, allows them to be treated as a unit)
- For Loop (enables looping within package until a condition is met)
- Foreach Loop (enables looping through a collection of objects)
Examples are provided of the various containers, and many of their subtypes too, e.g. looping through the rows within a table, or files within a directory.
In addition to containers, Groups can be used to visually group tasks on the screen, but they have no other effect on the relatedness of the grouped tasks.
Chapter 7 Joining Data
Often, related data may be in a different database and even in a different format (e.g. flat file). We may need to join disparate data to get the most from it. The Lookup transform and the Merge transform are typically used to join data.
The Lookup transform can be used to join 2 tables at a time (so to join 3 tables you need 2 Lookup transforms). Various caching options are included, allowing data to be held fully in cache, not in cache, or partially in cache. The use of the cache option is typically determined by available resources, and testing should reveal the correct balance between memory and performance.
The Merge Join transform is typically used when you need to process large volumes of data, the data should be sorted, so you may need to pre-sort the data first.
Often, developers new to SSIS will want to perform their joins within the SQL Server database, a later chapter discusses the advantage and disadvantages of this (note: sometimes all the data is not in a relational database).
There are useful examples provided which demonstrate several Lookup transform solutions based on the various cache modes. Similarly, Merge Join examples are also provided.
The chapter provides standard solutions to typical ‘join’ scenarios. Some useful tips are provided e.g.
- have a trash Union All transform to act as a dummy destination
- use the Data Viewer to see the data content
- reduce input to Lookup transform, since it can be an expensive operation
- there are often no standard SSIS solutions, so experiment!
Chapter 8 Creating an End-To-End Package
In many ways, the previous chapters have provided a good grounding in the basics of SSIS development, the subsequent chapters expand on this. This chapter takes stock of what has been said so far, and provides useful examples of: a basic transformation, mainframe ETL with data cleansing, and making packages dynamic.
The examples are detailed, with plenty of screenshots and explanation, so it’s easy to follow along with. You will need to download certain files from the Wrox web site to follow along. The interesting concept of self-healing solutions is introduced with the examples, an approach that all developers should consider, i.e. anticipate potential problems and introduce corrections for them.
Chapter 9 Scripting in SSIS
Scripting is often undertaken when you can’t find a component that easily does what you require. Scripting is very similar to writing .NET code using C# or VB.NET, the code and classes are created within Visual Studio Tools for Applications (VSTA). The chapter creates a ‘Hello World’ example to get you started and familiar with the scripting environment.
The script task is used in Control Flow, so is called once, providing powerful functionality that is not readily available in other components. The script task editor allows the setting of the script language, entry point in the code, and the setting of read only and editable variables. The underlying script task’s DTS object is examined with reference to its collections, allowing you to interact with a running package. Examples of accessing script variables, retrieving database data into variables, getting files from a FTP server, and serializing data to XML are given, in both C# and VB.NET. This is followed by examples of raising and responding to events within script tasks.
Similarly, the script component is part of the Data Flow, and can be called per row. The component can be of 3 possible types: Source type (provides data), Destination type (consumes data), and Transform type (apply custom transformation to data). Again, plenty of examples are given.
Some helpful coding and troubleshooting tips are provided, including the use of Structured Error Handling (Try/Catch), setting breakpoints and stepping through code, using the Row Count and Data Viewer components to check data, and using various debug windows (autos, locals, watches, and immediate window).
The chapter is especially useful if you come from a .NET (C# or VB.NET) background. These components require explicit coding, as opposed to drag-and-drop and setting properties of the other components. There are plenty of examples to help you learn how to code the solutions to the given problems. The section on debugging techniques should prove useful.
Chapter 10 Advanced Data Cleansing in SSIS
This chapter looks at one of the main uses of SSIS, to cleanse data, which typically occurs as you move data from its source to its destination. It can be argued that every transform is a type of cleansing.
There are many useful step-by-step cleansing examples given, including:
- Derived columns (cleansing blanks, nulls etc)
- Fuzzy Lookup (use external reference data to cleanse)
- Fuzzy grouping (remove duplicates from the same source)
- Data Quality Services (DQS) Cleansing (perform cleansing based on predefined rules)
SSIS provides many potential solutions to cleansing data, from simple derived columns to using rules. For large, more rules-based cleansing, the section on DQS and Master Data Management has useful examples together with references to further information.
There’s a useful tip about the Fuzzy Lookup being an expensive operation, so it should only be applied to rows that don’t match any initial conditions.
Chapter 11 Incremental Loads in SSIS
The examples in the book so far have related to a full load, or truncate and full load. In the real world, since a full load can be expensive, you’re more likely to deal with incremental data i.e. load data since a given date/time. This chapter is concerned with patterns and examples relating to incremental loading.
The first pattern considered is the Control Table Pattern. This uses a control table to determine when the data was last loaded, this is then used to determine what data is new, and thus should be loaded. A step-by-step example of this common pattern is provided.
The second pattern is the SQL Server Change Data Capture (CDC) Pattern. CDC requires the Enterprise Edition of SQL Server. In many ways CDC automates the previous pattern, any relevant changed data is read from the transaction log. The chapter discusses how to set up CDC, and how to use the CDC Control task, CDC source and CDC Splitter in your incremental load solutions.
This is another detailed and useful chapter, with plenty of step-by-step examples.
Chapter 12 Loading a Data Warehouse
This chapter discusses the very common task of extracting data from a source, storing it in a staging area, applying any transformations, and loading it into a data warehouse (or data mart).
The following components are discussed in some detail, with helpful examples:
- Data profiling (used to get an understanding of the data, how clean it is, number of nulls/blanks , data patterns etc)
- Slowly Changing Dimension (recording data that changes over time. Discusses, with examples, the loading of Dimension and Fact tables, and the use of the SCD Wizard)
- Analysis Services Execution DDL (discusses processing of SSAS objects to analyze cubes)
This chapter has a considered overview, with very good examples, of the steps and tasks involved in loading a data warehouse. The importance of planning and research in the initial analysis stages is discussed with reference to data profiling and business users.
Chapter 13 Using the Relational Engine
This chapter explains that some tasks are better performed within the database, while others are better performed in SSIS – it’s important to choose the right tool for the job. In many cases the choice is already made, since many data sources are not in a database!
A useful list of common SQL best practices is discussed, these include
- SELECT * is bad (do you need all the columns?)
- WHERE is your friend (filter data)
- Transform during the Extract
- Use ANDs to extract related data
- Sort in the database
- Modularize (i.e. use Stored Procedures)
- Use Set-based logic (rather than cursors)
There’s a useful point about using database snapshots (these contain database changes made since the snapshot was created) for testing – making it very easy to restore the database. Perhaps this great tip could have been explained in a much wider context, not just in relation to SSIS.
There’s a useful discussion about using the MERGE command, this applies update operations (insert, delete, update) as a single transaction, and is typically better for performance i.e. load data once and apply.
Chapter 14 Accessing Heterogeneous Data
The chapter is concerned with working with data from various non-SQL Server sources. This is a common scenario for SSIS developers, accessing data in Excel, flat files, XML, web services, Oracle etc.
For each of the heterogeneous sources considered, a useful example is provided, with any quirks explained (e.g. 32 bit versions). Links are given to third parties that provide connectivity solutions for specific adapters (e.g. SAP connectors).
This is a useful chapter to consult when connecting to a non-SQL Server data source. I do wonder if it should have been presented earlier in the book.