Sunday, 19 August 2012

SSIS – Transformations

SSIS supports numerous transformations that allow you to combine data originating from multiple sources, cleanse the data and give it the shape your data destination expects. Then you can import the data into a single or multiple destinations.

Transformation Description Examples of when Transformation Would be Used
Aggregate Calculates aggregations such as SUM, COUNT, AVG, MIN and MAX based on the values of a given numeric column. This transformation produces additional output records. Adding aggregated information to your output. This can be useful for adding totals and sub-totals to your output.
Audit Includes auditing information, such as computer name where the package runs, package version ID, task name, etc in the data flow. Creates advanced logs which indicate where and when the package was executed, how long it took to run the package and the outcome of execution.
Character Map Performs minor manipulations on string columns. Converts all letters to uppercase, lowercase, reverse bytes, etc. Applying string manipulations prior to loading data into the data warehouse. You can also apply the same manipulations to the data while it is being loaded into the warehouse.
Conditional Split Accepts an input and determines which destination to pipe the data into based on the result of an expression. Cleansing the data to extract specific rows from the source. If a specific column does not conform to the predefined format (perhaps it has leading spaces or zeros), move such records to the error file.
Copy Column Makes a copy of a single or multiple columns which will be further transformed by subsequent tasks in the package. Extracting columns that need to be cleansed of leading / trailing spaces, applying character map transformation to uppercase all data and then load it into the table.
Data Conversion Converts input columns from one data type to another. Converting columns extracted from the data source to the proper data type expected by the data warehouse. Having such transformation options allows us the freedom of moving data directly from its source into the destination without having an intermediary staging database.
Data Mining Query Queries a data mining model. Includes a query builder to assist you with development of Data Mining eXpressions (DMX) prediction queries. Evaluating the input data set against a data mining model developed with Analysis Services.
Derived Column Calculates new column value based on an existing column or multiple columns. Removing leading and trailing spaces from a column. Add title of courtesy (Mr., Mrs., Dr, etc) to the name.
Export Column Exports contents of large columns (TEXT, NTEXT, IMAGE data types) into files. Saving large strings or images into files while moving the rest of the columns into a transactional database or data warehouse.
Fuzzy Grouping Finds close or exact matches between multiple rows in the data source. Adds columns to the output including the values and similarity scores. Cleansing data by translating various versions of the same value to a common identifier. For example, “Dr”, “Dr.”, “doctor”, “M.D.” should all be considered equivalent.
Fuzzy Lookup Compares values in the input data source rows to values in the lookup table. Finds the exact matches as well as those values that are similar. Cleansing data by translating various versions of the same value to a common identifier. For example, “Dr”, “Dr.”, “doctor”, “M.D.” should all be considered equivalent.
Import Column Imports contents of a file and appends to the output. Can be used to append TEXT, NTEXT and IMAGE data columns to the input obtained from a separate data source. This transformation could be useful for web content developers. For example, suppose you offer college courses online. Normalized course meta-data, such as course_id, name, and description is stored in a typical relational table. Unstructured course meta-data, on the other hand, is stored in XML files. You can use Import Column transformation to add XML meta-data to a text column in your course table.
Lookup Joins the input data set to the reference table, view or row set created by a SQL statement to lookup corresponding values. If some rows in the input data do not have corresponding rows in the lookup table then you must redirect such rows to a different output. Obtaining additional data columns. For example, the majority of employee demographic information might be available in a flat file, but other data such as department where each employee works, their employment start date and job grade might be available from a table in relational database.
Merge Merges two sorted inputs into a single output based on the values of the key columns in each data set. Merged columns must have either identical or compatible data types. For example you can merge VARCHAR(30) and VARCHAR(50) columns. You cannot merge INT and DATETIME columns. Combining the columns from multiple data sources into a single row set prior to populating a dimension table in a data warehouse. Using Merge transformation saves the step of having a temporary staging area. With prior versions of SQL Server you had to populate the staging area first if your data warehouse had multiple transactional data sources.
Merge Join Joins two sorted inputs using INNER JOIN, LEFT OUTER JOIN or FULL OUTER JOIN algorithm. You can specify columns used for joining inputs. Combining the columns from multiple data sources into a single row set prior to populating a dimension table in a data warehouse. Using Merge Join transformation saves the step of having a temporary staging area. With prior versions of SQL Server you had to populate the staging area first if your data warehouse had multiple transactional data sources. Note that Merge and Merge Join transformations can only combine two data sets at a time. However, you could use multiple Merge Join transformations to include additional data sets.
Multicast Similar to the conditional split transformation, but the entire data set is piped to multiple destinations. Populating the relational warehouse as well as the source file with the output of a derived column transformation.
OLEDB Command Runs a SQL command for each input data row. Normally your SQL statement will include a parameter (denoted by the question mark), for example: UPDATE employee_source SET has_been_loaded=1 WHERE employee_id=? Setting the value of a column with BIT data type (perhaps called “has_been_loaded”) to 1 after the data row has been loaded into the warehouse. This way the subsequent loads will only attempt importing the rows that haven’t made it to the warehouse as of yet.
Percentage Sampling Loads only a subset of your data, defined as the percentage of all rows in the data source. Note that rows are chosen randomly. Limiting the data set during development phases of your project. Your data sources might contain billions of rows. Processing cubes against the entire data set can be prohibitively lengthy. If you’re simply trying to ensure that your warehouse functions properly and data values on transactional reports match the values obtained from your Analysis Services cubes you might wish to only load a subset of data into your cubes.
Pivot Pivots the normalized data set by certain column to create a more easily readable output. Similar to PIVOT command in Transact-SQL. You can think of this transformation as converting rows into columns. For example if your input rows have customer, account number and account balance columns the output will have the customer and one column for each account. Creating a row set that displays the table data in a more user-friendly format. The data set could be consumed by a web service or could be distributed to users through email.
Row count Counts the number of transformed rows and store in a variable. Determining the total size of your data set. You could also execute a different set of tasks based on the number of rows you have transformed. For example, if you increase the number of rows in your fact table by 5% you could perform no maintenance. If you increase the size of the table by 50% you might wish to rebuild the clustered index.
Row sampling Loads only a subset of your data, defined as the number of rows. Note that rows are chosen randomly. Limiting the data set during development phases of your project. Your data warehouse might contain billions of rows. Processing cubes against the entire data set can be prohibitively lengthy. If you’re simply trying to ensure that your warehouse functions properly and data values on transactional reports match the values obtained from your Analysis Services cubes you might wish to only load a subset of data into your cubes.
Script Component Every data flow consists of three main components: source, destination and transformation. Script Component allows you to write transformations for otherwise un-supported source and destination file formats. Script component also allows you to perform transformations not directly available through the built-in transformation algorithms. Custom transformations can call functions in managed assemblies, including .NET framework. This type of transformation can be used when the data source (or destination) file format cannot be managed by typical connection managers. For example, some log files might not have tabular data structures. At times you might also need to parse strings one character at a time to import only the needed data elements.
Much like Script Task the Script Component transformation must be written using Visual Basic .NET.
Slowly Changing Dimension Maintains historical values of the dimension members when new members are introduced. Useful for maintaining dimension tables in a data warehouse when maintaining historical dimension member values is necessary.
Sort Sorts input by column values. You can sort the input by multiple columns in either ascending or descending order. The transformation also allows you to specify the precedence of columns used for sorting. This transformation could also discard the rows with duplicate sort values. Ordering the data prior to loading it into a data warehouse. This could be useful if you’re ordering your dimension by member name values as opposed to sorting by member keys. You can also use Sort transformation prior to feeding the data as the input to the Merge Join or Merge transformation.
Term Extraction Extracts terms (nouns and noun phrases) from the input text into the transformation output column. Processing large text data and extracting main concepts. For example, you could extract the primary terms used in this section of SQLServerPedia by feeding the Term Extraction transformation the text column containing the entire section.
Term Lookup Extracts terms from the input column with TEXT data type and match them with same or similar terms found in the lookup table. Each term found in the lookup table is scanned for in the input column. If the term is found the transformation returns the value as well as the number of times it occurs in the row. You can configure this transformation to perform case-sensitive search. Analyzing large textual data for specific terms. For example, suppose you accept email feedback for latest version of your software. You might not have time to read through every single email messages that comes to the generic inbox. Instead you could use this task to look for specific terms of interest.
Union ALL Combines multiple inputs into a single output. Rows are sorted in the order they’re added to the transformation. You can ignore some columns from each output, but each output column must be mapped to at least one input column. Import data from multiple disparate data sources into a single destination. For example, you could extract data from mail system, text file, Excel spreadsheet and Access database and populate a SQL Server table. Unlike Merge and Merge Join transformations Union ALL can accept more than two inputs.
Unpivot Opposite of Pivot transformation, Unpivot coverts columns into rows. It normalizes the input data set that has many duplicate values in multiple columns by creating multiple rows that have the same value in a single column.
For example if your input has a customer name and a separate column for checking and savings’ accounts Unpivot can transform it into a row set that has customer, account and account balance columns.
Massaging a semi-structured input data file and convert it into a normalized input prior to loading data into a warehouse.

No comments:

Post a Comment