Monday 17 September 2012

SCD - Slowly Changing Dimensions transformation

Slowly changing dimension transformation was one of the mystery transformation for me till now, as i never had chance to actually implement it. So today i block my time to implement this transformation by taking a simple example.


The concept of slowly changing dimension is explained very well and in detail on wikipedia, i would request readers to have a look at below link if you are not aware of SCD concept.

http://en.wikipedia.org/wiki/Slowly_changing_dimension

Before starting on our example, let me give you some glance of the SCD




 transformation provided in SSIS 2008. This transformation is used to implement Type 1 and Type 2 SCD, for other types we need to add some custom logic to our ETL. (Type 1: Identify the changes and update the record, no history will be recorded, Type 2: Any change identified we expire the old record and create new record with new values, here we save history information in old record)

OK.. Lets take simple Employee dimension example... in this example i am getting EmployeeFeed.xls file as input for my DimEmployee table (which is my dimension table) and i am using SCD transformation to identify any changes and implement DimEmployee as Type 2.

DimEmployee :

Create  table DimEmployee


EmpKey
int identity(1,1),

EmpId
int,

Name
varchar(100),

Designation
varchar(100),

City
varchar(100),

Phone
varchar(10),

StartDate
datetime,

EndDate
datetime

)


So before we start implementing any SCD, we need to first identify attribute in the dimension table for which we need to track changes. In this example i want to track changes for Designation, City and Phone attributes. I am expecting no change in Name attribute or column.

You might have noticed that there are two columns EmpId and EmpKey why these columns are needed in dimension table??

Ans:
EmpId : This is a Business Key, which uniquely identifies a employee in entire data warehouse system.

EmpKey : Is a Surrogate key, which uniquely identifies record in dimension table, and also its a key to identify historical records.

We also have two more columns StartDate and EndDate, these two columns are used to track time of changes, if EndDate is null it means the record is most recent record in dimension table.

Steps to Implement SCD in a data flow.

1. After we add Source (which is excel in our case EmployeeFeed.xls), we need to add Data conversion transformation, to correct if there are any Data type conflicts.
2. Then we add SCD transformation to Data flow, and this will open SCD wizard, Click next on welcome screen.

3. On Select a Dimensoin Table And Keys page, select your dimension table in this case its DimEmployee, Map all the columns from source excel to destination DimEmployee table. One important thing here we do is identify Business Key, which in our case is EmpID. Then click Next



4. On Slowly changing dimension columns page, we need to select appropriate change type of the Dimension Columns and here we have three types : Fixed Attribute--> No change expected Changing Attribute --> Changes are expected, but no need to record history, same record will be updated.
Historical Attribute--> If this attribute is changed, old record will be expired (by setting End Date as current date) and new record will be inserted with new attribute value

In our example, we don't expect any change for Name Attribute hence we selected this as Fixed Attribute, and rest all (Phone,Designation and City ) will be selected as Historical Attribute. Once we are done Click Next

5. On Historical Attribute Options Screen, we have two option, we can use any flag column to show which record is expired and which is most recent and other option is to use Start Date and End Date, in this example we are using second option, and also selected Start Date and End Date column appropriately.


6. For all other screens in this wizard just select next, and on last screen select Finish.
That's it.. we implemented SCD transformation...your data flow should look like as shown below.

 
If you have noticed, we have two outputs from SCD transformation, New Output and Historical Attribute Output. So if there are any new records which are not present in dimension table those records will be redirected to New Output, and all existing records with some changing attributes will be redirected to Historical Attribute output.
Running the data flow..
I have  9 records in my sample Employee Feed.xls file..







So when i run my data flow for first time, all these 9 records will be redirected to New Output  and will be inserted to DimEmployee Table.
Next, I did some changes in EmployeeFeed.xls, Changes are marked in yellow... so there are 4 records which are changed and 2 new records added.




 If you can see the data flow, two records are redirected through New Output pipeline and 4 moved through Historical Attribute output, so what happens to those 4 records is we update the EndDate to latest date, then again insert them with new changed attrribute keeping EndDate as null. as shown below.



So that's it... hope you have learned something out of this ...



Thanks
venkatesh 






No comments:

Post a Comment