Slowly changing
dimension transformation was one of the mystery transformation for me
till now, as i never had chance to actually implement it. So today i
block my time to implement this transformation by taking a simple
example.
The concept of slowly changing dimension is explained very well and in detail on wikipedia, i would request readers to have a look at below link if you are not aware of SCD concept.
http://en.wikipedia.org/wiki/Slowly_changing_dimension
Before starting on our example, let me give you some glance of the SCD
transformation provided in SSIS 2008. This transformation is used to implement Type 1 and Type 2 SCD, for other types we need to add some custom logic to our ETL. (Type 1: Identify the changes and update the record, no history will be recorded, Type 2: Any change identified we expire the old record and create new record with new values, here we save history information in old record)
OK.. Lets take simple Employee dimension example... in this example i am getting EmployeeFeed.xls file as input for my DimEmployee table (which is my dimension table) and i am using SCD transformation to identify any changes and implement DimEmployee as Type 2.
DimEmployee :
Create table DimEmployee
(
EmpKey int identity(1,1),
EmpId int,
Name varchar(100),
Designation varchar(100),
City varchar(100),
Phone varchar(10),
StartDate datetime,
EndDate datetime
)
So before we start implementing any SCD, we need to first identify attribute in the dimension table for which we need to track changes. In this example i want to track changes for Designation, City and Phone attributes. I am expecting no change in Name attribute or column.
You might have noticed that there are two columns EmpId and EmpKey why these columns are needed in dimension table??
Ans:
EmpId : This is a Business Key, which uniquely identifies a employee in entire data warehouse system.
EmpKey : Is a Surrogate key, which uniquely identifies record in dimension table, and also its a key to identify historical records.
We also have two more columns StartDate and EndDate, these two columns are used to track time of changes, if EndDate is null it means the record is most recent record in dimension table.
Steps to Implement SCD in a data flow.
1. After we add Source (which is excel in our case EmployeeFeed.xls), we need to add Data conversion transformation, to correct if there are any Data type conflicts.
2. Then we add SCD transformation to Data flow, and this will open SCD wizard, Click next on welcome screen.
In our example, we don't expect any change for Name Attribute hence we selected this as Fixed Attribute, and rest all (Phone,Designation and City ) will be selected as Historical Attribute. Once we are done Click Next
The concept of slowly changing dimension is explained very well and in detail on wikipedia, i would request readers to have a look at below link if you are not aware of SCD concept.
http://en.wikipedia.org/wiki/Slowly_changing_dimension
Before starting on our example, let me give you some glance of the SCD
transformation provided in SSIS 2008. This transformation is used to implement Type 1 and Type 2 SCD, for other types we need to add some custom logic to our ETL. (Type 1: Identify the changes and update the record, no history will be recorded, Type 2: Any change identified we expire the old record and create new record with new values, here we save history information in old record)
OK.. Lets take simple Employee dimension example... in this example i am getting EmployeeFeed.xls file as input for my DimEmployee table (which is my dimension table) and i am using SCD transformation to identify any changes and implement DimEmployee as Type 2.
DimEmployee :
Create table DimEmployee
(
EmpKey int identity(1,1),
EmpId int,
Name varchar(100),
Designation varchar(100),
City varchar(100),
Phone varchar(10),
StartDate datetime,
EndDate datetime
)
So before we start implementing any SCD, we need to first identify attribute in the dimension table for which we need to track changes. In this example i want to track changes for Designation, City and Phone attributes. I am expecting no change in Name attribute or column.
You might have noticed that there are two columns EmpId and EmpKey why these columns are needed in dimension table??
Ans:
EmpId : This is a Business Key, which uniquely identifies a employee in entire data warehouse system.
EmpKey : Is a Surrogate key, which uniquely identifies record in dimension table, and also its a key to identify historical records.
We also have two more columns StartDate and EndDate, these two columns are used to track time of changes, if EndDate is null it means the record is most recent record in dimension table.
Steps to Implement SCD in a data flow.
1. After we add Source (which is excel in our case EmployeeFeed.xls), we need to add Data conversion transformation, to correct if there are any Data type conflicts.
2. Then we add SCD transformation to Data flow, and this will open SCD wizard, Click next on welcome screen.
3. On Select a Dimensoin
Table And Keys page, select your dimension table in this case its
DimEmployee, Map all the columns from source excel to destination
DimEmployee table. One important thing here we do is identify Business
Key, which in our case is EmpID. Then click Next
4. On Slowly changing
dimension columns page, we need to select appropriate change type of the
Dimension Columns and here we have three types : Fixed Attribute--> No change expected
Changing Attribute --> Changes are expected, but no need to record history, same record will be updated.
Historical
Attribute--> If this attribute is changed, old record will be expired
(by setting End Date as current date) and new record will be inserted
with new attribute valueIn our example, we don't expect any change for Name Attribute hence we selected this as Fixed Attribute, and rest all (Phone,Designation and City ) will be selected as Historical Attribute. Once we are done Click Next
5. On Historical
Attribute Options Screen, we have two option, we can use any flag column
to show which record is expired and which is most recent and other
option is to use Start Date and End Date, in this example we
are using second option, and also selected Start Date and End Date column
appropriately.
6. For all other screens in this wizard just select next, and on last screen select Finish.
That's it.. we implemented SCD transformation...your data flow should look like as shown below.
If
you have noticed, we have two outputs from SCD transformation, New
Output and Historical Attribute Output. So if there are any new records
which are not present in dimension table those records will be
redirected to New Output, and all existing records with some changing
attributes will be redirected to Historical Attribute output.
Running the data flow..
I have 9 records in my sample Employee Feed.xls file..
So when i run my data
flow for first time, all these 9 records will be redirected to New
Output and will be inserted to DimEmployee Table.
Next, I did some changes
in EmployeeFeed.xls, Changes are marked in yellow... so there are 4
records which are changed and 2 new records added.
If you can see the data
flow, two records are redirected through New Output pipeline and 4 moved
through Historical Attribute output, so what happens to those 4 records
is we update the EndDate to latest date, then again insert them with
new changed attrribute keeping EndDate as null. as shown below.
So that's it... hope you have learned something out of this ...
Thanks
venkatesh
No comments:
Post a Comment