Case
I want to use a Regular Expression to filter incorrect emailaddresses in SSIS.
Solution
There are a couple of options:
first_name last_name email_address cell_phone address_1 NK coupon test
suman kanam suman@gmail.com 123455 123456 1w23e4 1w23e4 1
venkat budati budati@gmail.com 123456 1w23e sdfffldfsj seljsek 2
sss sss ss@.com 123456 12964sdfgh fghjk dfghj 3
1) Add source
Add a Flat File Source Component in your dataflow for the file example above.
I want to use a Regular Expression to filter incorrect emailaddresses in SSIS.
Solution
There are a couple of options:
- Microsoft built a SSIS Sample Component for regex, but it's only available for 2005.
- Konesans built a RegexClean Transformation and a Regular Expression Transformation.
- There is also an open source project named RegExtractor SSIS Component.
- And you can even do it with Regular Expressions in T-SQL.
- But you can also do it your self with a Script Component Transformation
first_name last_name email_address cell_phone address_1 NK coupon test
suman kanam suman@gmail.com 123455 123456 1w23e4 1w23e4 1
venkat budati budati@gmail.com 123456 1w23e sdfffldfsj seljsek 2
sss sss ss@.com 123456 12964sdfgh fghjk dfghj 3
1) Add source
Add a Flat File Source Component in your dataflow for the file example above.
2) Add Script Component
Add a Script Component, type transformation, after your Flat File Source Component.
3) Input Columns
Edit the Script Component and select the email address column as an input column (readonly).
Input Columns tab
4) Outputs
Change the name of the default output port to CorrectEmail and set the
ExclusionGroup to 1. Add an extraoutput port for the IncorrectEmail and
set the ExclusionGroup to 1. Also select the input port as the
SynchronousInputID. Now you have two output ports. One for the correct
and one for the incorrect email address.
Inputs and Outputs tab
5) The Script
Hit the Edit Script button to open the VSTA editor. SSIS creates some standard methods, but we only use Input0_ProcessInputRow.
You can remove the rest of the methods. We are also adding a new method
that validates the email address. It uses a .Net Regex method. You can
search the web for your own regex pattern and change it in the extra
method.
// C# code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Text.RegularExpressions;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
public override void PreExecute()
{
base.PreExecute();
/*
Add your code here for preprocessing or remove if not needed
*/
}
public override void PostExecute()
{
base.PostExecute();
/*
Add your code here for postprocessing or remove if not needed
You can set read/write variables here, for example:
Variables.MyIntVar = 100
*/
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
// Validate the email address
if (IsCorrectEmail(Row.emailaddress))
{
// Redirect the correct email address
Row.DirectRowToCorrectEmail();
}
else
{
// Redirect the incorrect email address
Row.DirectRowToInCorrectEmail();
}
}
public bool IsCorrectEmail(String emailAddress)
{
// The pattern for email
string emailAddressPattern = @"^(([^<>()[\]\\.,;:\s@\""]+"
+ @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@"
+ @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
+ @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
+ @"[a-zA-Z]{2,}))$";
// Create a regex object with the pattern
Regex emailAddressRegex = new Regex(emailAddressPattern);
// Check if it is match and return that value (boolean)
return emailAddressRegex.IsMatch(emailAddress);
}
}
//VB.Net code (SSIS 2008)
Imports
System
Imports
System.Data
Imports
System.Math
Imports
System.Text.RegularExpressions
' Added
Imports
Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports
Microsoft.SqlServer.Dts.Runtime.Wrapper
<microsoft.sqlserver.dts.pipeline.ssisscriptcomponententrypointattribute> _
<clscompliant(false)> _
Public
Class
ScriptMain
Inherits
UserComponent
' Method that will be executed for each row.
Public
Overrides
Sub
Input0_ProcessInputRow(
ByVal
Row
As
Input0Buffer)
' Validate the email address
If
(IsCorrectEmail(Row.Email))
Then
' Redirect the correct email address
Row.DirectRowToCorrectEmail()
Else
' Redirect the incorrect email address
Row.DirectRowToIncorrectEmail()
End
If
End
Sub
' A boolean method that validates an email address
' with a regex pattern.
Public
Function
IsCorrectEmail(
ByVal
emailAddress
As
String
)
As
Boolean
' The pattern for email
Dim
emailAddressPattern
As
String
=
"^(([^<>()[\]\\.,;:\s@\""
]+" + _
"(\.[^<>()[\]\\.,;:\s@\""
]+)*)|(\
""
.+\
""
))@" + _
"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
+ _
"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
+ _
"[a-zA-Z]{2,}))$"
' Create a regex object with the pattern
Dim
emailAddressRegex
As
Regex =
New
Regex(emailAddressPattern)
' Check if it is match and return that value (boolean)
Return
emailAddressRegex.IsMatch(emailAddress)
End
Function
End
Class
</clscompliant(false)></microsoft.sqlserver.dts.pipeline.ssisscriptcomponententrypointattribute>
6) The result
For testing purposes I added two Derived Columns with data viewes on the Data Flow Paths.
vbudati,
alias SSIS Junkie, has an other example in VB.net available. And you can also use Regular Expression to clean column values. More about that is described in this article.
thanks
venkat.budati
This comment has been removed by the author.
ReplyDelete