Friday, 17 August 2012

Regular Expressions in SSIS

Case
I want to use a Regular Expression to filter incorrect emailaddresses in SSIS.

Solution
There are a couple of options:

  • Microsoft built a SSIS Sample Component for regex, but it's only available for 2005.
  • Konesans built a RegexClean Transformation and a Regular Expression Transformation.
  • There is also an open source project named RegExtractor SSIS Component.
  • And you can even do it with Regular Expressions in T-SQL.
  • But you can also do it your self with a Script Component Transformation
Let's elaborate the last option and use this file as a source:



first_name  last_name email_address         cell_phone  address_1   NK   coupon    test
suman        kanam      suman@gmail.com  123455       123456    1w23e4 1w23e4  1
venkat       budati       budati@gmail.com   123456      1w23e      sdfffldfsj  seljsek   2
sss             sss           ss@.com                  123456      12964sdfgh  fghjk    dfghj     3
1) Add source
Add a Flat File Source Component in your dataflow for the file example above.

 

2) Add Script Component
Add a Script Component, type transformation, after your Flat File Source Component.
 

3) Input Columns
 Edit the Script Component and select the email address column as an input column (readonly).
Input Columns tab

4) Outputs
Change the name of the default output port to CorrectEmail and set the ExclusionGroup to 1. Add an extraoutput port for the IncorrectEmail and set the ExclusionGroup to 1. Also select the input port as the SynchronousInputID. Now you have two output ports. One for the correct and one for the incorrect email address. 
Inputs and Outputs tab
5) The Script
Hit the Edit Script button to open the VSTA editor. SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest of the methods. We are also adding a new method that validates the email address. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.


// C# code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Text.RegularExpressions;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

    public override void PreExecute()
    {
        base.PreExecute();
        /*
          Add your code here for preprocessing or remove if not needed
        */
    }

    public override void PostExecute()
    {
        base.PostExecute();
        /*
          Add your code here for postprocessing or remove if not needed
          You can set read/write variables here, for example:
          Variables.MyIntVar = 100
        */
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Validate the email address
        if (IsCorrectEmail(Row.emailaddress))
        {
            // Redirect the correct email address
            Row.DirectRowToCorrectEmail();
        }
        else
        {
            // Redirect the incorrect email address
            Row.DirectRowToInCorrectEmail();
        }
    }

    public bool IsCorrectEmail(String emailAddress)
    {
        // The pattern for email
        string emailAddressPattern = @"^(([^<>()[\]\\.,;:\s@\""]+"
                        + @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@"
                        + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
                        + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
                        + @"[a-zA-Z]{2,}))$";
        // Create a regex object with the pattern
        Regex emailAddressRegex = new Regex(emailAddressPattern);
        // Check if it is match and return that value (boolean)
        return emailAddressRegex.IsMatch(emailAddress);
    }

}

//VB.Net code (SSIS 2008)
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions    ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
<microsoft.sqlserver.dts.pipeline.ssisscriptcomponententrypointattribute> _
<clscompliant(false)> _
Public Class ScriptMain
    Inherits UserComponent
    ' Method that will be executed for each row.
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Validate the email address
        If (IsCorrectEmail(Row.Email)) Then
            ' Redirect the correct email address
            Row.DirectRowToCorrectEmail()
        Else
            ' Redirect the incorrect email address
            Row.DirectRowToIncorrectEmail()
        End If
    End Sub
    ' A boolean method that validates an email address
    ' with a regex pattern.
    Public Function IsCorrectEmail(ByVal emailAddress As String) As Boolean
        ' The pattern for email
        Dim emailAddressPattern As String = "^(([^<>()[\]\\.,;:\s@\""]+" + _
        "(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" + _
        "((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" + _
        "\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" + _
        "[a-zA-Z]{2,}))$"
        ' Create a regex object with the pattern
        Dim emailAddressRegex As Regex = New Regex(emailAddressPattern)
        ' Check if it is match and return that value (boolean)
        Return emailAddressRegex.IsMatch(emailAddress)
    End Function
End Class
</clscompliant(false)></microsoft.sqlserver.dts.pipeline.ssisscriptcomponententrypointattribute>
6) The result
For testing purposes I added two Derived Columns with data viewes on the Data Flow Paths.
vbudati,alias SSIS Junkie, has an other example in VB.net available. And you can also use Regular Expression to clean column values. More about that is described in this article.
thanks
venkat.budati

1 comment: