High-Performance Development - C#, VB, SQL, ASP.NET

This article gives several tips on how to debug regular expressions in ASP.NET applications.

Debugging long and complex regular expressions can be challenging and time-consuming. These tips can make your debugging more effective and shorten your debugging time.

Chunk Long Regular Expressions into Short Ones

Where is the bug in your regular expression? It can be challenging to locate bugs in long complex regular expressions.

Bad Example
public void Parse(string Input)
{
    Match MatchObj;
    string Pattern;

    // "..." stands for the remaining month names.
    Pattern = "(?:January|...|December) [0-9]{1,2}, [0-9]{4} at ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) (AM|PM)";

    MatchObj = Regex.Match(Input, Pattern);
}
Solution

Chunk long regular expressions into short regular expressions.

Start debugging by including only the first pattern and commenting out the remaining patterns. Once you have debugged the first pattern, add the next one.

Better Example
public void Parse(string Input)
{
    Match MatchObj;
    string Pattern;
    string PatternAmPm;
    string PatternDay;
    string PatternHours;
    string PatternMinutes;
    string PatternMonth;
    string PatternSeconds;
    string PatternYear;

    // Each chunk includes the delimiter that follows it.
    // "..." stands for the remaining month names.
    PatternMonth = "(?:January|...|December) ";
    PatternDay = "[0-9]{1,2}, ";
    PatternYear = "[0-9]{4} at ";
    PatternHours = "([0-9]{1,2}):";
    PatternMinutes = "([0-9]{2}):";
    PatternSeconds = "([0-9]{2}) ";
    PatternAmPm = "(AM|PM)";

    Pattern = PatternMonth + PatternDay + PatternYear;
    Pattern += PatternHours + PatternMinutes + PatternSeconds + PatternAmPm;

    MatchObj = Regex.Match(Input, Pattern);
}
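
The "add one chunk at a time" procedure can also be automated. The following is only a sketch: the ChunkDebugger class, its method names, and the shortened month list are made up for illustration and are not part of the original example. It assumes every chunk is a valid regular expression on its own (no group opened in one chunk and closed in another).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChunkDebugger
{
    // Returns the index of the first chunk that breaks the match,
    // or -1 when the concatenation of all chunks matches the input.
    public static int FirstFailingChunk(string Input, string[] Chunks)
    {
        string Pattern = "";

        for (int i = 0; i < Chunks.Length; i++)
        {
            // Grow the pattern by one chunk and retest.
            Pattern += Chunks[i];
            if (!Regex.Match(Input, Pattern).Success)
            {
                return i;
            }
        }
        return -1;
    }

    // Sample usage. The month list is shortened for the demo.
    public static int Demo()
    {
        string[] Chunks =
        {
            "(?:January|February) ", // 0: month
            "[0-9]{1,2}, ",          // 1: day
            "[0-9]{4} at ",          // 2: year
            "([0-9]{1,2}):",         // 3: hours
            "([0-9]{2}):",           // 4: minutes
            "([0-9]{2}) ",           // 5: seconds
            "(AM|PM)"                // 6: AM/PM
        };

        // The seconds are missing, so the minutes chunk (index 4)
        // cannot find its trailing colon.
        return FirstFailingChunk("January 15, 2008 at 12:43 PM", Chunks);
    }
}
```

Demo returns 4 here, which points straight at the minutes chunk. Because the pattern is not anchored, a partial pattern may match somewhere other than the start of the input, so treat the result as a hint rather than proof.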

Dump Match Object

The Regex.Match method returns a Match object containing the parsed elements in the Groups and Captures collections. View the contents of these collections to determine whether the regular expression correctly parsed the input string. However, the Visual Studio (2005) Locals debug window does not display the contents of the Groups and Captures collections.

Solution

Create a Dump method which accepts a Regex Match object as an input parameter, walks through the Groups and Captures collections, and gets the desired data values.

public void Dump(Match MatchObj)
{
    Capture CaptureObj;
    Group GroupObj;
    string Value;

    // Groups[0] is the entire match; the capture groups start at index 1.
    for (int g = 0; g < MatchObj.Groups.Count; g++)
    {
        GroupObj = MatchObj.Groups[g];
        for (int c = 0; c < GroupObj.Captures.Count; c++)
        {
            CaptureObj = GroupObj.Captures[c];
            Value = CaptureObj.Value;
        } // <-- Set breakpoint here.
    }
}

Call the Dump method and pass the Match object immediately after calling Regex.Match:

public bool Parse(string Input)
{
    Match MatchObj;
    string Pattern = "[0-9]*";

    MatchObj = Regex.Match(Input, Pattern);
    Dump(MatchObj);

    return MatchObj.Success;
}

Set a breakpoint in the Dump method immediately after getting the Capture value (location indicated by the <-- in the Dump method).

Start your application in debug mode. When the debugger stops at the breakpoint, view Value in the Locals window. Keep stepping through the loop until you've walked through the entire collection.
Additional Suggestions

You may want to update the Dump method to write the match information to your trace output stream.
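
One possible shape for a trace-writing Dump is sketched below. The RegexDump class and its Format helper are illustrative names, not part of the original code. The sketch assumes the standard System.Diagnostics.Trace class is an acceptable output target; swap in your own logger if your application uses one.

```csharp
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexDump
{
    // Builds one line per capture so the whole match can be read at a glance.
    public static string Format(Match MatchObj)
    {
        StringBuilder Output = new StringBuilder();
        Output.AppendLine("Success: " + MatchObj.Success);

        for (int g = 0; g < MatchObj.Groups.Count; g++)
        {
            Group GroupObj = MatchObj.Groups[g];
            for (int c = 0; c < GroupObj.Captures.Count; c++)
            {
                Capture CaptureObj = GroupObj.Captures[c];
                Output.AppendLine("Group[" + g + "].Captures[" + c + "] = \""
                    + CaptureObj.Value + "\" at index " + CaptureObj.Index);
            }
        }
        return Output.ToString();
    }

    // Sends the formatted match to the trace listeners.
    public static void Dump(Match MatchObj)
    {
        Trace.Write(Format(MatchObj));
    }
}
```

Keeping Format separate from Dump makes it possible to inspect the text in a test or in the debugger without attaching a trace listener.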

Capture Non-Relevant Substrings

Many times you want only a few pieces of data from a string. The remaining items are not required for your application. One approach is to write a regular expression that matches both the relevant and non-relevant substrings but captures only the relevant ones using the "()" capture expressions. If Regex.Match then fails on one of the non-relevant substrings, you don't know where the failure occurred.

Bad Example

Suppose we want to parse the time from a string like: January 15, 2008 at 12:43:04 PM


We could create a regular expression to capture only the time at the end of the string and not capture the date components at the beginning of the string like:

    string Pattern;
    string Pattern1;
    string Pattern2;

    // "..." stands for the remaining month names.
    Pattern1 = "(?:January|...|December) ";
    Pattern2 = "[0-9]{1,2}, [0-9]{4} at ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) (AM|PM)";
    Pattern = Pattern1 + Pattern2;

    MatchObj = Regex.Match(Input, Pattern);
Solution

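The solution is to capture the non-relevant substrings too. A failed match has no captures, so combine this with the chunking tip above: as you add chunks back one at a time, the captures show in the debugger which substring each subexpression matched, and the first chunk that breaks the match shows where the matching process stopped.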

Better Example

We've added capture expressions to the month name, day of month, and year subexpressions:

    string Pattern;
    string Pattern1;
    string Pattern2;

    // "..." stands for the remaining month names.
    Pattern1 = "(January|...|December) ";
    Pattern2 = "([0-9]{1,2}), ([0-9]{4}) at ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) (AM|PM)";
    Pattern = Pattern1 + Pattern2;

    MatchObj = Regex.Match(Input, Pattern);

Capturing delimiters might be useful too.

Use Regex.Match not Regex.IsMatch

Regex.IsMatch gives you a pass/fail on whether the input string matched the pattern. This general result is good once you have created and tested the regular expression. However, when debugging the regular expression you don't know what part of the regular expression failed to match a good input string.

Bad Example
    bool Status;
    
    Status = Regex.IsMatch(Input, Pattern);
Solution

Use Regex.Match instead of Regex.IsMatch and use the Dump solution given above to view the captures.

Better Example
    Match MatchObj;
    
    MatchObj = Regex.Match(Input, Pattern);
    Dump(MatchObj);

Use Named Indexes For Groups and Captures

As you add subcomponents to or remove them from your regular expression, the index positions of the captured strings change in the Groups and Captures collections. If you use numerical indices, you'll waste a lot of time updating them.

A second problem: You can't tell from the numerical indices what the associated value is. Is index "2" the "Month", or the "Year"?

Bad Example

Suppose we are parsing a date string like "05/23/03":

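public DateTime DateParse(string Input)
{
    string Day;
    Match MatchObj;
    string Month;
    string Pattern;
    string Year;

    // Assumed pattern for dates such as "05/23/03".
    Pattern = "([0-9]{2})/([0-9]{2})/([0-9]{2})";

    MatchObj = Regex.Match(Input, Pattern);
    Month = MatchObj.Groups[1].Captures[0].Value;
    Day = MatchObj.Groups[2].Captures[0].Value;
    Year = MatchObj.Groups[3].Captures[0].Value;

    // The two-digit year is used as-is, so "03" becomes year 3.
    return new DateTime(Convert.ToInt32(Year), Convert.ToInt32(Month), Convert.ToInt32(Day));
}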

If we update the regular expression to parse additional substrings at the beginning of the input string we'll need to update the Group collection indices.

Solution

The solution is to use named indexes, such as variables or constants, to index into the Groups collection.

Better Example
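    // Group positions for dates such as "05/23/03" (month first).
    private const int m_Month = 1;
    private const int m_Day = 2;
    private const int m_Year = 3;

    // Assumed pattern for dates such as "05/23/03".
    private const string m_Pattern = "([0-9]{2})/([0-9]{2})/([0-9]{2})";

public DateTime DateParse(string Input)
{
    string Day;
    Match MatchObj;
    string Month;
    string Year;

    MatchObj = Regex.Match(Input, m_Pattern);
    Month = MatchObj.Groups[m_Month].Captures[0].Value;
    Day = MatchObj.Groups[m_Day].Captures[0].Value;
    Year = MatchObj.Groups[m_Year].Captures[0].Value;

    // The two-digit year is used as-is, so "03" becomes year 3.
    return new DateTime(Convert.ToInt32(Year), Convert.ToInt32(Month), Convert.ToInt32(Day));
}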

This tip is valuable for long complex regular expressions.
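
.NET regular expressions also support named groups, written (?<Name>...), which remove the numeric index entirely. The sketch below is an alternative to the constants approach, not the method used in this article. The class name, the group names, and the pattern for dates such as "05/23/03" are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupExample
{
    // Assumed pattern for dates such as "05/23/03".
    // Each group is looked up by name instead of by position.
    private const string Pattern =
        "(?<Month>[0-9]{2})/(?<Day>[0-9]{2})/(?<Year>[0-9]{2})";

    // Returns { month, day, year } as strings, or null when the input does not match.
    public static string[] Split(string Input)
    {
        Match MatchObj = Regex.Match(Input, Pattern);
        if (!MatchObj.Success)
        {
            return null;
        }

        return new string[]
        {
            MatchObj.Groups["Month"].Value,
            MatchObj.Groups["Day"].Value,
            MatchObj.Groups["Year"].Value
        };
    }
}
```

With named groups, adding a new subexpression at the start of the pattern does not disturb the existing lookups, and each lookup documents which value it returns.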

Check Count or Success

Do we assume the input string will be correctly formatted and Regex.Match will always succeed?

Bad practice. We should always assume the input string can contain invalid data and Regex.Match will fail.

Bad Example

Here we assume the match succeeded and we grab the expected values from the Groups and Captures collections.

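    // Group positions for dates such as "05/23/03" (month first).
    private const int m_Month = 1;
    private const int m_Day = 2;
    private const int m_Year = 3;

    // Assumed pattern for dates such as "05/23/03".
    private const string m_Pattern = "([0-9]{2})/([0-9]{2})/([0-9]{2})";

public DateTime DateParse(string Input)
{
    string Day;
    Match MatchObj;
    string Month;
    string Year;

    MatchObj = Regex.Match(Input, m_Pattern);

    // No check that the match succeeded: Captures[0] throws on a failed match.
    Month = MatchObj.Groups[m_Month].Captures[0].Value;
    Day = MatchObj.Groups[m_Day].Captures[0].Value;
    Year = MatchObj.Groups[m_Year].Captures[0].Value;

    return new DateTime(Convert.ToInt32(Year), Convert.ToInt32(Month), Convert.ToInt32(Day));
}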
Solution

Check the Match.Success or the Groups count.

Match.Success tells whether the match passed or failed with a boolean value.

Match.Groups.Count is the number of capture groups plus one for the entire match.

Better Example
    // Group positions for dates such as "05/23/03" (month first).
    private const int m_Month = 1;
    private const int m_Day = 2;
    private const int m_Year = 3;

    // Assumed pattern for dates such as "05/23/03".
    private const string m_Pattern = "([0-9]{2})/([0-9]{2})/([0-9]{2})";

public bool DateParse(string Input, ref DateTime Out)
{
    string Day;
    int DayAsInt;
    Match MatchObj;
    string Month;
    int MonthAsInt;
    string Year;
    int YearAsInt;

    MatchObj = Regex.Match(Input, m_Pattern);
    if (!MatchObj.Success) return false;

    // Or, we can check Groups.Count
    if (MatchObj.Groups.Count != 4) return false;

    Month = MatchObj.Groups[m_Month].Captures[0].Value;
    Day = MatchObj.Groups[m_Day].Captures[0].Value;
    Year = MatchObj.Groups[m_Year].Captures[0].Value;

    MonthAsInt = Convert.ToInt32(Month);
    DayAsInt = Convert.ToInt32(Day);
    YearAsInt = Convert.ToInt32(Year);

    // Note: the DateTime constructor throws if a value is out of range (for example, month 13).
    Out = new DateTime(YearAsInt, MonthAsInt, DayAsInt);

    return true;
}

Ignore Case

Most of the time (in my experience) we want to capture the substrings from the input string and ignore the case of the characters.

If you call Regex.Match, as shown in the Bad Example (below) the default behavior is to match the characters exactly. The match fails because the pattern uses uppercase characters and the input string has lowercase characters.

Also, note that RegexOptions is not specified in the call to Regex.Match. By default, Regex.Match is case-sensitive.

Bad Example
Where Input = "abc";

public bool Parse(string Input)
{
    Match MatchObj;
    MatchObj = Regex.Match(Input, "[A-Z]{3}");

    return MatchObj.Success; // false for "abc"
}
Solution

Use RegexOptions.IgnoreCase, or include both uppercase and lowercase characters in your regular expression pattern.

Better Example
public bool Parse(string Input)
{
    Match MatchObj;

    MatchObj = Regex.Match(Input, "[A-Z]{3}", RegexOptions.IgnoreCase);

    // Or do this:
    MatchObj = Regex.Match(Input, "[a-zA-Z]{3}");

    return MatchObj.Success;
}

Use a Test Framework

The input string to your parsing method can have potentially hundreds of variations. Can your regular expression parse these variations successfully?

Solution

Use a test framework, such as MbUnit, to test all potential variations.
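
The idea behind such tests is a table of input strings paired with the expected result. The sketch below shows that table-driven shape in plain C# without depending on any particular framework's attributes. The PatternCases class and the anchored date pattern are assumptions made for illustration. In a real project each row would become a test case in your framework of choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCases
{
    // Assumed pattern for dates such as "05/23/03".
    public const string Pattern = "^([0-9]{2})/([0-9]{2})/([0-9]{2})$";

    // Runs every input against the pattern and returns the number of
    // cases whose actual result differs from the expected one.
    public static int CountFailures(string[] Inputs, bool[] Expected)
    {
        int Failures = 0;

        for (int i = 0; i < Inputs.Length; i++)
        {
            bool Actual = Regex.Match(Inputs[i], Pattern).Success;
            if (Actual != Expected[i])
            {
                Console.WriteLine("FAILED: \"" + Inputs[i] + "\" expected "
                    + Expected[i] + ", got " + Actual);
                Failures++;
            }
        }
        return Failures;
    }

    // Sample table: one good input and several malformed variations.
    public static int RunSample()
    {
        string[] Inputs = { "05/23/03", "5/23/03", "05-23-03", "" };
        bool[] Expected = { true, false, false, false };

        return CountFailures(Inputs, Expected);
    }
}
```

Each time a new input variation turns up in production, add it to the table so the regular expression is retested against every known case.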