
[ Unintended comma matching in \d with Python regex ]

I'm trying to build a regex in Python to match strings like "<String, Number>". The regex should reject any string that doesn't have exactly one integer between the comma and the closing bracket (is that called a bracket in English?). However, \d and [0-9] seem to match inner commas, so for example I get a match for

re.match('<.+?,\d+>', '<Year,4,4>')

but, curiously enough, not for

re.match('<.+?,\d+>', '<Year,4.4>')

Am I wrong, or is \d not supposed to match a comma? I really don't think this is the cause, but I'm in Argentina, where the decimal separator is the comma rather than the point. Could Python be using my system locale and treating the comma as part of a "number"? As far as I understand, that shouldn't affect \d, since it shouldn't match anything but [0-9]. Am I right?

Can somebody help me work this out? I'm running Python 2.7.3

Context: if only a string is contained between the < and > characters, the whole thing is saved as a symbol; otherwise only the string before the comma is saved, and the number is used to build a regex.

import re

def compileRegex(self, r=''):
    r += '//'
    symbols = []
    numeratedTracker = r'<(.+?),(\d+)>'
    simpleTraker = r'<(.+?)>'
    preSymbols = re.findall('<.+?>', r)
    # Fill the symbol list and build the tracking regex
    for s in preSymbols:
        # Numerated symbol
        numeratedMatch = re.match(numeratedTracker, s)
        if numeratedMatch:
            symbolAndNumber = s[1:-1].split(',')
            symbols.append(symbolAndNumber[0])
            subregex = '.{' + symbolAndNumber[1] + '}'
        else:
            symbols.append(s[1:-1])
            subregex = '(.+?)'
        # re.sub returns a new string, so keep the result;
        # escape s so that it is matched literally
        r = re.sub(re.escape(s), subregex, r)
    return symbols, r

Thanks!

Answer 1


The \d is not matching the comma. The .+? is matching the first comma, because . matches any character, including a comma. If you don't want to allow commas in the "string" part, exclude them using a regex like r"<[^,]+,\d+>".
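A quick check of that suggestion, as a minimal sketch. The variable names and the sample strings are chosen here for illustration only:

```python
import re

# [^,]+ cannot cross a comma, so the "string" part must end at the first one.
pattern = re.compile(r"<[^,]+,\d+>")

accepted = pattern.match("<Year,4>")    # one integer after the only comma
rejected = pattern.match("<Year,4,4>")  # after ",4" comes "," which is not ">"
dotted = pattern.match("<Year,4.4>")    # "." is neither a digit nor ">"

print(accepted is not None)  # True
print(rejected is None)      # True
print(dotted is None)        # True
```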

Note that the non-greedy qualifier is not helping you here. Using .+? means it will try to match as few characters as possible, but it still will match as many as it needs to in order to make the entire regex match, if it can. Since you are asking it to match \d after a comma, it will still consume the first comma in order to get to a point where it can match ,\d.
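One way to see this is to put capture groups around the two parts of the original pattern and inspect what each one ends up holding. The groups are added here for illustration (they mirror numeratedTracker from the question):

```python
import re

# Same pattern as in the question, with groups around the two parts.
m = re.match(r"<(.+?),(\d+)>", "<Year,4,4>")
print(m.groups())  # ('Year,4', '4')

# With "4.4" there is no way to end on ",<digits>>", so nothing matches.
no_match = re.match(r"<(.+?),(\d+)>", "<Year,4.4>")
print(no_match)  # None
```

The first group holds 'Year,4': the lazy .+? was extended past the first comma because that was the only way for the rest of the pattern to succeed.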

Answer 2


What you've encountered, and what's making your regex fail, is called backtracking.

Look at the steps the regex engine takes here:

#1. (<).+?,\d+>
<Year,4,4>
^
#2. <(.+?),\d+> leads us up to
<Year,4,4>
    ^
#3. <.+?(,\d+)>
<Year,4,4>     # because of the lazy quantifier, the regex gives priority to `,`
      ^        # over repeating `.+?`: as soon as it finds a comma, it tries to
               # leave the `.+?` loop
#4. <.+?,\d+(>)
<Year,4,4>     # failure and backtracking, the regex engine comes back to the 
       x       # last position where it had a choice, aka step 2:
#2. <(.+?),\d+>
<Year,4,4>
    ^        
#5. <(.+?),\d+>
<Year,4,4>     # this time it tries the other possibility: 
      ^        # the first `,` is matched inside the `.+?`
#6. <.+?(,\d+>)
<Year,4,4>
         ^     # an overall match is found

Before declaring failure, a regex engine must try every possibility. Here it could end the .+? loop at either the first or the second comma. Stopping at the first one failed, stopping at the second one worked, so it returns that match.

As BrenBarn said, to avoid this behavior you have to force the match to stop at the first comma, so this is indeed the way to go ([^,] meaning any character but a comma):

<[^,]+,\d+>
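Applied to the context described in the question, this could look roughly like the sketch below. The helper parse_symbol and the names NUMERATED and SIMPLE are hypothetical and not part of the original code; the trailing $ is an added assumption that each token is checked on its own:

```python
import re

# Hypothetical helper, not part of the question's code.
NUMERATED = re.compile(r"<([^,]+),(\d+)>$")  # "<name,integer>" only
SIMPLE = re.compile(r"<([^<>]+)>$")          # anything else between < and >

def parse_symbol(token):
    """Return (name, width) for "<name,width>", else (contents, None)."""
    m = NUMERATED.match(token)
    if m:
        return m.group(1), int(m.group(2))
    m = SIMPLE.match(token)
    if m:
        return m.group(1), None
    return None  # not a "<...>" token at all

print(parse_symbol("<Year,4>"))    # ('Year', 4)
print(parse_symbol("<Year,4,4>"))  # ('Year,4,4', None)
```

With the stricter pattern, "<Year,4,4>" no longer counts as a numerated symbol and falls through to the plain-symbol branch.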