Previous

Next


20. Greediness and alternation

  • Remember how I said there was a Traditional NFA and a POSIX NFA engine?
  • Remember also our "earliest match wins" rule?
  • Well, that rule isn't quite right - it should also state that not only should the earliest match win, but of any potential matches at that point, it should be the longest one that wins - which is clear enough, given all the time we've spent looking at greediness.
  • So, if you like, we can re-name this rule to be the "longest leftmost" rule - we want the longest match that occurs the farthest to the left.
  • This "longest leftmost" rule is a requirement for POSIX regex engines - and Traditional NFA engines aren't POSIX NFA engines. A classic example of a Traditional NFA engine's non-POSIXness is with greediness and alternation.
  • Consider the following target:
      It's bad, isn't it?
    
    and use the following regex:
      (is|isn't)
    
  • Here's what would happen:
      Position in the regex | Position in the string
      ----------------------------------------------
       (is|isn't)           | It's bad, isn't it?
        ^                   |^
      ----------------------------------------------
       (is|isn't)           | It's bad, isn't it?
        ^                   | ^
      ----------------------------------------------
       (is|isn't)           | It's bad, isn't it?
        ^                   |  ^
      ----------------------------------------------
    
      etc.
    
      ----------------------------------------------
       (is|isn't)           | It's bad, isn't it?
        ^                   |           ^
      ----------------------------------------------
       (is|isn't)           | It's bad, isn't it?
        ^^                  |           ^^
      ----------------------------------------------
    
  • Which is a match, and a Traditional NFA engine will stop at this point, even though a longer match is possible.
  • Sure, by re-writing the regex to be:
      (isn't|is)
    
    you would get the longest leftmost, but you can see that it would be pretty easy to slip up, and not get that result.

Previous

Next

Andrew Hill

For LinuxSA Meeting, 17 April 2001