Re: [loganalysis] Full vs partial pattern matches

From: Sweth Chandramouli (svcat_private)
Date: Mon Aug 20 2001 - 11:01:09 PDT


    On Fri, Aug 17, 2001 at 01:50:48PM -0400, Sweth Chandramouli wrote:
    > 	As far as performance goes, a well-written fully-anchored
    > pattern will almost always perform better than a partial pattern.  When
    > I was at Counterpane, I sent some folks an email describing this; Tina,
    > do you still have that anywhere, and do you think it would be OK to
    > forward it to the list?
    	Someone else from Counterpane forwarded it on to me, so
    I'm including it below as a quick primer on writing efficient regexes
    for those who might not have spent as much time grokking Friedl's book
    as they should have. :)
    
    
    -----Original Message-----
    From: Sweth Chandramouli 
    Sent: Wednesday, May 23, 2001 8:40 PM
    
    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1
    
    > Would someone who knows a lot about regular expressions (Drew?
    > Sweth?) be  interested in laying down some basic guidelines on "how
    > to not make expensive filters"?  Could someone tell me if our
    > regular expression engine is greedy or non-greedy, and remind what
    > that even means?
    	Regex engines themselves aren't greedy or non-greedy; individual
    regex quantifiers (*, +, ?), however, can be, and in almost all
    implementations they are by default--meaning they will match as much
    as possible while still allowing the regex as a whole to succeed.  So
    if you try the pattern "r.*s" against "There are many regexes that
    are hard to understand", it won't match "regexes", as you might
    think, but instead "re are many regexes that are hard to unders",
    because the * is greedy and thus matches as many "." characters
    (which match anything) as possible.
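    	You can see greediness in action with a quick sketch (here in
    Python's re module, a backtracking NFA like OROMatcher, though not
    the engine we actually run):

        import re

        s = "There are many regexes that are hard to understand"

        # Greedy: ".*" grabs as much as it can, then backtracks to the
        # last "s", so the match runs from the first "r" to "unders".
        print(re.search(r"r.*s", s).group(0))
        # -> 're are many regexes that are hard to unders'

        # Non-greedy: ".*?" grabs as little as it can, stopping at the
        # first "s" after the first "r".
        print(re.search(r"r.*?s", s).group(0))
        # -> 're are many regexes'
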
    	We're currently using the OROMatcher regex engine, which implements
    the full Perl regex syntax as of Perl 5.003; that means we have a
    non-POSIX NFA with some odd optimizations and a few non-regex
    assertion capabilities such as lookahead (but not lookbehind).  That
    doesn't matter much in this context, other than to explain (somewhat)
    why we care about "expensive" filters at all--once we move to the
    Linnaeus engine, which uses a DFA rather than an NFA, the concept of
    "expensive" should change somewhat (and go away to a large extent).
    	Anyway.  The things that an NFA finds expensive are generally things
    that make it hard to quickly figure out when to fail when evaluating
    a specific branch of the regex parse tree; specifically, in
    approximately descending order of expensiveness:
    
    * Quantifiers, especially ones that appear towards the beginning of
    the pattern.  ".*he quick brown fox jumped over the lazy dogs"
    matches "The quick brown fox jumped over the lazy dogs", but it does
    it very poorly, because of greediness.  That ".*" first tries to
    match everything, which it does; that means that the effective
    pattern being matched the first time through is "The quick brown fox
    jumped over the lazy dogshe quick brown fox jumped over the lazy
    dogs", which obviously fails.  So then the ".*" has to backtrack, and
    see if maybe it can be just a _little_ bit less greedy, and only
    match everything but one character; "The quick brown fox jumped over
    the lazy doghe quick brown fox jumped over the lazy dogs" also
    doesn't match the string in question, however, so it has to keep
    backtracking.  It's only when the ".*" gives up on every possibility
    other than just matching that initial "T" that the regex can finally
    succeed.  (The timing sketch at the end of this item shows what all
    of that backtracking costs.)
    	Quantifiers that allow an open-ended number of matches are worse
    than ones that allow zero matches, which in turn are worse than ones
    that allow a fixed range of matches: "x{2}" (match "xx") is less
    expensive than "x{1,3}" (match "x", "xx", or "xxx"), which is less
    expensive than either "x?" (match "x" or nothing) or "x+" (match
    "x", "xx", "xxx", "xxxx", and so on, ad infinitum).
    	For the most part, quantifiers become more expensive the more
    options they allow, so it would seem that "x?", which only allows two
    options, should cost about the same as "x{1,2}"; because the
    possibility that a component matches nothing often changes how every
    subsequent piece of the regex has to be evaluated, however, allowing
    zero matches actually becomes much more expensive the longer the
    string being matched against is.
    	Note that I didn't include "*" in the ranking above; it has all of
    the disadvantages of both ? and +, and is thus by far the worst
    quantifier.
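    	To make the "quick brown fox" walk-through concrete, here is a
    minimal timing sketch (again Python's re standing in for OROMatcher;
    the absolute numbers are meaningless, only the ratio matters):

        import re
        import timeit

        tail = "The quick brown fox jumped over the lazy dogs"
        haystack = ("x" * 10000) + tail

        # The leading ".*" swallows the whole haystack first, then
        # backtracks one character at a time until the literal tail
        # can match.
        greedy = re.compile(".*" + tail[1:])
        # The plain literal lets the engine scan for the string
        # directly, with no backtracking.
        literal = re.compile(tail)

        print(timeit.timeit(lambda: greedy.search(haystack), number=100))
        print(timeit.timeit(lambda: literal.search(haystack), number=100))
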
    
    * Unanchored strings.  We can see immediately that "dog" doesn't
    match "does this string match, or doesn't it?", but an NFA can't, and
    will waste a lot of time trying to match against every three-letter
    sequence in that phrase; some highly-optimized NFAs (not ours,
    unfortunately) can realize that they only need to check "dog" against
    the "doe" at the beginning and the one in the middle, but even so
    that is two checks.  If we knew that "dog" was the entire
    string that we should be looking at, we should anchor the regex as
    "^dog$", which will only have to check once, against the very
    beginning of the string, and then give up.
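    	A sketch of the difference (Python's re again; the comments
    describe the conceptual behavior, since real engines optimize some
    of this away):

        import re

        s = "does this string match, or doesn't it?"

        # Unanchored: conceptually, the engine restarts the attempt at
        # every position in the string before finally giving up.
        print(re.search(r"dog", s))    # -> None, ~40 attempts

        # Anchored: "^" pins the attempt to position 0, so the engine
        # fails after a single comparison against "doe".
        print(re.search(r"^dog$", s))  # -> None, one attempt
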
    
    * Capturing.  Every time the regex engine sees an unescaped
    parenthesis, it opens up a buffer and, every time it tries a match
    from there until it sees the corresponding close paren, it pushes a
    copy of the prospective match into that buffer.  This is necessary
    for allowing backreferences, but completely unnecessary for use of
    parentheses as a grouping character (as in the "\d+\.(\d+){3}"
    example from earlier today).  There are two workarounds.  One, which
    can be used in cases like this where the quantifier specifies a
    single fixed count, is to write out the repeated string manually.
    The other is to use non-capturing parentheses--"\d+\.(?:\d+){3}".
    The "?:" tells the regex engine to not bother capturing, but instead
    just use those parentheses for grouping.  I think that using this
    would be a HUGE performance boost for us; however, no DFA that I
    know of supports this notation, which means that Linnaeus probably doesn't
    either, which means that we would have to rewrite any filters that
    used it for Linnaeus.  (It wouldn't be hard to script that,
    however...)
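    	A small illustration of the difference (Python's re supports the
    same "(?:...)" notation):

        import re

        # Capturing: the engine records what each group matched so it
        # can be retrieved later (or backreferenced).
        m = re.search(r"(\d+)\.(\d+)", "192.168")
        print(m.groups())    # -> ('192', '168')

        # Non-capturing: "?:" keeps the grouping but skips the
        # bookkeeping, so nothing is saved.
        m = re.search(r"(?:\d+)\.(?:\d+)", "192.168")
        print(m.groups())    # -> ()
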
    
    * Alternation.  "foo|bar" requires a check against foo as if it were
    a full regex, and then a check against bar, which obviously takes
    twice as long as just checking one or the other (well, almost twice
    as long, because some info gets cached).  We don't have to worry
    about that, though, because we can't use alternation yet.  The
    pseudo-alternation that we sometimes have to use (e.g.
    (tcp)?(udp)?(icmp)?) is in some ways worse, however, because not only
    does it force multiple evaluations, but it also allows for 0-length
    matches, which cause the efficiency problems noted above, and also
    opens up the possibility that nothing will match at all.
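    	The zero-length-match problem is easy to demonstrate (a sketch in
    Python's re, which does support real alternation):

        import re

        # Real alternation: one of the three protocols matches, or the
        # whole pattern fails.
        print(re.match(r"tcp|udp|icmp", "http"))    # -> None

        # Pseudo-alternation: every group is optional, so the pattern
        # happily "succeeds" by matching the empty string.
        m = re.match(r"(tcp)?(udp)?(icmp)?", "http")
        print(repr(m.group(0)))                     # -> ''
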
    
    * Character classes.  These are like alternation, but not quite as
    bad because of how they are implemented; the broader the class,
    however, the worse the performance hit of any quantifier applied to
    it--in the "quick brown fox" example above, if the initial "."
    (which is effectively a character class that matches everything)
    were replaced with the much smaller character class "[Tt]", we would
    still match, but far, far more quickly, because even at its
    greediest, the "*" quantifier won't try to match anything past the
    first letter.
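    	In sketch form (Python's re), narrowing the class bounds how far
    the quantifier can overshoot:

        import re

        s = "The quick brown fox jumped over the lazy dogs"

        # ".*" swallows the whole string and backtracks dozens of times
        # before "he quick" can match.
        print(re.match(r".*he quick", s).group(0))     # -> 'The quick'

        # "[Tt]*" can only swallow the leading "T", so at most one step
        # of backtracking is ever needed.
        print(re.match(r"[Tt]*he quick", s).group(0))  # -> 'The quick'
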
    
    	Combining any of the above is also bad news, and the more of them
    that you have in a single regex, the worse it will perform.  The
    absolute worst thing you can do is something like matching
    "([\w]+)+", which against the right string could take longer to
    finish than the time left before the heat death of the universe. 
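    	You can watch that blowup happen with a deliberately bounded
    sketch (Python's re; don't raise the upper limit much, or you really
    will be waiting a while):

        import re
        import time

        # Nested quantifiers: a run of word characters can be split
        # among the "+"s in exponentially many ways, and on a failing
        # match the engine tries every one of them.
        evil = re.compile(r"(\w+)+!")

        for n in range(16, 25, 2):
            s = "a" * n    # no "!", so every attempt must fail
            t0 = time.perf_counter()
            evil.search(s)
            print(n, round(time.perf_counter() - t0, 3))

        # Each extra character roughly doubles the running time; the
        # equivalent unnested "\w+!" fails on the same strings instantly.
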
    
    	HTH,
    
    	Sweth.
    -----End Original Message-----
    
    -- 
    Sweth Chandramouli ; <svcat_private>
    President, Idiopathic Systems Consulting
    
    
    


