On Fri, Aug 17, 2001 at 01:50:48PM -0400, Sweth Chandramouli wrote:
> As far as performance goes, a well-written fully-anchored
> pattern will almost always perform better than a partial pattern. When
> I was at Counterpane, I sent some folks an email describing this; Tina,
> do you still have that anywhere, and do you think it would be OK to
> forward it to the list?

Someone else from Counterpane forwarded it on to me, so I'm including it below as a quick primer on writing efficient regexes for those who might not have spent as much time grokking Friedl's book as they should have. :)

-----Original Message-----
From: Sweth Chandramouli
Sent: Wednesday, May 23, 2001 8:40 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> Would someone who knows a lot about regular expressions (Drew?
> Sweth?) be interested in laying down some basic guidelines on "how
> to not make expensive filters"? Could someone tell me if our
> regular expression engine is greedy or non-greedy, and remind what
> that even means?

Regex engines themselves aren't greedy or not; individual regex quantifiers (*, +, ?), however, can be, and in almost all implementations they are--that means that they will try to match as much as possible while still making the entire regex true. So if you try the pattern "r.*s" against "There are many regexes that are hard to understand", it won't match "regexes", as you might think, but instead "re are many regexes that are hard to unders", because the * is greedy and thus tries to match as many "." characters (which match anything) as possible.
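Since OROMatcher follows Perl's semantics, the same behavior shows up in any backtracking engine; here is a quick sketch using Python's re module (my choice of illustration language, not part of the original message):

```python
import re

s = "There are many regexes that are hard to understand"

# Greedy: ".*" matches as much as it can while still letting the final
# "s" match, so the match runs from the first "r" to the LAST "s".
greedy = re.search(r"r.*s", s)
print(greedy.group())  # re are many regexes that are hard to unders

# Non-greedy (".*?") for contrast: it stops at the FIRST "s" it can.
lazy = re.search(r"r.*?s", s)
print(lazy.group())    # re are many regexes
```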
We're currently using the OROMatcher regex engine, which implements the full Perl regex syntax as of Perl 5.003; that means we have a non-POSIX NFA with some odd optimizations and a few non-regex assertion capabilities such as lookahead (but not lookbehind). That doesn't mean much in this context, other than to explain (somewhat) why we care about "expensive" filters at all--once we go to the Linnaeus engine, which uses a DFA rather than an NFA, the concept of "expensive" should change somewhat (and go away to a large extent).

Anyway. The things that an NFA finds expensive are generally things that make it hard to quickly figure out when to fail when evaluating a specific branch of the regex parse tree; specifically, in approximately descending order of expensiveness:

* Quantifiers, especially ones that appear towards the beginning of the pattern. ".*he quick brown fox jumped over the lazy dogs" matches "The quick brown fox jumped over the lazy dogs", but it does it very poorly, because of greediness. That ".*" first tries to match everything, which it does; that means that the effective pattern being matched the first time through is "The quick brown fox jumped over the lazy dogshe quick brown fox jumped over the lazy dogs", which obviously fails. So then the ".*" has to backtrack, and see if maybe it can be just a _little_ bit less greedy, and only match everything but one character; "The quick brown fox jumped over the lazy doghe quick brown fox jumped over the lazy dogs" also doesn't match the string in question, however, so it has to keep backtracking. It's only when the ".*" gives up on every possibility other than just matching that initial "T" that the regex can finally succeed.
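You can watch that backtracking settle on the single "T" by capturing what the ".*" ends up matching (again a Python sketch; the message is about OROMatcher/Perl, but any backtracking NFA behaves the same way here):

```python
import re

s = "The quick brown fox jumped over the lazy dogs"

# The greedy "(.*)" first swallows the whole string, then backs off one
# character at a time until "he quick" can match what's left over.
m = re.search(r"(.*)he quick", s)
print(repr(m.group(1)))  # 'T' -- all that greed, and it keeps one character
```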
Quantifiers that allow an open-ended number of matches are worse than ones that allow 0 matches, which are worse than ones that allow fixed ranges of matches: "x{2}" (match "xx") is less expensive than "x{1,3}" (match "x" or "xx" or "xxx"), which is less expensive than either "x?" (match "x" or nothing) or "x+" (match "x" or "xx" or "xxx" or "xxxx" or "xxxxx", ad infinitum). For the most part, quantifiers become more expensive the more options they allow, so it would seem like "x?", which only allows two options, is about the same as "x{1,2}"; because the possibility that that regex component doesn't match anything often changes how every subsequent piece of the regex is evaluated, however, allowing 0 matches actually becomes much more expensive the longer the string being matched against is. Note that I didn't include "*" in the ranking above; it has all of the disadvantages of both ? and +, and is thus by far the worst quantifier.

* Unanchored strings. We can see immediately that "dog" doesn't match "does this string match, or doesn't it?", but an NFA can't, and will waste a lot of time trying to match against every three-letter sequence in that phrase; some highly-optimized NFAs (not ours, unfortunately) can realize that they only need to check "dog" against the "doe" that appears at the beginning, and the one in the middle, but even so that is two checks. If we knew that "dog" was the entire string that we should be looking at, we should anchor the regex as "^dog$", which will only have to check once, against the very beginning of the string, and then give up.

* Capturing. Every time the regex engine sees an unescaped parenthesis, it opens up a buffer and, every time it tries a match from there until it sees the corresponding close paren, it pushes a copy of the prospective match into that buffer.
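To make the anchoring point concrete, here is a small Python sketch comparing the unanchored and anchored versions of the "dog" check (both fail, but the anchored one fails after a single attempt instead of one attempt per starting position):

```python
import re

s = "does this string match, or doesn't it?"

# Unanchored: the engine retries "dog" at every starting position in s
# before giving up (even a smart engine tries both "doe" substrings).
unanchored = re.search(r"dog", s)
print(unanchored)  # None

# Anchored: "dog" must now be the entire subject, so one failed attempt
# at the very beginning is enough to reject the whole string.
anchored = re.search(r"^dog$", s)
print(anchored)    # None

print(re.search(r"^dog$", "dog").group())  # dog
```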
This is necessary for allowing backreferences, but completely unnecessary when the parentheses are used only for grouping (as in the "\d+\.(\d+){3}" example from earlier today). There are two workarounds. One, which can be used in cases like this where the quantifier specifies a fixed number of repetitions, is to write out the string manually. The other is to use non-capturing parentheses--"\d+\.(?:\d+){3}". The "?:" tells the regex engine to not bother capturing, but instead just use those parentheses for grouping. I think that using this would be a HUGE performance boost for us; however, no DFA that I know of supports this notation, which means that Linnaeus probably doesn't either, which means that we would have to rewrite any filters that used it for Linnaeus. (It wouldn't be hard to script that, however...)

* Alternation. "foo|bar" requires a check against foo as if it were a full regex, and then a check against bar, which obviously takes twice as long as just checking one or the other (well, almost twice as long, because some info gets cached). We don't have to worry about that, though, because we can't use alternation yet. The pseudo-alternation that we sometimes have to use (e.g. (tcp)?(udp)?(icmp)?) is in some ways worse, however, because not only does it force multiple evaluations, but it also allows for 0-length matches, which cause the efficiency problems noted above, and also opens up the possibility that nothing will match at all.

* Character classes. These are like alternation, but not quite as bad because of how they are implemented; the broader the classes are, however, the more they amplify the performance hit of any quantifiers applied to them--in the "quick brown fox" example above, if the initial "." (which is effectively a character class that matches everything) were replaced with the smaller character class "[Tt]", we would still match, but far, far more quickly, because even at its greediest, the "*" quantifier won't try to capture anything past the first letter.

Combining any of the above is also bad news, and the more of them you have in a single regex, the worse it will perform. The absolute worst thing you can do is something like matching "([\w]+)+", which against the right string could take longer to finish than the time left before the heat death of the universe.

HTH,
Sweth.

-----End Original Message-----

-- 
Sweth Chandramouli ; <svcat_private>
President, Idiopathic Systems Consulting
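The "([\w]+)+" warning at the end of the forwarded message is easy to reproduce. This Python sketch keeps the subject short (18 word characters) so that it actually finishes; every additional word character roughly doubles the time the engine takes to fail:

```python
import re
import time

# Nested quantifiers: when the overall match fails, a backtracking engine
# tries every way of splitting the run of word characters between the
# inner "+" and the outer "+" -- roughly 2^(n-1) partitions for n chars.
subject = "a" * 18 + "!"  # the "!" guarantees the trailing "$" never matches

start = time.perf_counter()
result = re.match(r"([\w]+)+$", subject)
elapsed = time.perf_counter() - start

print(result, round(elapsed, 3))  # None, after an exponential number of tries
```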
This archive was generated by hypermail 2b30 : Mon Aug 20 2001 - 11:59:02 PDT