On Fri, Aug 17, 2001 at 01:50:48PM -0400, Sweth Chandramouli wrote:
> As far as performance goes, a well-written fully-anchored
> pattern will almost always perform better than a partial pattern. When
> I was at Counterpane, I sent some folks an email describing this; Tina,
> do you still have that anywhere, and do you think it would be OK to
> forward it to the list?

Someone else from Counterpane forwarded it on to me, so I'm including it below as a quick primer on writing efficient regexes for those who might not have spent as much time grokking Friedl's book as they should have. :)

-----Original Message-----
From: Sweth Chandramouli
Sent: Wednesday, May 23, 2001 8:40 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> Would someone who knows a lot about regular expressions (Drew?
> Sweth?) be interested in laying down some basic guidelines on "how
> to not make expensive filters"? Could someone tell me if our
> regular expression engine is greedy or non-greedy, and remind what
> that even means?

Regex engines themselves aren't greedy or not; individual regex quantifiers (*, +, ?), however, can be, and in almost all implementations they are--that means that they will try to match as much as possible while still making the entire regex true. So if you try the pattern "r.*s" against "There are many regexes that are hard to understand", it won't match "regexes", as you might think, but instead "re are many regexes that are hard to unders", because the * is greedy and thus tries to match as many "." characters (which match anything) as possible.
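Since OROMatcher follows Perl's semantics, the same behavior shows up in any backtracking engine; here is a quick sketch using Python's re module (my choice of illustration language, not part of the original message):

```python
import re

s = "There are many regexes that are hard to understand"

# Greedy: ".*" matches as much as it can while still letting the final
# "s" match, so the match runs from the first "r" to the LAST "s".
greedy = re.search(r"r.*s", s)
print(greedy.group())  # re are many regexes that are hard to unders

# Non-greedy (".*?") for contrast: it stops at the FIRST "s" it can.
lazy = re.search(r"r.*?s", s)
print(lazy.group())    # re are many regexes
```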
We're currently using the OROMatcher regex engine, which implements the full Perl regex syntax as of Perl 5.003; that means we have a non-POSIX NFA with some odd optimizations and a few non-regex assertion capabilities such as lookahead (but not lookbehind). That doesn't mean much in this context, other than to explain (somewhat) why we care about "expensive" filters at all--once we go to the Linnaeus engine, which uses a DFA rather than an NFA, the concept of "expensive" should change somewhat (and go away to a large extent).

Anyway. The things that an NFA finds expensive are generally things that make it hard to quickly figure out when to fail when evaluating a specific branch of the regex parse tree; specifically, in approximately descending order of expensiveness:

* Quantifiers, especially ones that appear towards the beginning of the pattern. ".*he quick brown fox jumped over the lazy dogs" matches "The quick brown fox jumped over the lazy dogs", but it does it very poorly, because of greediness. That ".*" first tries to match everything, which it does; that means that the effective pattern being matched the first time through is "The quick brown fox jumped over the lazy dogshe quick brown fox jumped over the lazy dogs", which obviously fails. So then the ".*" has to backtrack, and see if maybe it can be just a _little_ bit less greedy, and only match everything but one character; "The quick brown fox jumped over the lazy doghe quick brown fox jumped over the lazy dogs" also doesn't match the string in question, however, so it has to keep backtracking. It's only when the ".*" gives up on every possibility other than just matching that initial "T" that the regex can finally succeed.
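You can watch that backtracking settle on the single "T" by capturing what the ".*" ends up matching (again a Python sketch; the message is about OROMatcher/Perl, but any backtracking NFA behaves the same way here):

```python
import re

s = "The quick brown fox jumped over the lazy dogs"

# The greedy "(.*)" first swallows the whole string, then backs off one
# character at a time until "he quick" can match what's left over.
m = re.search(r"(.*)he quick", s)
print(repr(m.group(1)))  # 'T' -- all that greed, and it keeps one character
```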
Quantifiers that allow an open-ended number of matches are worse than ones that allow 0 matches, which are worse than ones that allow fixed ranges of matches: "x{2}" (match "xx") is less expensive than "x{1,3}" (match "x" or "xx" or "xxx"), which is less expensive than either "x?" (match "x" or nothing) or "x+" (match "x" or "xx" or "xxx" or "xxxx" or "xxxxx", ad infinitum). For the most part, quantifiers become more expensive the more options they allow, so it would seem like "x?", which only allows two options, is about the same as "x{1,2}"; because the possibility that that regex component doesn't match anything often changes how every subsequent piece of the regex is evaluated, however, allowing 0 matches actually becomes much more expensive the longer the string being matched against is. Note that I didn't include "*" in the ranking above; it has all of the disadvantages of both ? and +, and is thus by far the worst quantifier.

* Unanchored strings. We can see immediately that "dog" doesn't match "does this string match, or doesn't it?", but an NFA can't, and will waste a lot of time trying to match against every three-letter sequence in that phrase; some highly-optimized NFAs (not ours, unfortunately) can realize that they only need to check "dog" against the "doe" that appears at the beginning, and the one in the middle, but even so that is two checks. If we knew that "dog" was the entire string that we should be looking at, we should anchor the regex as "^dog$", which will only have to check once, against the very beginning of the string, and then give up.

* Capturing. Every time the regex engine sees an unescaped parenthesis, it opens up a buffer and, every time it tries a match from there until it sees the corresponding close paren, it pushes a copy of the prospective match into that buffer.
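To make the anchoring point concrete, here is a small Python sketch comparing the unanchored and anchored versions of the "dog" check (both fail, but the anchored one fails after a single attempt instead of one attempt per starting position):

```python
import re

s = "does this string match, or doesn't it?"

# Unanchored: the engine retries "dog" at every starting position in s
# before giving up (even a smart engine tries both "doe" substrings).
unanchored = re.search(r"dog", s)
print(unanchored)  # None

# Anchored: "dog" must now be the entire subject, so one failed attempt
# at the very beginning is enough to reject the whole string.
anchored = re.search(r"^dog$", s)
print(anchored)    # None

print(re.search(r"^dog$", "dog").group())  # dog
```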
This is necessary for allowing backreferences, but completely unnecessary when the parentheses are used only for grouping (as in the "\d+\.(\d+){3}" example from earlier today). There are two workarounds. One, which can be used in cases like this where the quantifier specifies a fixed number of repetitions, is to write out the string manually. The other is to use non-capturing parentheses--"\d+\.(?:\d+){3}". The "?:" tells the regex engine to not bother capturing, but instead just use those parentheses for grouping. I think that using this would be a HUGE performance boost for us; however, no DFA that I know of supports this notation, which means that Linnaeus probably doesn't either, which means that we would have to rewrite any filters that used it for Linnaeus. (It wouldn't be hard to script that, however...)

* Alternation. "foo|bar" requires a check against foo as if it were a full regex, and then a check against bar, which obviously takes twice as long as just checking one or the other (well, almost twice as long, because some info gets cached). We don't have to worry about that, though, because we can't use alternation yet. The pseudo-alternation that we sometimes have to use (e.g. (tcp)?(udp)?(icmp)?) is in some ways worse, however, because not only does it force multiple evaluations, but it also allows for 0-length matches, which cause the efficiency problems noted above, and also opens up the possibility that nothing will match at all.

* Character classes. These are like alternation, but not quite as bad because of how they are implemented; the broader the classes are, however, the more they amplify the performance hit of any quantifiers applied to them--in the "quick brown fox" example above, if the initial "." (which is effectively a character class that matches everything) were replaced with the smaller character class "[Tt]", we would still match, but far, far more quickly, because even at its greediest, the "*" quantifier won't try to capture anything past the first letter.

Combining any of the above is also bad news, and the more of them you have in a single regex, the worse it will perform. The absolute worst thing you can do is something like matching "([\w]+)+", which against the right string could take longer to finish than the time left before the heat death of the universe.

HTH,
Sweth.

-----End Original Message-----

-- 
Sweth Chandramouli ; <svcat_private>
President, Idiopathic Systems Consulting
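The "([\w]+)+" warning at the end of the forwarded message is easy to reproduce. This Python sketch keeps the subject short (18 word characters) so that it actually finishes; every additional word character roughly doubles the time the engine takes to fail:

```python
import re
import time

# Nested quantifiers: when the overall match fails, a backtracking engine
# tries every way of splitting the run of word characters between the
# inner "+" and the outer "+" -- roughly 2^(n-1) partitions for n chars.
subject = "a" * 18 + "!"  # the "!" guarantees the trailing "$" never matches

start = time.perf_counter()
result = re.match(r"([\w]+)+$", subject)
elapsed = time.perf_counter() - start

print(result, round(elapsed, 3))  # None, after an exponential number of tries
```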
This archive was generated by hypermail 2b30 : Mon Aug 20 2001 - 11:59:02 PDT