Re: [logs] perl question relating to log analysis

From: Andy_Bachat_private
Date: Thu Aug 29 2002 - 12:29:46 PDT

    I asked a perl group expert ...
    
    "To understand recursion, we must first understand recursion."
    ----- Forwarded by Andy Bach/WIWB/07/USCOURTS on 08/29/02 02:28 PM -----
    
    
    Quoting Adam Rice wysiwygat_private
    
    I've done a lot of log-munging in Perl, and I must report that for any
    significant amount of logs, regexps just aren't fast enough. In some cases
    I've found a solution using index() and rindex() that was adequate. But once
    you get to that level of optimisation, Perl becomes as ugly as C, and the C
    solution is generally more flexible (because it doesn't have to be
    hand-optimised to death to achieve acceptable speed).
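
    [A minimal sketch of that index()/rindex() style, assuming Common Log
    Format lines where the request is the first quoted field and the byte
    count is the last space-separated field:]

    # Sketch only: pull the request and byte count out of a CLF line
    # without firing up the regexp engine.
    while (my $line = <STDIN>) {
        my $q1 = index($line, '"');             # opening quote of the request
        next if $q1 < 0;
        my $q2 = index($line, '"', $q1 + 1);    # closing quote of the request
        next if $q2 < 0;
        my $request = substr($line, $q1 + 1, $q2 - $q1 - 1);
        my $sp      = rindex($line, ' ');       # last field is the byte count
        my $bytes   = substr($line, $sp + 1);
        chomp $bytes;
        # ... use $request and $bytes ...
    }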
    
    If you have to use regexps, it's worth tinkering with them. Often with
    careful use of character classes, you can save Perl from having to do
    backtracking. Try to avoid anchoring from the end of the string... it looks
    like it should be fast, but in my experience it isn't. Anchor to the start
    of the string where it makes sense, but not if it makes the regexp more
    complicated. Complex regular expressions are really slow, so try breaking
    them down into several smaller ones.
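
    [An illustrative sketch of the character-class point: both patterns below
    pull fields out of a Common Log Format line, but the second anchors at the
    start and uses negated character classes, so the engine walks the line
    roughly once instead of backtracking through the nested .* groups.]

    my $backtracky = qr/^(.*) \[(.*)\] "(.*)" (\d+) (\d+)/;
    my $tighter    = qr/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\d+|-)/;

    if ($line =~ $tighter) {
        my ($host, $when, $request, $status, $bytes) = ($1, $2, $3, $4, $5);
        # ... $bytes is "-" when no body was sent ...
    }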
    
    On the other hand, for doing ad-hoc queries against server logs, Perl is
    usually the language of choice. Cute tip: since the grep variants are way
    faster than Perl, use them to narrow the field before Perl does the grunt
    work. Say you want a list of JPEG files larger than 200k, together with how
    often they were served:
    
    zgrep -F ".jpg" logfile.gz | egrep ' [0-9][0-9][0-9][0-9][0-9][0-9] ' |
      perl -ne 'print "$1\t$2\n" if / "GET (\/[^\s\"]+)[^"]*" \d+ (\d+) / && $2>200*1024' |
      sort | uniq -c
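
    [Two notes on the pipe above. First, the egrep pre-filter as written only
    passes byte counts that are exactly six digits long, so anything of a
    megabyte or more never reaches Perl; ' [0-9]{6,} ' would keep those too.
    Second, a commented expansion of the perl -ne stage, assuming the usual
    Common Log Format where the quoted request is followed by the status code
    and the byte count:]

    # Same test as the one-liner, spelled out.
    while (my $line = <STDIN>) {
        if ($line =~ / "GET (\/[^\s\"]+)[^"]*" \d+ (\d+) /) {
            my ($path, $bytes) = ($1, $2);
            print "$path\t$bytes\n" if $bytes > 200 * 1024;   # larger than 200k
        }
    }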
    
    Always test on a subset of your logs first! Where I work, a command like
    this will take an hour on a full month's logs, and you'll be very annoyed if
    you wait that long to discover you made a typo.
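
    [One way to do that, for example: decompress a few thousand lines and run
    them through the same filters before committing to the full file:]

    zcat logfile.gz | head -n 10000 | grep -F ".jpg" |
      egrep ' [0-9][0-9][0-9][0-9][0-9][0-9] ' |
      perl -ne 'print "$1\t$2\n" if / "GET (\/[^\s\"]+)[^"]*" \d+ (\d+) / && $2>200*1024' |
      sort | uniq -c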
    
    Tip 3: "top" is good for getting an immediate idea of how efficient your
    command is. Ideally you want the "gzip" process using 80% or more of the
    CPU. If it's only pulling 20%, it'll take four times as long. With a
    multi-stage pipe like this, you can easily see which stage is the bottleneck.
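
    [A quick cross-check along the same lines: time the decompression stage on
    its own; if the full pipe takes much longer than that, one of the later
    stages is the bottleneck.]

    time zcat logfile.gz > /dev/null                 # decompression alone
    time zgrep -F ".jpg" logfile.gz > /dev/null      # decompression + first filter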
    
    Adam
    
    -- 
    Adam Rice -- wysiwygat_private -- Blackburn, Lancashire, England
    
    
    _______________________________________________
    LogAnalysis mailing list
    LogAnalysisat_private
    http://lists.shmoo.com/mailman/listinfo/loganalysis
    