Re: [logs] perl question relating to log analysis

shane@time-travellers.org

[ Warning: more Perl and/or general optimisation than log analysis ]

Russell,

On 2002-08-26 17:18:42 +1200, Russell Fulton wrote:
> I have recently reimplemented much of the functionality of Psionic's
> Logcheck in a perl script.  I have also added functionality to make it
> more useful in a central log server environment (you can specify
> specific checks for different hosts and have reports for different
> hosts mailed to different admins).
> 
> We are now testing it in a production environment; when we are happy
> with it and I have written some documentation (what's that ?? ;-) I
> will post the script to the list for others to have a play with.
> 
> My immediate concern is that the perl script builds functions that
> apply lots of regular expressions (REs) to each line of log files.
> 
> sub check {
>     $_ = shift;
>     study $_;   #hopefully speed up matching...
> 
>     return 0 if /re1/;
>     return 0 if /re2/;
>     return 1 if /re3/;
>     return 1 if /re4/;
>     return 1 if /re5/;
>     return 2 if /re6/;
>     return 2 if /re7/i;
>     return 3 if /re8/;
>     ...
>     return 4;
> }
> 
> return code tells the program what to do with this record.
> 
> Anyone know of any tricks to speed this up since this is the innermost
> loop of the process any gains here should be worthwhile.  I know the
> RE optimizer is pretty smart and that it will do some optimization
> over statements but I have never figured out what the limitations are.

The exact details of the "study" function are in the perlfunc man page:

    The way `study' works is this: a linked list of every character in
    the string to be searched is made, so we know, for example, where
    all the `'k'' characters are.  From each search string, the rarest
    character is selected, based on some static frequency tables
    constructed from some C programs and English text.  Only those
    places that contain this "rarest" character are examined.
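
To make that description a bit more concrete, here is a toy sketch of the idea in plain Perl. It is not perl's actual implementation (that lives in C inside the interpreter); the frequency table, the default weight of 5, and the helper names build_index, rarest_char and candidates are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical frequency weights (higher = more common).  These are NOT
# perl's real static tables, just enough to illustrate the idea.
my %freq = (e => 12, t => 9, a => 8, o => 7, k => 1, q => 0.1, z => 0.1);

# Record every position at which each character occurs in the text.
sub build_index {
    my ($text) = @_;
    my %pos;
    my $i = 0;
    push @{ $pos{$_} }, $i++ for split //, $text;
    return \%pos;
}

# Pick the character of a fixed search string that the table calls rarest.
# Characters missing from the table get an arbitrary middle weight of 5.
sub rarest_char {
    my ($needle) = @_;
    my ($best) = sort { ($freq{$a} // 5) <=> ($freq{$b} // 5) } split //, $needle;
    return $best;
}

# Only examine places where the rarest character occurs, and return the
# start offsets at which the whole search string really is present.
sub candidates {
    my ($text, $needle) = @_;
    my $c      = rarest_char($needle);
    my $offset = index($needle, $c);
    my $pos    = build_index($text)->{$c} || [];
    return grep { $_ >= 0 && substr($text, $_, length $needle) eq $needle }
           map  { $_ - $offset } @$pos;
}

print join(",", candidates("host kernel: link ok", "kernel")), "\n";
```

The point is only that the index is built once per string and then shared by every search against it, which is why study can pay off when many expressions are applied to the same line.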

Anyway, you should probably consider using the Benchmark module; see
"perldoc Benchmark" for details.  You can then play around with various
combinations:

Case 1:

     return 0 if /re1/;
     return 0 if /re2/;
     return 1 if /re3/;
     return 1 if /re4/;
     return 1 if /re5/;

Case 2:

     return 0 if /re1/ || /re2/;
     return 1 if /re3/ || /re4/ || /re5/;

Case 3:

     return 0 if /(re1)|(re2)/;
     return 1 if /(re3)|(re4)|(re5)/;

Case 4:

     if (/re1/) {
         return 0;
     } elsif (/re2/) {
         return 0;
     } elsif (/re3/) { 
         return 1;
     } elsif (/re4/) {
         return 1;
     } elsif (/re5/) {
         return 1;
     }

Case 5:

     if (/re1/ || /re2/) {
         return 0;
     } elsif (/re3/ || /re4/ || /re5/) {
         return 1;
     }

Case 6:

     if (/(re1)|(re2)/) {
         return 0;
     } elsif (/(re3)|(re4)|(re5)/) {
         return 1;
     }

Make sure you benchmark each case against lines that match each of your
expressions (e.g. re1, re4, etc.), since the cost depends on which one
matches.  I know for a fact that Case 4 is slower than Case 1, but Case 5
or Case 6 may be faster, so I threw them in.
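
A minimal sketch of how such a comparison might be set up with Benchmark's cmpthese, covering Cases 1 to 3 only. The patterns (sshd, named, refused, failed, denied) and the sample lines are hypothetical stand-ins for re1..re5 and your real log data, so the numbers it prints say nothing about your own expressions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Case 1: one statement per expression (hypothetical patterns).
sub case1 {
    local $_ = shift;
    return 0 if /sshd/;
    return 0 if /named/;
    return 1 if /refused/;
    return 1 if /failed/;
    return 1 if /denied/;
    return 4;
}

# Case 2: expressions with the same return code joined with ||.
sub case2 {
    local $_ = shift;
    return 0 if /sshd/ || /named/;
    return 1 if /refused/ || /failed/ || /denied/;
    return 4;
}

# Case 3: expressions with the same return code merged by alternation.
sub case3 {
    local $_ = shift;
    return 0 if /(sshd)|(named)/;
    return 1 if /(refused)|(failed)|(denied)/;
    return 4;
}

# Made-up sample: one line per expression, plus one that matches nothing.
my @lines = (
    "Aug 26 17:18:42 host sshd[123]: session opened",
    "Aug 26 17:18:43 host named[45]: zone loaded",
    "Aug 26 17:18:44 host kernel: connection refused",
    "Aug 26 17:18:45 host login: authentication failed",
    "Aug 26 17:18:46 host ftpd: access denied",
    "Aug 26 17:18:47 host cron: nothing interesting",
);

# Run each case for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    case1 => sub { for my $line (@lines) { case1($line) } },
    case2 => sub { for my $line (@lines) { case2($line) } },
    case3 => sub { for my $line (@lines) { case3($line) } },
});
```

For real measurements, feed it a representative chunk of your own logs rather than one line per pattern, so the mix of early and late matches is realistic.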

Try analysing your data and putting your most common cases first, so
they will match sooner and return before the rest are executed.
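
One way to do that analysis, as a rough sketch: count how many lines of a sample each expression matches and print them most frequent first. The pattern names, the patterns themselves, the sample lines and the count_hits helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical named patterns; substitute your real expressions.
my %pattern = (
    sshd    => qr/sshd/,
    refused => qr/refused/,
    denied  => qr/denied/,
);

# Count how many of the given lines each pattern matches.
sub count_hits {
    my ($patterns, $lines) = @_;
    my %hits = map { $_ => 0 } keys %$patterns;
    for my $line (@$lines) {
        for my $name (keys %$patterns) {
            $hits{$name}++ if $line =~ $patterns->{$name};
        }
    }
    return \%hits;
}

# Made-up sample; in practice read a representative chunk of a real log.
my @sample = (
    "host sshd[1]: session opened",
    "host sshd[2]: session closed",
    "host sshd[3]: session opened",
    "host kernel: connection refused",
    "host ftpd: access denied",
    "host kernel: connection refused",
);

my $hits = count_hits(\%pattern, \@sample);

# Most frequently matching pattern first; ties broken by name.
for my $name (sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits) {
    printf "%-8s %d\n", $name, $hits->{$name};
}
```

Bear in mind that the first matching expression wins, so reordering is only safe among expressions that return the same code, or that can never match the same line.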

If any of your expressions are exact matches, /^string$/, then use eq
(chomp the line first, since /^string$/ also matches "string\n"):

   return 5 if ($_ eq "string");
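
A small sketch of the difference, in case it helps. The exact_match helper and the "-- MARK --" text are just illustrative choices: /^...$/ tolerates a trailing newline, while eq does not, so the line is chomped before comparing.

```perl
use strict;
use warnings;

# Hypothetical helper: exact comparison that ignores one trailing newline,
# which is what /^string$/ would also do.
sub exact_match {
    my ($line, $want) = @_;
    chomp $line;                      # works on a copy; caller's line is untouched
    return $line eq $want ? 1 : 0;
}

my $line = "-- MARK --\n";

print "regex matches with newline\n" if $line =~ /^-- MARK --$/;
print "plain eq fails with newline\n" unless $line eq "-- MARK --";
print "eq matches after chomp\n"      if exact_match($line, "-- MARK --");
```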

If any of your expressions are simple constant substrings, / something/,
then you may wish to try index():

   return 6 if (index($_, " something") != -1);
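
Roughly like this, with a quick check that index() agrees with the regex it replaces. The contains helper and the sample lines are made up. The regex engine already treats constant strings fairly efficiently, so the gain may be small; it is worth benchmarking before converting everything.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $needle occurs anywhere in $line.
sub contains {
    my ($line, $needle) = @_;
    return index($line, $needle) != -1 ? 1 : 0;
}

# Made-up sample lines.
my @check = (
    "Aug 26 17:18:42 host su: something happened",
    "Aug 26 17:18:43 host su: nothing happened",
);

for my $line (@check) {
    my $by_regex = ($line =~ / something/) ? 1 : 0;
    my $by_index = contains($line, " something");
    printf "regex=%d index=%d  %s\n", $by_regex, $by_index, $line;
}
```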

Good luck!

-- 
Shane
Carpe Diem
_______________________________________________
LogAnalysis mailing list
http://lists.shmoo.com/mailman/listinfo/loganalysis