Re: [logs] OT: 'Automated Log Analysis'

From: Sweth Chandramouli (loganalysisat_private)
Date: Tue Jun 18 2002 - 22:11:16 PDT

    On Tue, Jun 18, 2002 at 10:56:40AM -0700, Bill Rhodes wrote:
    > If you are truly after testable data merely for empirical purposes, you
    > might do well to write a small Perl script or some such which will excerpt
    > and then anonymize some statistically large yet still manageable chunk of
    > log data.
    	Ask, and ye shall receive; I just sat down and hacked
    together a pretty nice (if I do say so myself) tool to do this, and
    have posted it as <http://www.enterpriselogging.net/tools/scrub_log>.  It
    doesn't handle the task of extracting a random sample, but I've got
    another script for that, which I'll be posting shortly in that same
    directory, so the two could be connected in series to get what you want.
    Here's the embedded documentation from scrub_log (extractable from the
    original file via the pod2text/pod2html/pod2whatever commands); let me
    know if you all find it useful, if there are any bugs, if there are any
    features you'd like added, etc.:
    
        scrub_log [ -p <pattern> ... ] [ -i ] [ -r <read_config_file> | -w
        <write_config_file> | -c <config_file> ] input_file ...
    
        Sample Invocations:
    
        $ scrub_log /var/log/syslog/local0log
    
        $ scrub_log -p 10.3. -p idiopathic.net /var/log/syslog/*
    
        $ scrub_log -c /etc/log/scrub_map -p idiopathic.net -i /var/log/syslog/*
    
        scrub_log acts as a filter on its input file(s) (or standard input),
        replacing potentially sensitive text strings with placeholders so that
        those files (usually containing log data, hence the name) can be shared
        without fear of disclosing that sensitive information.
    
        All strings that are "scrubbed" are replaced by a unique identifier in a
        one-to-one mapping of original string to replacement; a file containing
        three distinct IP addresses, for example, each of which occurred
        multiple times, would have three distinct replacement strings used--one
        for each original IP address. Mappings are generated uniquely for each
        invocation, but persist across input files of a single invocation; if a
        second file were provided in the example just mentioned, which contained
        one of the IP addresses present in the first file, then the replacement
        string for that IP address would be the same as the replacement string
        used for the first file.
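
        The one-to-one mapping described above can be sketched roughly as
        follows. This is an illustration in Python, not the actual Perl
        source of scrub_log; the names IP_RE and scrub_lines, the simple
        dotted-quad regex, and the choice to start numbering at 1 are all
        assumptions made for the example.

```python
import re

# Assumption: a simple dotted-quad pattern stands in for scrub_log's IP matcher.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def scrub_lines(lines, mapping):
    """Replace each IP with a placeholder, reusing entries already in mapping."""
    def replace(match):
        original = match.group(0)
        if original not in mapping:
            # Hypothetical numbering scheme: next free index, starting at 1.
            mapping[original] = f"SCRUBBED_IP_{len(mapping) + 1}"
        return mapping[original]

    return [IP_RE.sub(replace, line) for line in lines]


mapping = {}  # one table shared by every input file of a single invocation
first_file = ["login from 10.3.1.7", "denied 192.168.1.5", "login from 10.3.1.7"]
second_file = ["logout 192.168.1.5"]

print(scrub_lines(first_file, mapping))
print(scrub_lines(second_file, mapping))  # 192.168.1.5 keeps its placeholder
```

        Because the same table is passed to both calls, an address seen in
        the first file gets the same replacement in the second.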
    
        The -r, -w, and -c flags can be used to store state between invocations.
        The -w flag specifies a config file to which scrub_log should write out
        a dump of its replacement string mapping; the -r flag can then be used
        to specify such a config file from which to read a previously created
        mapping. The -c flag is a shorthand to specify a single file that should
        be used to seed the mapping before running, and to which the new mapping
        table (containing any additional mappings generated during the most
        recent run) should be added after completion. (These files are managed
        using the Perl Data::Dumper module, and can thus be modified by hand if
        so desired.)
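
        The seed-then-update cycle that -c describes might look something
        like the sketch below. scrub_log itself stores its table with Perl's
        Data::Dumper; JSON is swapped in here purely to keep the Python
        example self-contained, and load_mapping, save_mapping,
        placeholder_for, and the scrub_map.json file name are hypothetical.

```python
import json
import os
import tempfile


def load_mapping(path):
    """Return a saved mapping, or an empty one if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_mapping(path, mapping):
    """Write the mapping out in a form that can still be edited by hand."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2, sort_keys=True)


def placeholder_for(word, mapping):
    """Reuse the stored placeholder for word, or allocate the next index."""
    if word not in mapping:
        mapping[word] = f"SCRUBBED_STRING_{len(mapping) + 1}"
    return mapping[word]


# Two simulated runs sharing one state file, in the spirit of -c.
state_file = os.path.join(tempfile.mkdtemp(), "scrub_map.json")

run1 = load_mapping(state_file)            # first run: nothing saved yet
placeholder_for("mail.idiopathic.net", run1)
save_mapping(state_file, run1)

run2 = load_mapping(state_file)            # second run: seeded from the file
print(placeholder_for("mail.idiopathic.net", run2))  # reused from run 1
print(placeholder_for("www.idiopathic.net", run2))   # newly allocated
save_mapping(state_file, run2)
```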
    
        By default, scrub_log replaces all IP address-containing "words"
        (defined below) in the input file(s) with the string
        "SCRUBBED_STRING_n", where n is a non-padded integer that uniquely maps
        each SCRUBBED_STRING back to its original word. The -p flag can be
        specified along with a literal (i.e. non-regex) pattern that should be
        searched for instead of IP addresses; multiple such patterns can be
        specified by the use of multiple -p flags. Using any -p flags turns off
        the default behaviour of scrubbing IP addresses; using the -i flag will
        turn that behaviour back on even if -p flags are also used.
    
        Note that a "word" is defined by scrub_log to be any contiguous sequence
        of alphanumeric, period, or hyphen characters. The string
        'a192.168.1.5', then, will be considered to contain an IP address, and
        be scrubbed by the default behaviour; at the moment, there is no way to
        tell scrub_log to do an exact match for a pattern rather than a match on
        any word in which the pattern exists.
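
        As a rough illustration of the "word" rule and of how -p and -i
        interact, here is a Python sketch. It is not taken from scrub_log;
        the function words_to_scrub, its parameters, the ASCII-only character
        class, and the dotted-quad regex are assumptions for the example.

```python
import re

# A "word": contiguous alphanumeric, period, or hyphen characters (ASCII assumed).
WORD_RE = re.compile(r"[A-Za-z0-9.-]+")
# Assumption: a simple dotted quad anywhere inside the word counts as an IP.
IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def words_to_scrub(line, patterns=(), force_ips=False):
    """Return the words in line that would be scrubbed.

    patterns  -- literal substrings, one per hypothetical -p flag
    force_ips -- stands in for -i: keep IP matching on alongside patterns
    """
    check_ips = force_ips or not patterns  # any -p turns IP matching off
    hits = []
    for word in WORD_RE.findall(line):
        has_pattern = any(p in word for p in patterns)
        has_ip = check_ips and IP_RE.search(word) is not None
        if has_pattern or has_ip:
            hits.append(word)
    return hits


line = "host mail.idiopathic.net at 10.3.0.9 via a192.168.1.5"
print(words_to_scrub(line))
print(words_to_scrub(line, patterns=["idiopathic.net"]))
print(words_to_scrub(line, patterns=["idiopathic.net"], force_ips=True))
```

        Note that the whole word is returned whenever any part of it
        matches, which is why 'a192.168.1.5' is caught by the default
        behaviour.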
    
        Also note that matched words are checked to see if they are IP addresses
        (rather than simply containing IP addresses); if that condition is true,
        then the replacement string is "SCRUBBED_IP_n" rather than
        "SCRUBBED_STRING_n". This behaviour is applied to matches generated by
        both user-specified patterns and the default IP-matching pattern; as a
        result, the default pattern will usually only produce SCRUBBED_IP
        strings. Note that the mapping of index numbers between original strings
        and replacements is done separately for IP addresses and strings, such
        that SCRUBBED_IP_1 and SCRUBBED_STRING_1 will each refer to a distinct
        original string, the former of which would be an IP address.
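
        The separate numbering of IP and string placeholders could be
        modelled along these lines. Again this is a Python sketch rather than
        the tool's own code; the Scrubber class, its attribute names, and the
        whole-word dotted-quad test are assumptions.

```python
import re

# Assumption: a word "is" an IP address when the entire word is a dotted quad.
WHOLE_IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


class Scrubber:
    """Keeps two independent tables, so each placeholder kind has its own index."""

    def __init__(self):
        self.ip_map = {}
        self.string_map = {}

    def placeholder(self, word):
        if WHOLE_IP_RE.fullmatch(word):
            table, prefix = self.ip_map, "SCRUBBED_IP"
        else:
            table, prefix = self.string_map, "SCRUBBED_STRING"
        if word not in table:
            table[word] = f"{prefix}_{len(table) + 1}"
        return table[word]


scrubber = Scrubber()
for word in ["192.168.1.5", "a192.168.1.5", "10.3.0.9", "192.168.1.5"]:
    print(word, "->", scrubber.placeholder(word))
```

        Here the bare address and the word that merely contains it both
        receive index 1, but under different prefixes.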
    
    	-- Sweth.
    
    -- 
    Sweth Chandramouli      Idiopathic Systems Consulting
    svcat_private      http://www.idiopathic.net/
    



    This archive was generated by hypermail 2b30 : Wed Jun 19 2002 - 12:57:04 PDT