On Tue, Jun 18, 2002 at 10:56:40AM -0700, Bill Rhodes wrote:
> If you are truly after testable data merely for empirical purposes, you
> might do well to write a small Perl script or some such which will excerpt
> and then anonymize some statistically large yet still manageable chunk of
> log data.

Ask, and ye shall receive; I just sat down and hacked together a pretty nice (if I do say so myself) tool to do this, and have posted it as <http://www.enterpriselogging.net/tools/scrub_log>. It doesn't handle the task of extracting a random sample, but I've got another script to do that that I'll be posting shortly in that same directory, so the two could be connected in series to get what you want.

Here's the embedded documentation from scrub_log (extractable from the original file via the pod2text/pod2html/pod2whatever commands); let me know if you all find it useful, if there are any bugs, if there are any features you'd like added, etc.:

    scrub_log [ -p <pattern> ... ] [ -i ] [ -r <read_config_file> | -w <write_config_file> | -c <config_file> ] input_file ...

Sample Invocations:

    $ scrub_log /var/log/syslog/local0log
    $ scrub_log -p 10.3. -p idiopathic.net /var/log/syslog/*
    $ scrub_log -c /etc/log/scrub_map -p idiopathic.net -i /var/log/syslog/*

scrub_log acts as a filter on its input file(s) (or standard input), replacing potentially sensitive text strings with placeholders so that those files (usually containing log data, hence the name) can be shared without fear of disclosing that sensitive information. All strings that are "scrubbed" are replaced by a unique identifier in a one-to-one mapping of original string to replacement; a file containing three distinct IP addresses, for example, each of which occurred multiple times, would have three distinct replacement strings used--one for each original IP address.
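To make the one-to-one mapping concrete, here is a minimal Python sketch of the idea. It is not the scrub_log source (which is Perl); the PLACEHOLDER_ prefix, the scrub() helper, and the simplified IPv4 regex are all assumptions made for illustration, and it replaces bare addresses only.

```python
import re

# Hypothetical placeholder prefix, chosen only for this sketch.
PREFIX = "PLACEHOLDER_"

# Simplified IPv4 matcher (assumption: octet ranges are not validated).
IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def scrub(lines, mapping=None):
    """Replace each distinct IP with a stable placeholder.

    Returns the scrubbed lines and the original -> placeholder table.
    """
    mapping = {} if mapping is None else mapping

    def substitute(match):
        original = match.group(0)
        # First sighting of an address allocates the next placeholder;
        # later sightings reuse it, keeping the mapping one-to-one.
        if original not in mapping:
            mapping[original] = f"{PREFIX}{len(mapping) + 1}"
        return mapping[original]

    return [IP_RE.sub(substitute, line) for line in lines], mapping


log = [
    "accept 10.3.0.1 -> 192.168.1.5",
    "deny 10.3.0.1 -> 172.16.0.9",
    "accept 192.168.1.5 -> 10.3.0.1",
]
scrubbed, table = scrub(log)
for line in scrubbed:
    print(line)
```

Three distinct addresses appear in the sample, so the table ends up with three entries, and every repeat of an address gets the placeholder it was given the first time.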
Mappings are generated uniquely for each invocation, but persist across input files of a single invocation; if a second file were provided in the example just mentioned, which contained one of the IP addresses present in the first file, then the replacement string for that IP address would be the same as the replacement string used for the first file.

The -r, -w, and -c flags can be used to store state between invocations. The -w flag specifies a config file to which scrub_log should write out a dump of its replacement string mapping; the -r flag can then be used to specify such a config file from which to read a previously created mapping. The -c flag is a shorthand to specify a single file that should be used to seed the mapping before running, and to which the new mapping table (containing any additional mappings generated during the most recent run) should be added after completion. (These files are managed using the Perl Data::Dumper module, and can thus be modified by hand if so desired.)

By default, scrub_log replaces all IP address-containing "words" (defined below) in the input file(s) with the string "SCRUBBED_STRING_n", where n is a non-padded integer that uniquely maps each SCRUBBED_STRING back to an original IP. The -p flag can be specified along with a literal (i.e. non-regex) pattern that should be searched for instead of IP addresses; multiple such patterns can be specified by the use of multiple -p flags. Using any -p flags turns off the default behaviour of scrubbing IP addresses; using the -i flag will turn that behaviour back on even if -p flags are also used.

Note that a "word" is defined by scrub_log to be any contiguous sequence of alphanumeric, period, or hyphen characters. The string 'a192.168.1.5', then, will be considered to contain an IP address, and be scrubbed by the default behaviour; at the moment, there is no way to tell scrub_log to do an exact match for a pattern rather than a match on any word in which the pattern exists.
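The word-based matching and the interplay of -p and -i can be sketched roughly as follows. This is an illustrative Python approximation, not the Perl implementation: scrub_line(), its patterns and force_ips parameters (standing in for -p and -i), and the simplified IPv4 regex are hypothetical. It uses only the SCRUBBED_STRING_n form for every replacement.

```python
import re

# A "word" as described above: a run of alphanumerics, periods, or hyphens.
WORD_RE = re.compile(r"[A-Za-z0-9.-]+")

# Simplified IPv4 matcher (assumption: octet ranges are not validated).
IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def scrub_line(line, mapping, patterns=(), force_ips=False):
    """Scrub whole words that contain a literal pattern or an IP address.

    patterns  -- literal substrings, standing in for repeated -p flags
    force_ips -- stands in for -i: scrub IPs even when patterns are given
    """
    # IP scrubbing is on by default, off once any pattern is supplied,
    # and back on when force_ips is set.
    scrub_ips = force_ips or not patterns

    def substitute(match):
        word = match.group(0)
        hit = any(p in word for p in patterns)  # literal, non-regex match
        if scrub_ips and IP_RE.search(word):
            hit = True
        if not hit:
            return word
        # The whole word is replaced, not just the part that matched.
        if word not in mapping:
            mapping[word] = f"SCRUBBED_STRING_{len(mapping) + 1}"
        return mapping[word]

    return WORD_RE.sub(substitute, line)


table = {}
print(scrub_line("login from a192.168.1.5 ok", table))
print(scrub_line("mail.idiopathic.net saw 10.3.0.1", {}, patterns=["idiopathic.net"]))
print(scrub_line("mail.idiopathic.net saw 10.3.0.1", {}, patterns=["idiopathic.net"], force_ips=True))
```

Because matching is per word, 'a192.168.1.5' is replaced in its entirety. One side effect of this word definition is that a sentence-ending period directly after an address becomes part of the word.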
Also note that matched words are checked to see if they are IP addresses (rather than simply containing IP addresses); if that condition is true, then the replacement string is "SCRUBBED_IP_n" rather than "SCRUBBED_STRING_n". This behaviour is applied to matches generated by both user-specified patterns and the default IP-matching pattern; as a result, the default pattern will usually only produce SCRUBBED_IP strings. Note that the mapping of index numbers between original strings and replacements is done separately for IP addresses and strings, such that SCRUBBED_IP_1 and SCRUBBED_STRING_1 will each refer to a distinct original string, the former of which would be an IP address.

-- Sweth.

-- 
Sweth Chandramouli      Idiopathic Systems Consulting
svcat_private           http://www.idiopathic.net/