My colleagues and I have obtained access to a large collection of
system logs from five major supercomputers, and are currently working
to prepare them for public release.
These are raw logs, in some cases aggregated from many log-generating
components (Lustre, netwatch, eventlogs, syslog...). Together, they
represent more than 775 million processor-hours.
The primary pieces of data that we are trying to anonymize are
usernames, group names, pathnames, and IP addresses/hostnames. We are
looking for input from the log analysis community on two questions.
1) Aside from some possibly-excessive pattern matching, can you
suggest a good way of masking out this data from the unstructured
message bodies?
2) Assuming that all such data were successfully removed, what other
security concerns would you have? How might we address them?
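For reference, the pattern-matching baseline mentioned in question 1
might look something like the minimal sketch below. It is only an
illustration under stated assumptions, not the method we have settled
on: the two regular expressions, the secret key, the token format, and
the list of known usernames are all hypothetical, and real message
bodies would need many more patterns. Each match is replaced by a keyed
hash, so the same value always maps to the same token and events can
still be correlated after release.

```python
import hashlib
import hmac
import re

# Hypothetical secret key; it would have to stay private, since anyone
# holding it could confirm guesses about the original values.
SECRET_KEY = b"replace-with-a-private-key"

# Illustrative patterns only; they are assumptions, not a complete set.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PATH_RE = re.compile(r"(?<![\w.])/(?:[\w.\-]+/)*[\w.\-]+")


def pseudonym(kind, value):
    """Map a sensitive value to a stable keyed token, e.g. <USER:...>."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return "<%s:%s>" % (kind, digest.hexdigest()[:8])


def mask_line(line, known_users=()):
    """Mask paths, IPv4 addresses, and known usernames in one log line."""
    # Paths go first so that usernames embedded in them vanish with them.
    line = PATH_RE.sub(lambda m: pseudonym("PATH", m.group(0)), line)
    line = IPV4_RE.sub(lambda m: pseudonym("IP", m.group(0)), line)
    # Usernames have no reliable shape, so this assumes a known list
    # (for example, taken from the site's account database).
    for user in known_users:
        user_re = re.compile(r"\b%s\b" % re.escape(user))
        line = user_re.sub(lambda m, u=user: pseudonym("USER", u), line)
    return line
```

Calling mask_line("session opened for user alice from 10.1.2.3",
known_users=["alice"]) would replace both the username and the address
with tokens. The obvious weakness is that anything the patterns or the
username list miss passes through untouched, which is what prompts the
question.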
We would greatly appreciate your help.
Sincerely,
- Adam J. Oliner
oliner@private
Department of Computer Science
Stanford University
This archive was generated by hypermail 2.1.3 : Mon Jan 22 2007 - 22:13:14 PST