My colleagues and I have obtained access to a large collection of system logs from 5 major supercomputers, and are currently working to get them into a form suitable for public release. These are raw logs, aggregated, in some cases, from many log-generating components (Lustre, netwatch, eventlogs, syslog...). They represent, cumulatively, more than 775 million processor-hours.

The primary pieces of data that we are trying to anonymize are usernames, group names, pathnames, and IP addresses/hostnames. So, we are looking for some input from the log analysis community.

1) Aside from some possibly-excessive pattern matching, can you suggest a good way of masking out this data from the unstructured message bodies?

2) Assuming that all such data were successfully removed, what other security concerns would you have? How might we address them?

We would greatly appreciate your help.

Sincerely,

- Adam J. Oliner
oliner@private
Department of Computer Science
Stanford University

_______________________________________________
LogAnalysis mailing list
LogAnalysis@private
http://lists.shmoo.com/mailman/listinfo/loganalysis
This archive was generated by hypermail 2.1.3 : Mon Jan 22 2007 - 22:13:14 PST