[logs] Re: Anonymizing System Logs

From: Christina Noren (cfrln@private)
Date: Mon Jan 22 2007 - 22:52:35 PST


We include a fairly robust anonymizer command in the free Splunk
download (www.splunk.com). You don't need to use the Splunk tool in
order to take advantage of this command.

Docs on how to use it: http://www.splunk.com/doc/admin/dataanon

Features:

- It uses a dictionary plus an explicit public/private terms list to  
determine what to anonymize.
- It preserves correlation by substituting a given string with the  
same replacement throughout all files handled in a single  
anonymization pass.
- We've added common semantically meaningful terms such as "login"
and "smtp" to the canned public terms list so they are not
over-anonymized.
- You can run a sample of data through in test mode and it will  
suggest terms to add to the public and private lists based on  
analyzing overall term frequency.
- It also records all anonymized strings in an audit file so you can  
review it for over-anonymization.
- It preserves the integrity of data types by replacing numeric
strings with lower digits, replacing text strings with randomly
chosen names from a list of English given names, and preserving
string length.
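
As a rough illustration of the substitution behavior in the list above,
here is a minimal Python sketch. It is not the Splunk command's code.
The class name, the tiny word lists, and the tokenizing pattern are all
hypothetical, and it uses random digits for numeric strings as a
simplifying assumption. It shows three of the ideas: public terms are
left alone, a given string always gets the same replacement within one
pass, and replacements keep the original type and length.

```python
import random
import re

# Hypothetical word lists for illustration; a real run would load far larger ones.
PUBLIC_TERMS = {"login", "smtp", "failed", "for", "from", "user"}
GIVEN_NAMES = ["al", "bob", "cara", "dana", "elena", "fabian", "gregory"]

# Assumed tokenization: runs of ASCII letters or runs of ASCII digits.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+")


class Anonymizer:
    def __init__(self, public_terms, names, seed=0):
        self.public = {t.lower() for t in public_terms}
        self.names = names
        self.rng = random.Random(seed)
        # original -> replacement; reused for every file in the pass,
        # and usable afterwards as an audit record of what was changed.
        self.mapping = {}

    def _replacement(self, token):
        if token.isdigit():
            # Same number of digits, so the field still looks numeric.
            return "".join(self.rng.choice("0123456789") for _ in token)
        # Pick a given name, then repeat or truncate it to the original length.
        name = self.rng.choice(self.names)
        return (name * (len(token) // len(name) + 1))[: len(token)]

    def _substitute(self, match):
        token = match.group(0)
        if token.lower() in self.public:
            return token  # public terms pass through untouched
        if token not in self.mapping:
            self.mapping[token] = self._replacement(token)
        return self.mapping[token]

    def anonymize(self, line):
        return TOKEN.sub(self._substitute, line)


anonymizer = Anonymizer(PUBLIC_TERMS, GIVEN_NAMES)
first = anonymizer.anonymize("login failed for user jsmith from 10.0.0.12")
second = anonymizer.anonymize("smtp relay for jsmith 10")
```

In this sketch "jsmith" and "10" receive the same replacements in both
lines, so events can still be correlated after anonymization. One known
gap: two different originals can collide on the same replacement, which
a real tool would need to guard against.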

We're very interested in your experience with the tool - drop us a  
line at support@private to let us know if you have any issues or  
suggestions.


On Jan 22, 2007, at 12:47 PM, Adam Oliner wrote:

> My colleagues and I have obtained access to a large collection of
> system logs from 5 major supercomputers, and are currently working to
> get them into a form such that they are suitable for public release.
> These are raw logs, aggregated, in some cases, from many
> log-generating components (Lustre, netwatch, eventlogs, syslog...). They
> represent, cumulatively, more than 775 million processor-hours.
>
> The primary pieces of data that we are trying to anonymize are
> usernames, group names, pathnames, and IP/hostnames. So, we are
> looking for some input from the log analysis community.
>
> 1) Aside from some possibly-excessive pattern matching, can you
> suggest a good way of masking out this data from the unstructured
> message bodies?
>
> 2) Assuming that all such data was successfully removed, what other
> security concerns would you have? How might we address them?
>
> We would greatly appreciate your help.
>
> Sincerely,
>
>   - Adam J. Oliner
>     oliner@private
>     Department of Computer Science
>     Stanford University
>
>
>
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis



