[logs] Re: Anonymizing System Logs

oliner@private

Thanks to everyone who replied, both on and off the list.

On Jan 23, 2007, at 2:41 AM, Tom Le wrote:
> Pattern matching is the way to go but it doesn't need to be  
> "excessive", just performed in a systematic way.  There are some  
> methods to rapidly anonymize your data.  For example, if you have  
> access to users, hosts, and other "objects" on the various systems  
> generating logs, you can rapidly build a dictionary without having  
> to go through the painful process of identifying each type of message.
That's a good idea. We've been using the passwd and group files for  
the systems to get information like usernames and home directories.  
In addition, there were several obvious patterns that we could  
remove, like IPs and email addresses. I say it is excessive because  
we are aiming to err on the side of caution, which means stripping  
some data that isn't actually sensitive. For example, there are  
usernames that share structure with text found in message bodies, but  
which are semantically distinct; we strip those out to be sure. The  
logs are tens of GB each, so any manual review is essentially  
intractable.
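
As a rough illustration of the dictionary-plus-patterns approach
described above (a minimal sketch, not our actual tooling): the sample
passwd entries, the placeholder tokens, and the function names below
are all made up for the example, and the IP and email patterns are
deliberately loose so that they over-match rather than under-match.

```python
import re

# Hypothetical passwd-format sample; real input would come from the
# passwd files of the systems that produced the logs.
PASSWD_SAMPLE = """\
jdoe:x:1001:1001:John Doe:/home/jdoe:/bin/bash
asmith:x:1002:1002:Alice Smith:/home/asmith:/bin/bash
"""

# Loose patterns: they err toward stripping too much.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def load_passwd_terms(passwd_text):
    """Collect usernames and home directories from passwd-format text."""
    users, homes = set(), set()
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 7 and not line.startswith("#"):
            users.add(fields[0])
            homes.add(fields[5])
    return users, homes


def _alternation(terms):
    """Regex alternation of literal terms, longest first; never matches if empty."""
    if not terms:
        return r"(?!)"
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


def build_scrubber(users, homes):
    """Return a function that masks emails, IPs, home directories, and usernames."""
    home_re = re.compile("(?:" + _alternation(homes) + ")")
    user_re = re.compile(r"\b(?:" + _alternation(users) + r")\b")

    def scrub(line):
        # Emails and paths go first so the usernames inside them are
        # consumed before the bare-username pass runs.
        line = EMAIL_RE.sub("<EMAIL>", line)
        line = IPV4_RE.sub("<IP>", line)
        line = home_re.sub("<PATH>", line)
        line = user_re.sub("<USER>", line)
        return line

    return scrub
```

Note that the bare-username pass will also strip any ordinary word in a
message body that happens to equal a username, which is exactly the
kind of over-stripping mentioned above.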

> Most mainframe systems lend themselves to anonymizing easily either  
> because the messages are well delimited or have a published schema  
> (e.g. OS390).
True. Some of our systems use either a customized RAS infrastructure  
or a combination of common formats; both complicate this process.

> Make sure you modify the timestamps.  This is a common step left  
> out during log anonymization.  Even with all object names  
> anonymized, useful security information can be obtained by looking  
> at timestamps (e.g. when various cron jobs run, a time frame an  
> attacker could use to disguise an attack).
We haven't run across any resistance to publicizing the timestamps.  
Partly, this is because the systems are well-publicized. Six months  
of logs on a supercomputer that's only six months old pretty much  
pins down the real dates. Partly, this is because the systems are  
physically secured and not accessible externally. Once someone is  
able to so much as ping the system, it's already a bad day. The  
primary concern, from what I can tell, is that people will glean who  
is working on what.

> Another consideration is to choose whether you will directly replace  
> object names with a static or variable replacement.  For example,  
> will you always replace "server123" with "aaa" and user "john.doe"  
> with "bbb", or will you vary all object name replacements?
We want this data to be useful for event prediction and intrusion  
detection research, so our goal is to maintain consistent mappings  
where possible.
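
One way to get a consistent mapping without keeping a lookup table is a
keyed hash. The following is only a sketch under assumptions: the key,
the prefixes, and the function name are hypothetical, and it is one
possible scheme rather than the one we have settled on.

```python
import hashlib
import hmac


def make_pseudonymizer(secret_key, prefix, digits=10):
    """Return a function mapping each name to a stable pseudonym.

    The same name always yields the same token for a given key, so
    relationships between events survive anonymization. Without the key
    the mapping cannot simply be recomputed from a list of guessed names.
    """
    def pseudonym(name):
        digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()
        # Truncating shortens the token but makes collisions possible
        # in principle; raise `digits` for very large name sets.
        return f"{prefix}_{digest[:digits]}"

    return pseudonym


# Hypothetical usage: one mapper per object type, sharing a private key.
key = b"replace-with-a-private-random-key"
anon_user = make_pseudonymizer(key, "user")
anon_host = make_pseudonymizer(key, "host")
```

Because the output depends only on the key and the name, separate log
files (and separate runs) map "server123" to the same token, which is
what event prediction and intrusion detection work would need.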

> Finally, you can consider the concept of "polluting" your own  
> data.  If you have concern that your anonymized data can still be  
> reverse engineered to some degree that may be useful to an  
> attacker, generate fake events of equal proportion across all your  
> devices.  For example, generate sshd and apache events across all  
> your Unix servers.  Generate Active Directory, MSSQL and IIS events  
> across all your Windows servers.  If you're concerned about timing  
> of your batch processes being detected - duplicate output from your  
> batch jobs with equal time frequencies.  You don't need to actually  
> generate the event on the system, just produce them from known log  
> events and insert them into the proper location, modifying the  
> necessary attributes.  This way when you run your anonymizing  
> process, the character & accuracy of the log data is still there  
> (for log analysis), but any attempt to reverse engineer the data  
> will either be useless or greatly increase the probability of  
> running into many dead ends.
These are more good ideas. I'll take a closer look at them.
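
For concreteness, my reading of the "pollution" suggestion is roughly
the following sketch. The event tuple layout, the function names, and
the fixed-period schedule are assumptions made for illustration; the
decoy message would be copied from a known real log event, as suggested.

```python
import heapq
from datetime import datetime, timedelta


def decoy_events(hosts, template_message, start, end, period):
    """Yield (timestamp, host, message) decoys at a fixed period on every host."""
    t = start
    while t < end:
        for host in hosts:
            yield (t, host, template_message)
        t += period


def pollute(real_events, decoys):
    """Merge two timestamp-sorted event streams into one sorted list.

    On equal timestamps, events from `real_events` come first.
    """
    return list(heapq.merge(real_events, decoys, key=lambda e: e[0]))
```

The merge only touches timestamps, so it would run before the
anonymizing pass and the decoys would be rewritten along with the real
events.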

Thanks again.

>  On 1/22/07, Adam Oliner <oliner@private> wrote:
> My colleagues and I have obtained access to a large collection of
> system logs from 5 major supercomputers, and are currently working to
> get them into a form such that they are suitable for public release.
> These are raw logs, aggregated, in some cases, from many log-
> generating components (Lustre, netwatch, eventlogs, syslog...). They
> represent, cumulatively, more than 775 million processor-hours.
>
> The primary pieces of data that we are trying to anonymize are
> usernames, group names, pathnames, and IP/hostnames. So, we are
> looking for some input from the log analysis community.
>
> 1) Aside from some possibly-excessive pattern matching, can you
> suggest a good way of masking out this data from the unstructured
> message bodies?
>
> 2) Assuming that all such data was successfully removed, what other
> security concerns would you have? How might we address them?
>
> We would greatly appreciate your help.
>
> Sincerely,
>
> - Adam J. Oliner
>    oliner@private
>    Department of Computer Science
>    Stanford University
>
>
>
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
>

  - Adam J. Oliner
    oliner@private
    Department of Computer Science
    Stanford University
