| 1) Aside from some possibly-excessive pattern matching, can you
| suggest a good way of masking out this data from the unstructured
| message bodies?

Pattern matching is the way to go, but it doesn't need to be "excessive"; it just needs to be performed in a systematic way. There are some methods to rapidly anonymize your data. For example, if you have access to the users, hosts, and other "objects" on the various systems generating logs, you can rapidly build a dictionary without having to go through the painful process of identifying each type of message.

You can also look at the "object grammar" and usually arrive at an 80/20 or 90/10 solution without having to identify each specific log. For example, instead of identifying each type of Unix message separately, you can look at the prepositions used, such as "from", "to", and "for user". Unix logs are generally easier to anonymize because you won't see spaces in object names nearly as much as on Windows. Most mainframe systems lend themselves to easy anonymization because the messages are either well delimited or have a published schema (e.g. OS/390).

| 2) Assuming that all such data was successfully removed, what other
| security concerns would you have? How might we address them?

Make sure you modify the timestamps. This is a step commonly left out of log anonymization. Even with all object names anonymized, useful security information can be obtained from the timestamps alone (e.g. when various cron jobs run, a time frame an attacker could use to disguise an attack). However, keeping the sequence of events intact will be of great use to log analysis. This is a double-edged sword, as the relative time between events can reveal information as well. For example, suppose I want to exploit a Veritas NetBackup vulnerability, but I only want to launch the attack during the weekly full backup jobs.
You modify the timestamps to mask the time of the job, but you keep the sequence of other events; so while I don't know when the weekly backup job runs, I know it occurs X minutes after Y event.

Another consideration is to choose whether you will replace object names with a static or a variable replacement. For example, will you always replace "server123" with "aaa" and user "john.doe" with "bbb", or will you vary all object name replacements? Again, this is a double-edged sword. There is a lot of value in looking at the sequence of log events, which you can only do effectively if the object names have a static mapping. But you do give away some security, as some information can be reverse engineered from logs with static mappings. One compromise here is to vary the mappings every X lines, i.e. maintain static mappings over each run of X lines only. Or better yet, randomly select when you will rotate object mappings.

Finally, you can consider the concept of "polluting" your own data. If you are concerned that your anonymized data can still be reverse engineered to a degree that may be useful to an attacker, generate fake events in equal proportion across all your devices. For example, generate sshd and Apache events across all your Unix servers. Generate Active Directory, MSSQL, and IIS events across all your Windows servers. If you're concerned about the timing of your batch processes being detected, duplicate the output from your batch jobs at equal time frequencies. You don't need to actually generate the events on the system; just produce them from known log events and insert them into the proper location, modifying the necessary attributes. This way, when you run your anonymizing process, the character and accuracy of the log data is still there (for log analysis), but any attempt to reverse engineer the data will either be useless or will very likely run into many dead ends.
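To make the "object grammar" idea concrete, here is a minimal Python sketch of preposition-based masking for Unix-style messages. The patterns, the sample sshd line, and the USER/HOST replacement tokens are all illustrative assumptions, not a complete rule set:

```python
import re

# Illustrative sketch: mask object names that follow common syslog
# prepositions ("from", "to", "for user"), assuming space-delimited
# Unix-style messages where object names contain no spaces.
PATTERNS = [
    # "for user john.doe" / "for invalid user john.doe"
    (re.compile(r'\b(for (?:invalid )?user )(\S+)'), r'\1USER'),
    # "from 10.1.2.3" / "from host123.example.com"
    (re.compile(r'\b(from )(\S+)'), r'\1HOST'),
    # "to 10.1.2.3"
    (re.compile(r'\b(to )(\S+)'), r'\1HOST'),
]

def mask(line):
    # Apply each grammar rule in turn; unmatched text passes through.
    for pattern, repl in PATTERNS:
        line = pattern.sub(repl, line)
    return line

print(mask("sshd[123]: Failed password for user alice from 10.1.2.3 port 4022"))
# -> "sshd[123]: Failed password for user USER from HOST port 4022"
```

A handful of rules like these will not catch every message type, but that is the point of the 80/20 approach: a small preposition grammar covers the bulk of the lines, and the remainder can be handled with per-format rules.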
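The rotating-mapping compromise described above might be sketched like this. The class name, token format, and rotation bounds are illustrative assumptions:

```python
import random

class RotatingPseudonymizer:
    """Keep a static object-name -> token mapping, but clear ("rotate")
    the table at a randomly chosen interval so that long-run reverse
    engineering of the mappings becomes harder."""

    def __init__(self, min_lines=500, max_lines=2000, seed=None):
        self.rng = random.Random(seed)
        self.min_lines = min_lines
        self.max_lines = max_lines
        self.mapping = {}
        self.epoch = 0
        self.lines_until_rotate = self.rng.randint(min_lines, max_lines)

    def pseudonym(self, name):
        # Static within an epoch: the same name always gets the same token,
        # so event sequences remain traceable for analysts.
        if name not in self.mapping:
            self.mapping[name] = "obj%d_%d" % (self.epoch, len(self.mapping))
        return self.mapping[name]

    def next_line(self):
        # Call once per log line; rotates the table at a random point.
        self.lines_until_rotate -= 1
        if self.lines_until_rotate <= 0:
            self.mapping = {}
            self.epoch += 1
            self.lines_until_rotate = self.rng.randint(
                self.min_lines, self.max_lines)
```

Within one epoch, "server123" always maps to the same token, preserving event sequences for analysis; after a rotation it receives a fresh token, so a mapping recovered for one stretch of the log tells an attacker nothing about the rest.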
HTH,
Tom

On 1/22/07, Adam Oliner <oliner@private> wrote:
>
> My colleagues and I have obtained access to a large collection of
> system logs from 5 major supercomputers, and are currently working to
> get them into a form such that they are suitable for public release.
> These are raw logs, aggregated, in some cases, from many log-
> generating components (Lustre, netwatch, eventlogs, syslog...). They
> represent, cumulatively, more than 775 million processor-hours.
>
> The primary pieces of data that we are trying to anonymize are
> usernames, group names, pathnames, and IP/hostnames. So, we are
> looking for some input from the log analysis community.
>
> 1) Aside from some possibly-excessive pattern matching, can you
> suggest a good way of masking out this data from the unstructured
> message bodies?
>
> 2) Assuming that all such data was successfully removed, what other
> security concerns would you have? How might we address them?
>
> We would greatly appreciate your help.
>
> Sincerely,
>
> - Adam J. Oliner
> oliner@private
> Department of Computer Science
> Stanford University
>
>
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
This archive was generated by hypermail 2.1.3 : Tue Jan 23 2007 - 09:40:56 PST