| 1) Aside from some possibly-excessive pattern matching, can you
| suggest a good way of masking out this data from the unstructured
| message bodies?

Pattern matching is the way to go, but it doesn't need to be "excessive"; it just needs to be performed in a systematic way. There are some methods to rapidly anonymize your data. For example, if you have access to the users, hosts, and other "objects" on the various systems generating logs, you can rapidly build a dictionary without having to go through the painful process of identifying each type of message.

You can also look at the "object grammar" and usually arrive at an 80/20 or 90/10 solution without having to identify each specific log. For example, instead of identifying each type of Unix message separately, you can look at the prepositions used, such as "from", "to", and "for user". Unix logs are generally easier to anonymize because you won't see spaces in object names nearly as much as on Windows. Most mainframe systems lend themselves to easy anonymization because the messages are either well delimited or have a published schema (e.g. OS/390).

| 2) Assuming that all such data was successfully removed, what other
| security concerns would you have? How might we address them?

Make sure you modify the timestamps. This is a step commonly left out of log anonymization. Even with all object names anonymized, useful security information can be obtained from the timestamps alone (e.g. when various cron jobs run, a time frame an attacker could use to disguise an attack). However, keeping the sequence of events intact will be of great use to log analysis. This is a double-edged sword, as the relative time between events can reveal information as well. For example, suppose I want to exploit a Veritas NetBackup vulnerability, but I only want to launch the attack during the weekly full backup jobs.
You modify the timestamps to mask the time of the job, but you keep the sequence of other events; so while I don't know when the weekly backup job runs, I know it occurs X minutes after Y event.

Another consideration is to choose whether you will replace object names with a static or a variable replacement. For example, will you always replace "server123" with "aaa" and user "john.doe" with "bbb", or will you vary all object name replacements? Again, this is a double-edged sword. There is a lot of value in looking at the sequence of log events, which you can only do effectively if the object names have a static mapping. But you do give away some security, as some information can be reverse engineered from logs with static mappings. One compromise here is to vary the mappings every X lines, i.e. maintain static mappings over each run of X lines only. Or better yet, randomly select when you will rotate object mappings.

Finally, you can consider the concept of "polluting" your own data. If you are concerned that your anonymized data can still be reverse engineered to a degree that may be useful to an attacker, generate fake events in equal proportion across all your devices. For example, generate sshd and Apache events across all your Unix servers. Generate Active Directory, MSSQL, and IIS events across all your Windows servers. If you're concerned about the timing of your batch processes being detected, duplicate the output from your batch jobs at equal time frequencies. You don't need to actually generate the events on the system; just produce them from known log events and insert them into the proper location, modifying the necessary attributes. This way, when you run your anonymizing process, the character and accuracy of the log data is still there (for log analysis), but any attempt to reverse engineer the data will either be useless or will very likely run into many dead ends.
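To make the "object grammar" idea concrete, here is a minimal Python sketch of preposition-based masking for Unix-style messages. The patterns, the sample sshd line, and the USER/HOST replacement tokens are all illustrative assumptions, not a complete rule set:

```python
import re

# Illustrative sketch: mask object names that follow common syslog
# prepositions ("from", "to", "for user"), assuming space-delimited
# Unix-style messages where object names contain no spaces.
PATTERNS = [
    # "for user john.doe" / "for invalid user john.doe"
    (re.compile(r'\b(for (?:invalid )?user )(\S+)'), r'\1USER'),
    # "from 10.1.2.3" / "from host123.example.com"
    (re.compile(r'\b(from )(\S+)'), r'\1HOST'),
    # "to 10.1.2.3"
    (re.compile(r'\b(to )(\S+)'), r'\1HOST'),
]

def mask(line):
    # Apply each grammar rule in turn; unmatched text passes through.
    for pattern, repl in PATTERNS:
        line = pattern.sub(repl, line)
    return line

print(mask("sshd[123]: Failed password for user alice from 10.1.2.3 port 4022"))
# -> "sshd[123]: Failed password for user USER from HOST port 4022"
```

A handful of rules like these will not catch every message type, but that is the point of the 80/20 approach: a small preposition grammar covers the bulk of the lines, and the remainder can be handled with per-format rules.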
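The rotating-mapping compromise described above might be sketched like this. The class name, token format, and rotation bounds are illustrative assumptions:

```python
import random

class RotatingPseudonymizer:
    """Keep a static object-name -> token mapping, but clear ("rotate")
    the table at a randomly chosen interval so that long-run reverse
    engineering of the mappings becomes harder."""

    def __init__(self, min_lines=500, max_lines=2000, seed=None):
        self.rng = random.Random(seed)
        self.min_lines = min_lines
        self.max_lines = max_lines
        self.mapping = {}
        self.epoch = 0
        self.lines_until_rotate = self.rng.randint(min_lines, max_lines)

    def pseudonym(self, name):
        # Static within an epoch: the same name always gets the same token,
        # so event sequences remain traceable for analysts.
        if name not in self.mapping:
            self.mapping[name] = "obj%d_%d" % (self.epoch, len(self.mapping))
        return self.mapping[name]

    def next_line(self):
        # Call once per log line; rotates the table at a random point.
        self.lines_until_rotate -= 1
        if self.lines_until_rotate <= 0:
            self.mapping = {}
            self.epoch += 1
            self.lines_until_rotate = self.rng.randint(
                self.min_lines, self.max_lines)
```

Within one epoch, "server123" always maps to the same token, preserving event sequences for analysis; after a rotation it receives a fresh token, so a mapping recovered for one stretch of the log tells an attacker nothing about the rest.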
HTH,
Tom

On 1/22/07, Adam Oliner <oliner@private> wrote:
>
> My colleagues and I have obtained access to a large collection of
> system logs from 5 major supercomputers, and are currently working to
> get them into a form such that they are suitable for public release.
> These are raw logs, aggregated, in some cases, from many log-
> generating components (Lustre, netwatch, eventlogs, syslog...). They
> represent, cumulatively, more than 775 million processor-hours.
>
> The primary pieces of data that we are trying to anonymize are
> usernames, group names, pathnames, and IP/hostnames. So, we are
> looking for some input from the log analysis community.
>
> 1) Aside from some possibly-excessive pattern matching, can you
> suggest a good way of masking out this data from the unstructured
> message bodies?
>
> 2) Assuming that all such data was successfully removed, what other
> security concerns would you have? How might we address them?
>
> We would greatly appreciate your help.
>
> Sincerely,
>
> - Adam J. Oliner
> oliner@private
> Department of Computer Science
> Stanford University
>
>
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
This archive was generated by hypermail 2.1.3 : Tue Jan 23 2007 - 09:40:56 PST