>>>>> On Fri, 6 Dec 2002 10:33:15 +1100 (Australia/ACT), Darren Reed <avalonat_private> said:

DR> Well, the big problem with lots of data, to me, is not how to collect
DR> it all in a reliable fashion but what do you do with it all?  Do you
DR> just archive it to CD or DVD on a regular basis in case someone with
DR> a warrant comes knocking, or do you generate load graphs or something
DR> else from it?  Who's going to look at all the log output from 20k
DR> nodes when they're all screaming and sending a message every second?
DR> In dealing with so much data, there are logistical problems as much
DR> as technical ones to solve, and they make the technical ones seem
DR> trivial.

Well, the problem boils down to mechanisms to collect the data, systems
to put it somewhere safe (transport), and systems to do big analysis.
We're trying to address the audit transport problem.  We think we've
helped with the collection problem, but that's really a configuration
and policy issue.  Our next step will be analysis.

But we realized years ago that all the transport, e.g. classic syslog,
was crap.  It's the protocol, not the implementations.  When
syslog-reliable finally came out, for better or worse, that's what
we've got.  And it probably sucks less than UDP :-)

We have groups here and around campus that specialize in data mining.
We are trying to figure out how to put our security experience together
with their data-mining magic to get useful information from the raw
data.  But without data, there's nothing to mine.  Without enough data,
mining gives you not-so-useful results.  If the data is low integrity,
you get low-integrity results.

The logistics of big data we're familiar with.  We have users who think
that a terabyte is a good size for a small data set :-)  We're currently
spinning about 120 Tbytes of disk, and expect to hit a petabyte of disk
some time late next year.  We've got HPSS with about 6 Pbytes in it, and
we'll probably grow that to about 50 or 60 Pbytes in the next 2 years.

I'm a packrat.  I've saved every syslog record since 1996 or so:

5007 pitofdespair:/scratch/slocal/tep-test/logs % ls
1994  1996  1997  1998  1999  2000  2001  2002
5008 pitofdespair:/scratch/slocal/tep-test/logs % df -lkh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             5.9G  2.7G  2.9G  47% /
none                  243M     0  243M   0% /dev/shm
/dev/sda3             520G  373G  120G  76% /scratch/slocal

So, see, that's only 373G so far.  That's 2,825,305,174 lines as of the
end of October.  That's pretty manageable.  Our supercomputer users
think that this dataset is "cute", and might be interesting if it ever
grows up :-)
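(If anyone wants to reproduce a tally like that, it's just a pipeline
over the per-year directories.  A minimal sketch, assuming the archive
is all uncompressed text; a gzipped year would want zcat instead of
cat, and be warned this reads a few hundred gigabytes:)

  # count every record under the per-year directories
  # (-print0/-0 in case any file name contains whitespace)
  find . -type f -print0 | xargs -0 cat | wc -l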
DR> So, in a sense, what it comes down to is you only spend serious
DR> effort logging data securely that you care about and the rest goes
DR> to /dev/null, whether directly or indirectly.  If you're doing that
DR> and the messages you are interested in only make up a minority of
DR> those being generated, why do you need such high performance as
DR> opposed to good filtering on the sender(s)?

Because I never know what I'm going to be looking for.  It's like
astronomy, or climate modeling.  If you are looking at a particular
star and throw away the images because you are done studying the star,
then people can't use your images to find Earth-crossing asteroids, or
new planets, 10 years later.  Similar things happen with climate
modeling.  Sometimes you need a long-baseline, wide-spectrum data set
to see long-term trends, or to find out just when some significant
event *really* began, when it was very, very small.

A Cray cycle wasted is lost forever.  A byte that wasn't collected and
saved can never be collected in the future.  It's gone.  Disk space is
cheap; too bad if it turns out you need that byte now.  Also, we're a
research place, so perhaps we just have a warped sense of packratism.

DR> The point I was trying to make was when you're trying to get really
DR> high performance out of standard hardware, you need to tune lots of
DR> corner cases.

Agreed.  Sometimes Moore's Law just isn't enough.  Sometimes you have
to get clever and actually write some slick code instead of just
throwing hardware at it.
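(One concrete corner case from the UDP-syslog world, purely as an
illustration; the numbers here are made up, and the right ones depend
on your kernel and message rate.  On a busy Linux collector, the
default socket receive buffer is a classic place to silently drop
messages:)

  # watch for "packet receive errors" climbing under the Udp: section
  netstat -su

  # let the collector ask for a bigger receive buffer; values are
  # illustrative, and a daemon calling setsockopt(SO_RCVBUF) still
  # can't exceed rmem_max
  sysctl -w net.core.rmem_default=4194304
  sysctl -w net.core.rmem_max=4194304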
--tep

_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
http://lists.shmoo.com/mailman/listinfo/loganalysis