>>>>> On Fri, 6 Dec 2002 10:33:15 +1100 (Australia/ACT), Darren Reed <avalonat_private> said:

DR> Well, the big problem with lots of data, to me, is not how to collect
DR> it all in a reliable fashion but what do you do with it all?  Do you
DR> just archive it to CD or DVD on a regular basis in case someone with
DR> a warrant comes knocking, or do you generate load graphs or something
DR> else from it?  Who's going to look at all the log output from 20k
DR> nodes when they're all screaming and sending a message every second?
DR> In dealing with so much data, there are logistical problems as much
DR> as technical ones to solve, and they make the technical ones seem
DR> trivial.

Well, the problem boils down to mechanisms to collect the data, systems
to put it somewhere safe (transport), and systems to do big analysis.
We're trying to address the audit transport problem.  We think we've
helped with the collection problem, but that's really a configuration
and policy issue.  Our next step will be analysis.

But we realized years ago that all the transport, e.g. classic syslog,
was crap.  It's the protocol, not the implementations.  When
syslog-reliable finally came out, for better or worse, that's what
we've got.  And it probably sucks less than UDP :-)

We have groups here and around campus that specialize in data mining.
We are trying to figure out how to put our security experience together
with their data-mining magic to get useful information from the raw
data.  But without data, there's nothing to mine.  Without enough data,
mining gives you not-so-useful results.  If the data is low integrity,
you get low-integrity results.

The logistics of big data we're familiar with.  We have users who think
that a terabyte is a good size for a small data set :-)  We're currently
spinning about 120 Tbytes of disk, and expect to hit a petabyte of disk
some time late next year.  We've got HPSS with about 6 Pbytes in it, and
we'll probably grow that to about 50 or 60 Pbytes in the next 2 years.

I'm a packrat.  I've saved every syslog record since 1996 or so:

5007 pitofdespair:/scratch/slocal/tep-test/logs % ls
1994  1996  1997  1998  1999  2000  2001  2002
5008 pitofdespair:/scratch/slocal/tep-test/logs % df -lkh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             5.9G  2.7G  2.9G  47% /
none                  243M     0  243M   0% /dev/shm
/dev/sda3             520G  373G  120G  76% /scratch/slocal

So, see, that's only 373G so far.  That's 2,825,305,174 lines as of the
end of October.  That's pretty manageable.  Our supercomputer users
think that this dataset is "cute", and might be interesting if it ever
grows up :-)
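(If anyone wants to reproduce a tally like that, it's just a pipeline
over the per-year directories.  A minimal sketch, assuming the archive
is all uncompressed text; a gzipped year would want zcat instead of
cat, and be warned this reads a few hundred gigabytes:)

  # count every record under the per-year directories
  # (-print0/-0 in case any file name contains whitespace)
  find . -type f -print0 | xargs -0 cat | wc -l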
DR> So, in a sense, what it comes down to is you only spend serious
DR> effort logging data securely that you care about and the rest goes
DR> to /dev/null, whether directly or indirectly.  If you're doing that
DR> and the messages you are interested in only make up a minority of
DR> those being generated, why do you need such high performance as
DR> opposed to good filtering on the sender(s)?

Because I never know what I'm going to be looking for.  It's like
astronomy, or climate modeling.  If you are looking at a particular
star and throw away the images because you are done studying the star,
then people can't use your images to find Earth-crossing asteroids, or
new planets, 10 years later.  Similar things happen with climate
modeling.  Sometimes you need a long-baseline, wide-spectrum data set
to see long-term trends, or to find out just when some significant
event *really* began, when it was very, very small.

A Cray cycle wasted is lost forever.  A byte that wasn't collected and
saved can never be collected in the future.  It's gone.  Disk space is
cheap; too bad if it turns out you need that byte now.  Also, we're a
research place, so perhaps we just have a warped sense of packratism.

DR> The point I was trying to make was when you're trying to get really
DR> high performance out of standard hardware, you need to tune lots of
DR> corner cases.

Agreed.  Sometimes Moore's Law just isn't enough.  Sometimes you have
to get clever and actually write some slick code instead of just
throwing hardware at it.
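(One concrete corner case from the UDP-syslog world, purely as an
illustration; the numbers here are made up, and the right ones depend
on your kernel and message rate.  On a busy Linux collector, the
default socket receive buffer is a classic place to silently drop
messages:)

  # watch for "packet receive errors" climbing under the Udp: section
  netstat -su

  # let the collector ask for a bigger receive buffer; values are
  # illustrative, and a daemon calling setsockopt(SO_RCVBUF) still
  # can't exceed rmem_max
  sysctl -w net.core.rmem_default=4194304
  sysctl -w net.core.rmem_max=4194304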
--tep

_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
http://lists.shmoo.com/mailman/listinfo/loganalysis