tbird> 1) What sort of state changes "should" applications and operating systems
tbird> log in the first place? --> A standard for programmers
[and big list of possible categories]

I'll add one to that: resource utilization. I'm thinking of questions like "how much memory did it take to complete this task" (or system/user time, or whatever). This is going to be a lot more application specific, but I want it in the interest of planning. If I'm going to build a better server for foo, then these sorts of questions are the ones I need answered.

tbird> - Object access: failed and successful attempts to read files, start or
tbird> stop processes, etc (understanding that most organizations will not need
tbird> or want this level of detail)

I'll expand this one: object processing. What was done to an object and the outcome of that action. Mail queue ID was passed on to recipient after passing through virus filter; packet was dropped according to rule n... I think that most organizations -would- want this level of detail, at least for some applications, if only for the pretty graphs they can generate.

tbird> 3) Given a particular operating system and/or system purpose, what are
tbird> (pick your favorite integer) 15 messages that pretty much always mean bad
tbird> news: that the system has been compromised, that a catastrophic failure
tbird> has happened, however we choose to define "bad news" for that "typical"
tbird> environment? What >>is<< "bad news"? Do we have sample data?

The problem here is that we can define a very small number of states in which a machine can be thought of as working properly. However, there are many more states in which it is working improperly. I'm sure we could come up with 15 great signs for really bad news, but I would argue that if you see one of those in your log file, you're already hosed. What I want is the news 15 minutes prior that tells me the system has slipped out of optimal state.
Rarely does a system go from fully functional to critical (except when I get my rock hammer out) -- it slips, bit by bit, and we should be able to detect this (we can't always -- go back to question 1 and start logging the appropriate data). The new red background on your website was probably preceded by a number of those innocuous-looking login failures, possibly from strange locations. The disk failure was likely preceded by SCSI bus errors. And so on.

I'd be willing to crunch more sample log data, if you'd like.

Of course, we could start the log parsing debate up again as well. This is a question of where we're going vs. where we are. Perhaps my suggestion would be to have a bunch of sample signatures that we could pop into swatch or logcheck that would weed out (or in) some of the most common messages. Include comments.

--rowan

-- 
John "Rowan" Littell
Systems Administrator
Earlham College Computing Services
http://www.earlham.edu/~littejo/
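
P.S. To make the "sample signatures, with comments" idea a bit more concrete, here is a rough sketch in Python rather than in actual swatch or logcheck syntax. Everything in it is an assumption for illustration: the signature names, the regular expressions, the "flag"/"ignore" actions, and the helper functions classify() and summarize() are all made up here and are not taken from any shipped ruleset. The point is only that each pattern carries a comment, and that flagged hits are counted so a slow slide (a rising number of login failures or SCSI errors) is visible before the real bad news arrives.

```python
import re
from collections import Counter

# Hypothetical signatures: (name, pattern, action, comment).
# The patterns are illustrative guesses, not a vetted ruleset.
SIGNATURES = [
    ("login_failure",
     re.compile(r"failed password|authentication failure", re.I),
     "flag",
     "Harmless one at a time; a burst may precede a compromise."),
    ("scsi_error",
     re.compile(r"scsi.*(error|reset|timeout)", re.I),
     "flag",
     "Bus errors may precede a disk failure."),
    ("cron_session",
     re.compile(r"cron.*session (opened|closed)", re.I),
     "ignore",
     "Routine noise; weed it out."),
]


def classify(line):
    """Return (name, action) for the first matching signature.

    Lines that match nothing come back as ("unknown", "review") so
    that unfamiliar messages are looked at rather than dropped.
    """
    for name, pattern, action, _comment in SIGNATURES:
        if pattern.search(line):
            return name, action
    return "unknown", "review"


def summarize(lines):
    """Count hits per flagged signature across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        name, action = classify(line)
        if action == "flag":
            counts[name] += 1
    return counts
```

Feeding summarize() one window of log lines at a time (say, every 15 minutes) and comparing the counts against the previous window would be one way to notice the drift; the window size and any thresholds are site-specific and are not something this sketch tries to decide.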