Tina Bird <tbird@precision-guesswork.com> writes:

> Bruce Schneier and I are working on a whitepaper on best practices
> for building an enterprise logging infrastructure (a direct result
> of a couple of the conversations on this list). I've been pondering
> what should go in it. Does any one out there have ideas or
> references about de facto standards in logging for particular
> applications or OSes? We were having a, um, discussion about Oracle
> logging...

Not sure exactly what "building an enterprise logging infrastructure"
is, but here is an offering based on several years' experience with
the data network for a small computer telephony company. As you will
see, I have trouble separating logging from monitoring. HTH.

1. There are two main kinds of log information.

   A. Management Logging. Logs generated by apps during normal
      operation. May be used for billing, customer reports, resource
      planning, etc. May be primary data source for these or
      secondary source used to confirm reports coming from
      applications.

   B. Operations Logging. Logs generated by the monitoring system.
      May result from polling or from triggered notification (traps);
      the distinction blurs when there are multiple layers of
      information gathering.

2. Work with applications architects and programmers.

   You will get the most out of your log output if you can work with
   your software engineers on what will be produced as each
   application is written. For us, this means high-tier tech ops gets
   in on design meetings from time to time, and sees sneak previews
   of log output from apps under development.

   We do not have a single all-encompassing logging spec that must be
   followed for all resources. We tried that approach but found it
   led to lots of meetings and vague documents but nothing anyone
   wanted to use in real life with real delivery schedules. Instead,
   we have a small set of guidelines. The role of ops during design
   and programming is to offer gentle but persistent reminders of the
   guidelines.

3.
Log everything that is relevant, and nothing else.

   Code that formats a log message must include all obvious clues as
   to what is going on. This is just Error Messages 101:

   Examples of bad messages:

      transaction complete
      file not found

   Examples of better messages:

      application xyz001 transaction id 1234567 complete status 001
      service ttsd file /etc/ttsd.conf not found - aborting

   OTOH, it is not the job of a process at OSI Layer 7 to diagnose a
   problem at Layer 3, e.g. it should not go about issuing pings and
   such to put into the log message when a SQL INSERT fails.

   For each software system, decide what the atomic events are, and
   log all of them.

4. Use existing open protocols and data formats.

   We don't want to reinvent the wheel, nor lock ourselves into a
   single vendor. Output in proprietary formats gets converted to
   ASCII before anything else happens.

   We use a pragmatic mix of syslog, SNMP, and ad hoc protocols to
   move logging information around. If SNMP is available for a
   resource, we use it, often polling and collecting traps on a
   syslog server; if it would take too long to support SNMP, we go
   with something quicker to implement.

   Data format is usually free-form ASCII. We gave up trying to guess
   today what fields are needed in tomorrow's logs.

5. Use logging priorities consistently.

   There are at least three priorities at which to log:

      normal events (LOG_INFO)
         e.g.: transaction complete
      errors with no detected loss of resource (LOG_NOTICE)
         e.g.: invalid account number
      resource unavailable event (LOG_ERR)
         e.g.: server unreachable

   (When using syslog, we don't use all eight priorities.)

   Loss of resource and return of resource are logged at the same
   priority.

   Don't bother putting in lots of knobs for various levels of
   verbosity and content - that's what grep is for!

6. Keep log delivery simple.

   Keep to an absolute minimum the number of steps between the system
   creating log information and the person who needs it.
   The more complex a system is to configure and maintain, the less
   likely it is to be used.

   Avoid glitz and eye candy. The #1 most successful use of logging
   we have today simply scoops up new log content, looks for
   interesting items, and emails selected staff. This is after
   multiple generations of all sorts of more complicated stuff. We
   still run fancy GUIfied monitoring screens, but that is mainly for
   the visitors. :)

7. Never trust a system to police itself.

   Every major resource must be monitored constantly for
   availability, or it will go away. But, systems need to be probed
   from the outside. What happens if someone trips over the power
   cord?
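As a rough illustration of items 3 and 5 together, here is a minimal Python sketch of a helper that always names the source and carries the identifying clues in each message, and that uses only three priorities. The names Priority, format_event and log_event are made up for this example, and the "emit" hook stands in for whatever transport is really in use (syslog, a file, and so on). The numeric values are the standard syslog severities for err, notice and info.

```python
from enum import IntEnum


class Priority(IntEnum):
    """The three priorities from item 5, with their syslog severity values."""
    ERR = 3     # resource unavailable, and also resource returned
    NOTICE = 5  # error with no detected loss of resource
    INFO = 6    # normal event


def format_event(source, event, **context):
    """Build one free-form ASCII line: who, what, then every clue we have."""
    clues = " ".join(f"{key} {value}" for key, value in context.items())
    return " ".join(part for part in (source, event, clues) if part)


def log_event(priority, source, event, emit=print, **context):
    """Format the event, hand it to the transport, and return the line."""
    line = f"{priority.name} {format_event(source, event, **context)}"
    emit(line)  # hypothetical transport hook; defaults to stdout
    return line
```

A call such as log_event(Priority.INFO, "application xyz001", "transaction complete", id=1234567, status="001") produces a line in the spirit of the "better messages" above. Loss and return of a resource would both go out as Priority.ERR, so one grep on the priority finds both ends of an outage.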
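The "scoop up new content, look for interesting items, email staff" approach from item 6 might look roughly like the sketch below. It is not the script described above, just one way to do it under some assumptions: the log is a plain ASCII file that only grows (or is truncated on rotation), the caller remembers the byte offset between runs, and the INTERESTING pattern and all addresses are placeholders.

```python
import re
from email.message import EmailMessage

# Placeholder pattern; a real site would tune this to its own messages.
INTERESTING = re.compile(r"not found|unreachable|aborting", re.IGNORECASE)


def read_new_lines(path, offset):
    """Return (complete lines appended since offset, new offset)."""
    with open(path, "rb") as log:
        log.seek(0, 2)           # jump to end of file to learn its size
        if log.tell() < offset:  # file shrank: rotated or truncated
            offset = 0
        log.seek(offset)
        data = log.read()
    end = data.rfind(b"\n") + 1  # leave a half-written last line for next run
    lines = data[:end].decode("ascii", errors="replace").splitlines()
    return lines, offset + end


def build_alert(lines, sender, recipients):
    """Return an email holding the interesting lines, or None if there are none."""
    hits = [line for line in lines if INTERESTING.search(line)]
    if not hits:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"{len(hits)} interesting log line(s)"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(hits) + "\n")
    return msg
```

Run from cron, the caller would store the returned offset somewhere simple (a small state file, say) and hand any non-None message to the local mailer, for example with smtplib.SMTP("localhost").send_message(msg). That keeps the number of moving parts between the log and the reader about as low as it can go.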
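For item 7, the simplest outside probe is a TCP connection attempt made from a different machine than the one being watched. The sketch below assumes the resource listens on a TCP port; the function name, host and port are illustrative only. A successful connect shows only that something is accepting connections, not that the service behind it is healthy.

```python
import socket


def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, name lookup failed
        return False
```

The monitoring host would call probe() on a schedule and log a change in the result at the resource-unavailable priority, both when the server goes away and when it comes back.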
This archive was generated by hypermail 2b30 : Wed Oct 10 2001 - 11:07:37 PDT