RE: [logs] Charset selection (Was: Re: EventLog library)

From: Rainer Gerhards (rgerhardsat_private)
Date: Thu Jan 09 2003 - 11:37:59 PST


    Mmmmmhhh - a very good explanation of why we should stick with 7-bit chars. However, my actual experience with the Japanese market - and recently a Chinese customer - tells me that it is *really* important for those cultures to receive messages in their native scripts. Think of it this way: once DBCS text is encoded into an escaped form (e.g. %<hex><hex>), it is no longer human-readable. The end result is that the humans won't look at the logs because the logs are no longer readable to them. You also raise an acceptance problem with such an "ignorant western point of view" (please note the QUOTES, not my opinion but the way I see it received!).
    And the big thing is that this is the case for almost all non-English-speaking countries. Take the French. As in all European languages I know, there are some extensions to US-ASCII to cover a few local characters (e.g. ä, ö, ü in German). These characters routinely appear in human-generated messages, system-generated messages and even user names (at least under win32). Escaping them all does not make the logs more readable. And it would be very hard to argue with a French admin that these characters are kind of non-standard...
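    To make the readability point concrete, here is a minimal sketch of the kind of %<hex><hex> escaping discussed above. It is an illustration under assumptions, not a scheme anyone on the list has specified: the function name escape_to_7bit is hypothetical, and the choice to escape each non-ASCII character as the bytes of a caller-supplied charset is only one of several plausible designs.

```python
def escape_to_7bit(text: str, encoding: str) -> str:
    """Hypothetical helper: keep printable US-ASCII, %-escape everything else.

    Each character outside the printable 7-bit range (and the '%' sign
    itself, so the result stays unambiguous) is replaced by %XX for every
    byte of its representation in `encoding`.
    """
    out = []
    for ch in text:
        if " " <= ch <= "~" and ch != "%":
            out.append(ch)  # printable US-ASCII passes through unchanged
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode(encoding))
    return "".join(out)


# A single accented letter is still tolerable for a human reader...
print(escape_to_7bit("Login failed for Müller", "latin-1"))
# ...but a message written entirely in a double-byte script becomes
# a wall of hex that nobody will read by eye.
print(escape_to_7bit("ログイン失敗", "shift_jis"))
```

    The escaped form is safe to carry over a 7-bit channel, but for a double-byte script nothing of the original message remains readable without a decoding step.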
    So, yes, I see the security concerns you raise and I also think they are very valid. But on the other hand, I opt for the second-best solution, which is to allow 8 bit characters. I am doing so in favour of acceptance and human-readability of the logs.
    Regarding the problem of displaying unknown, differing character sets: this definitely is a weakness. However, my experience shows that typically only a single script is in the logs. Sure, this is a problem if you are a big guy consolidating data from different countries onto a central host. But then I would suggest using a relay that takes SELP messages and forwards them via RFC3195 COOKED - there you can specify the proper charset...
    OK, that was the general thing. Back to the protocol: I have to admit I need to think a little more about how this could be included. Of course, a big warning paragraph would be needed for those using 8-bit chars. I have some solution in my mind... Give me a few more minutes, I'll post something when I have thought a little more about it.
    How do you feel about the general idea I explained above?
    > 2003-01-08T20:34:34 Darren Reed:
    > > Is there a compelling reason to keep traffic between log daemons in 
    > > "text strings" rather than wrap them up in something else with byte 
    > > counts and no CR-LF stuff and just exchange typed data in a manner 
    > > that allows you to be ignorant of what character set is in use ?
    > I'd argue rather that if we aren't going to ignore this 
    > issue, we should settle it by mandating strict 7-bit US-ASCII 
    > printables in the normal 8bit embedding. If we produce a 
    > specification or implementation that's tolerant of 8bit 
    > messages, we're setting ourselves up for a bomb to go off 
    > under our kiesters down the road, when different log text 
    > processors apply radically different interpretations to the 
    > exact same logged message --- and some of those 
    > interpretations tickle bugs causing security problems.
    > If instead we force people who want to syslog kanji, or 
    > accented characters, or anything else outside of strict 7bit 
    > US-ASCII to go with some encoding onto US-ASCII, like e.g. 
    > SGML entity references; then we'd have the characteristic 
    > that implementations would have the privilege of being blind 
    > to charsets without running a risk of introducing security problems.
    > This isn't a critique of the appropriateness of the general 
    > concept of being binary-transparent and letting people pick 
    > interpretations that suit 'em; in many venues that works 
    > really well. But logging tends to lie fairly near to security 
    > concerns, and right now charsets are a fraught area, with 
    > different people advocating different solutions, applying 
    > different interpretations to 8-bit-binary data, and in some 
    > cases opening unexpected ways to slip dangerous embedded 
    > characters past screeners trying to block them.
    > Suppose someone wants to write a nice generic logfile viewer, 
    > that presents sliced-n-diced log data to a web browser. 
    > They're already going to be having to escape "<", ">", and 
    > "&" in the logged text before croaking it out at the browser. 
    > Let's not force them to also know every possible way anyone 
    > can ever invent to encode those in any possible multibyte charset.
    > -Bennett
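    The concern quoted above, that different log processors may interpret the exact same bytes differently, can be sketched in a few lines. This is only an illustration under assumptions: the helper name render_for_browser is hypothetical, and Shift_JIS versus Latin-1 is just one convenient pair of charsets that shows the effect.

```python
import html

# The katakana "ソ" encodes in Shift_JIS as the two bytes 0x83 0x5C.
# The second byte, 0x5C, is the backslash in US-ASCII and Latin-1.
raw = "ソ".encode("shift_jis")

as_sjis = raw.decode("shift_jis")    # one harmless Japanese character
as_latin1 = raw.decode("latin-1")    # a control character plus a backslash

print(repr(as_sjis), repr(as_latin1))


def render_for_browser(message: bytes, charset: str) -> str:
    """Hypothetical viewer step: decode with a declared charset, then escape.

    Escaping "<", ">" and "&" is only reliable once the bytes have been
    decoded with the charset the sender actually used.
    """
    text = message.decode(charset, errors="replace")
    return html.escape(text)


print(render_for_browser(b"<script>alert(1)</script>", "ascii"))
```

    A screener that assumes a single-byte charset sees a backslash where a Shift_JIS-aware viewer sees none, which is the kind of mismatch the quoted text warns about.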
    LogAnalysis mailing list
