RE: [logs] Charset selection (Was: Re: EventLog library)

From: Rainer Gerhards (rgerhardsat_private)
Date: Wed Jan 08 2003 - 03:48:43 PST


    > > Well, isn't UTF-8 a kind of DBCS encoding? And have you followed the
    > > limited acceptance Unicode receives in Japan? The problem is
    > > statements like yours, IMHO. If I were Japanese, I wouldn't like to
    > > read that the encoding I need to use to make things work is "evil".
    > Japanese and Chinese writing systems are evil, too :)  </flamebait>
    > Hrm, I might have gone a bit overboard there. DBCS using lead 
    > bytes might still be easy to use (it doesn't insert NULs, does it?).
    No, it does not. From a conceptual point of view, it is very close to
    UTF-8. And there are no NULs in them <relief />. We have worked with
    Japanese encodings, but from the docs I have, this is quite the same for
    Chinese, Korean and Vietnamese, which belong to the same script family.
    However, there ARE control characters (with ANSI values below 0x20) in
    the stream. Nice things like CR and LF. You need to parse the lead/trail
    bytes to avoid accidentally parsing them as terminators.
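    As a rough illustration of the lead/trail parsing described above, the
    sketch below searches a Shift_JIS byte stream for a delimiter while
    skipping two-byte characters as a unit. It assumes Shift_JIS as
    implemented by Python's shift_jis codec; the helper names is_sjis_lead
    and find_delimiter are made up for this example. The colliding value
    shown is 0x5C (backslash), a well-known Shift_JIS trail byte value.

```python
def is_sjis_lead(b: int) -> bool:
    """True if b is a Shift_JIS lead byte (first byte of a two-byte character)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_delimiter(data: bytes, delim: int) -> int:
    """Return the index of the first real occurrence of delim, or -1.

    A byte equal to delim that is the trail byte of a two-byte
    character is not reported.
    """
    i = 0
    while i < len(data):
        if is_sjis_lead(data[i]):
            i += 2  # skip the lead byte and its trail byte together
            continue
        if data[i] == delim:
            return i
        i += 1
    return -1


# The first character's trail byte is 0x5C, the same value as a backslash.
sample = "表示\\log".encode("shift_jis")
naive = sample.find(b"\\")             # stops inside the first character
aware = find_delimiter(sample, 0x5C)   # finds the real backslash
```

    A byte-wise search that ignores lead bytes reports the trail byte of the
    first character, while the lead-byte-aware scan reports the real
    delimiter.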
    Well, and here the standards trouble begins: it is simply *impossible* to
    write an (e.g.) RFC3164-compatible daemon supporting Japanese. RFC3164
    prohibits CRLF in the message part, but we do have these byte values in
    there due to DBCS. I am not sure about UTF-8, but I think (I do not know
    exactly) it is kind of the same story (not done my homework today ;)).
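    On the open UTF-8 question, a quick experiment may help. By the
    definition of UTF-8, every byte of a multi-byte sequence has its high bit
    set (0x80 or above), so values such as NUL, CR or LF only ever appear as
    themselves. The sketch below collects the byte values below 0x80 that
    occur inside multi-byte characters of a sample string. The helper name
    low_bytes_in_multibyte is hypothetical, and the result only speaks for
    the sample text and for Python's codecs.

```python
def low_bytes_in_multibyte(text: str, encoding: str) -> list:
    """Byte values below 0x80 found inside multi-byte characters of text."""
    found = set()
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1:  # only look at multi-byte characters
            found.update(b for b in encoded if b < 0x80)
    return sorted(found)


sample = "表ソ"
utf8_low = low_bytes_in_multibyte(sample, "utf-8")
sjis_low = low_bytes_in_multibyte(sample, "shift_jis")
euc_low = low_bytes_in_multibyte(sample, "euc_jp")
```

    For this sample, only Shift_JIS yields an ASCII-range value (0x5C); the
    UTF-8 and EUC-JP encodings yield none.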
    > I was thinking more along the lines of Win32 Unicode, which I 
    > do believe is nothing but evil, partly from a 
    > storage/protocol point of view, but mostly from a programming 
    > point of view. 
    Is it? We now support Unicode internally in almost all apps, and things
    have become much easier than when dealing with DBCS. At least you now
    again have the idea that one "unit" (16 bits in Win32) is one
    "character". Sure, as soon as you hit the outside world, the fun begins
    ;) For example, it is especially "helpful" that Win32 emits SJIS (or was
    it JIS?) encoding by default when it translates Japanese Unicode while
    the Unix world expects EUC inward. Of course, you can use the Win32 mapping
    functions, but these rely heavily on Internet Explorer. So it might be
    better to roll your own translator. Once this is done, it is not so
    Is it a better experience on *nix? (Hey, this is an honest question!
    Never worked with Unicode on *nix...)
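    To make the SJIS/EUC mismatch concrete, here is a minimal sketch that
    re-encodes text by going through Unicode, using Python's built-in
    shift_jis and euc_jp codecs rather than the Win32 mapping functions. The
    helper names transcode and utf16_units are made up for this example. It
    also counts 16-bit units in little-endian UTF-16, the layout Win32 uses.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from src to dst by decoding to Unicode first."""
    return data.decode(src).encode(dst)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed for text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


text = "日本語"
sjis = text.encode("shift_jis")
euc = transcode(sjis, "shift_jis", "euc_jp")   # same text, different bytes
back = transcode(euc, "euc_jp", "shift_jis")   # round trip

ascii_as_utf16 = "A".encode("utf-16-le")       # ASCII text gains a NUL byte
```

    Note that the "one unit is one character" idea holds for these samples,
    but a character outside the Basic Multilingual Plane takes two 16-bit
    units.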
    > I've been forced to deal with unicode in the past, only to get
    > tripped up by such trivial facts as "how the HELL do you
    > store a unicode string in an SQL database?  -- Whoops, can't
    > be done, unless you store it as a blob, and then you can't
    > search on it".
    That indeed is a big issue. You can store it in an on-the-wire format,
    e.g. UTF-8 or SJIS. This will make at least the ANSI part searchable. Of
    course, it needs conversion when going to/from the database.
    Ah, and, yes, Microsoft SQL Server 7.x upwards has the NVARCHAR/NCHAR
    datatype which is native Unicode :-). I'd like to see this in more
    databases AND in a consistent (preferably standard) way...
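    The "store it in an on-the-wire format" idea can be sketched with
    SQLite from Python's standard library. A BLOB column stands in here for
    a database without a native Unicode type; the table name log, the column
    msg and the helper search_ascii are assumptions made for this example.
    Messages are converted to UTF-8 on the way in and back on the way out,
    and the 7-bit part stays searchable byte-wise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, msg BLOB)")

messages = [
    "ディスク error on /dev/sda",
    "ログイン成功 user=root",
    "plain ASCII warning",
]
for m in messages:
    # Convert to the wire format (UTF-8 bytes) before storing.
    conn.execute("INSERT INTO log (msg) VALUES (?)", (m.encode("utf-8"),))


def search_ascii(needle: str) -> list:
    """Find rows whose stored bytes contain the ASCII needle."""
    rows = conn.execute(
        "SELECT msg FROM log WHERE instr(msg, ?) > 0 ORDER BY id",
        (needle.encode("ascii"),),
    )
    # Convert back from the wire format when reading.
    return [row[0].decode("utf-8") for row in rows]
```

    For example, search_ascii("error") returns only the first message, with
    its Japanese part intact after decoding.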
    > UTF-8 doesn't really have such problems.  It can be
    > copied/stored/etc with normal string management routines, as
    > long as you keep the string intact and don't truncate it.  Is
    > this also the case with DBCS encoding?
    Yes, same story. Just make sure that you don't mess with the byte
    values that look like control characters but are actually part of a
    lead/trail byte pair.
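    The truncation caveat can be handled for UTF-8 with a small amount of
    care. The sketch below cuts a UTF-8 byte string at a length limit but
    backs up so the cut never lands inside a multi-byte character. The
    helper name truncate_utf8 is hypothetical, and the input is assumed to
    be valid UTF-8. A DBCS version would need a forward scan over lead
    bytes instead, since trail bytes cannot always be recognised in
    isolation.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most limit bytes without splitting a character."""
    if len(data) <= limit:
        return data
    cut = limit
    # Continuation bytes have the bit pattern 10xxxxxx. If the first
    # dropped byte is one, the cut is mid-character, so back up to the
    # start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


msg = "ログ".encode("utf-8")      # two characters, three bytes each
safe = truncate_utf8(msg, 4)      # backs up to the 3-byte boundary
```

    Here a plain msg[:4] would leave a dangling lead byte, while the safe
    cut still decodes cleanly.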
    LogAnalysis mailing list

    This archive was generated by hypermail 2b30 : Wed Jan 08 2003 - 08:32:19 PST