Re: [logs] Charset selection (Was: Re: EventLog library)

From: Mikael Olsson (mikael.olssonat_private)
Date: Thu Jan 09 2003 - 16:13:18 PST

    Rainer Gerhards wrote:
    > 
    > From a conceptual point of view, DBCS is very close to
    > UTF-8. And there are no NULs in them <reliev />. We have worked with
    > Japanese encodings, but from the docs I have this is quite the same for
    > Chinese, Korean and Vietnamese, which belong to the same script family.
    > However, there ARE control characters (with ANSI values below 0x20) in
    > the stream. Nice things like CR and LF. You need to parse the lead/trail
    > bytes to avoid accidentally parsing them as terminators.
    
    Just a quick FYI here:
    
    UTF-8 doesn't have these (C0) control character issues. Every byte
    of a multi-byte UTF-8 sequence has the highest bit set, so a byte
    below 0x80 in the stream is always the ASCII character itself. [1]
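
    A minimal sketch of that property, in Python. The helper name and the
    sample strings are hypothetical and not from this thread. Shift_JIS
    stands in for "a DBCS" here; note that its trail byte in this sample
    is 0x5C (an ASCII-range byte, not a C0 control), so it only
    illustrates the general lead/trail parsing problem.

```python
# Illustration only: helper name and sample data are assumptions.

def low_bytes_in_multibyte_chars(text, encoding):
    """Collect bytes below 0x80 that occur inside multi-byte characters."""
    found = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1:
            found.extend(b for b in encoded if b < 0x80)
    return found

sample = "\u30bd"  # KATAKANA LETTER SO

# UTF-8: no byte of a multi-byte sequence falls in the ASCII range.
utf8_low = low_bytes_in_multibyte_chars(sample, "utf-8")

# Shift_JIS: this character encodes as 0x83 0x5C, so the trail byte
# collides with ASCII backslash.
sjis_low = low_bytes_in_multibyte_chars(sample, "shift_jis")

# Consequence for log parsing: a UTF-8 stream can be split on a raw LF
# byte without decoding it first.
record = "\u30bd\u30d5\u30c8\n\u30ed\u30b0\n".encode("utf-8")
lines = [part.decode("utf-8") for part in record.split(b"\n")]
```

    With UTF-8 the splitter never needs to track lead/trail state; with a
    DBCS it has to, or it risks cutting a character in half.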
    
    I suspect this is one of the strong reasons why UTF-8, rather
    than DBCS, is making its way into Internet standards.
    
    -- 
    Mikael Olsson, Clavister AB
    Storgatan 12, Box 393, SE-891 28 ÖRNSKÖLDSVIK, Sweden
    Phone: +46 (0)660 29 92 00   Mobile: +46 (0)70 26 222 05
    Fax: +46 (0)660 122 50       WWW: http://www.clavister.com
    
    [1] It _can_ of course produce bytes in the C1 control range
        (0x80-0x9f), but that tends to be much less of a problem as far
        as automated parsing is concerned.
    _______________________________________________
    LogAnalysis mailing list
    LogAnalysisat_private
    http://lists.shmoo.com/mailman/listinfo/loganalysis