Re: [logs] Charset selection (Was: Re: EventLog library)

In reply to: Rainer Gerhards: "RE: [logs] Charset selection (Was: Re: EventLog library)"

mikael.olssonat_private

Rainer Gerhards wrote:
> 
> From a conceptual point of view, DBCS is very close to
> UTF-8. And there are no NULs in them <relief />. We have worked with
> Japanese encodings, but from the docs I have this is quite the same for
> Chinese, Korean and Vietnamese, which belong to the same script family.
> However, there ARE control characters (with ANSI values below 0x20) in
> the stream. Nice things like CR and LF. You need to parse the lead/trail
> bytes to avoid accidentally parsing them as terminators.

Just a quick FYI here:

UTF-8 doesn't have these (C0) control character issues. Every byte of
a multi-byte UTF-8 sequence has the highest bit set, so a byte below
0x80 can only be the ASCII character it looks like. [1]
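
A rough way to check this for yourself, as a sketch rather than anything
authoritative: the Python snippet below encodes a range of non-ASCII code
points as UTF-8 and looks for bytes below 0x80, then does the same for one
DBCS character. The helper name low_bytes is made up for this example, and
Shift_JIS is just my pick of a DBCS codec. Note the assumption: in Shift_JIS
the overlapping trail byte is a printable ASCII value (0x5C, backslash),
not CR or LF, so this illustrates the lead/trail parsing problem in general
rather than the exact control-character case quoted above.

```python
# Sketch only: low_bytes is a hypothetical helper, not a library function.

def low_bytes(ch, encoding):
    """Return the bytes below 0x80 that appear in the encoding of ch."""
    return [b for b in ch.encode(encoding) if b < 0x80]

# UTF-8: no non-ASCII code point in this range yields a byte below 0x80.
utf8_clean = all(not low_bytes(chr(cp), "utf-8") for cp in range(0x80, 0x3100))

# Shift_JIS (assumed DBCS example): the trail byte of U+8868 is 0x5C,
# which is the same value as the ASCII backslash.
sjis_low = low_bytes("\u8868", "shift_jis")

# A naive byte-level search for an ASCII delimiter therefore gives a
# false hit on the Shift_JIS bytes, but not on the UTF-8 bytes.
utf8_false_hit = b"\\" in "\u8868".encode("utf-8")
sjis_false_hit = b"\\" in "\u8868".encode("shift_jis")

print(utf8_clean, sjis_low, utf8_false_hit, sjis_false_hit)
```

The practical upshot is that a parser can scan UTF-8 for ASCII delimiters
byte by byte, whereas a DBCS stream has to be walked lead byte by lead byte
first.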

I'm thinking that this is one of the strong reasons why UTF-8,
rather than DBCS, is making its way into Internet standards.

-- 
Mikael Olsson, Clavister AB
Storgatan 12, Box 393, SE-891 28 ÖRNSKÖLDSVIK, Sweden
Phone: +46 (0)660 29 92 00   Mobile: +46 (0)70 26 222 05
Fax: +46 (0)660 122 50       WWW: http://www.clavister.com

[1] It _can_ of course produce bytes in the C1 control range
    (0x80-0x9f), as continuation bytes of multi-byte sequences, but
    that tends to be much less of a problem as far as automated
    parsing is concerned.
_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
http://lists.shmoo.com/mailman/listinfo/loganalysis