Rainer Gerhards wrote:
>
> From a conceptual point of view, DBCS is very close to
> UTF-8. And there are no NULs in them <reliev />. We have worked with
> Japanese encodings, but from the docs I have this is quite the same for
> Chinese, Korean and Vietnamese, which belong to the same script family.
> However, there ARE control characters (with ANSI values below 0x20) in
> the stream. Nice things like CR and LF. You need to parse the lead/trail
> bytes to avoid accidentally parsing them as terminators.

Just a quick FYI here: UTF-8 doesn't have these (C0) control character
issues. Every byte of a multi-byte UTF-8 sequence has the highest bit
set, so a byte below 0x20 in the stream is always a genuine control
character and never part of another character. [1]

I'm thinking that this is one of the strong reasons why UTF-8, rather
than DBCS, is making its way into Internet standards.

-- 
Mikael Olsson, Clavister AB
Storgatan 12, Box 393, SE-891 28 ÖRNSKÖLDSVIK, Sweden
Phone: +46 (0)660 29 92 00
Mobile: +46 (0)70 26 222 05
Fax: +46 (0)660 122 50
WWW: http://www.clavister.com

[1] The encoded stream _can_ of course contain bytes in the C1 control
range (0x80-0x9f), but that tends to be much less of a problem as far
as automated parsing is concerned.

_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
http://lists.shmoo.com/mailman/listinfo/loganalysis
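
The point about UTF-8 and C0 control characters can be checked with a short Python sketch. The sample text and the helper name `split_log_lines` are made up for illustration and are not from the thread. The sketch only shows that a byte-level split on LF is safe for UTF-8 input, because no byte of a multi-byte character can fall below 0x80.

```python
# Illustrative sketch only: the sample text and helper name are assumptions.

def split_log_lines(raw: bytes) -> list:
    """Split a UTF-8 byte stream on LF without decoding it first."""
    return raw.split(b"\n")


sample = "användare Örnsköldsvik\n日本語のログ\n"
encoded = sample.encode("utf-8")

# Every byte that belongs to a non-ASCII character has the high bit set.
for ch in sample:
    ch_bytes = ch.encode("utf-8")
    if len(ch_bytes) > 1:
        assert all(byte >= 0x80 for byte in ch_bytes)

# So the only bytes below 0x20 are the two real line feeds.
control_bytes = [byte for byte in encoded if byte < 0x20]

# Splitting on the raw LF byte therefore never cuts a character in half,
# and each resulting piece decodes cleanly on its own.
lines = split_log_lines(encoded)
decoded = [line.decode("utf-8") for line in lines if line]
```

With a double-byte encoding the same byte-level split is only safe if, as the quoted text says, the parser tracks lead and trail bytes first.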
This archive was generated by hypermail 2b30 : Thu Jan 09 2003 - 16:31:35 PST