> > Well, isn't UTF-8 a kind of DBCS encoding? And have you followed the
> > limited acceptance Unicode receives in Japan? The problem is
> > statements like yours, IMHO. If I were Japanese, I wouldn't like to
> > read that the encoding I need to use to make things work is "evil".
>
> Japanese and Chinese writing systems are evil, too :) </flamebait> ;-)
>
> Hrm, I might have gone a bit overboard there. DBCS using lead
> bytes might still be easy to use (it doesn't insert NULs, does it?).

No, it does not. From a conceptual point of view, it is very close to
UTF-8, and there are no NULs in it <relieved />. We have worked with
Japanese encodings, but from the docs I have this is much the same for
Chinese, Korean and Vietnamese, which belong to the same script family.

However, there ARE control characters (with ANSI values below 0x20) in
the stream. Nice things like CR and LF. You need to parse the
lead/trail bytes to avoid accidentally treating them as terminators
(see the scanning sketch at the end of this mail).

Well, and here the standards trouble begins: it is simply *impossible*
to write an (e.g.) RFC 3164-compatible daemon supporting Japanese.
RFC 3164 prohibits CR/LF in the message part, but we do have these byte
values in there due to DBCS. I am not sure about UTF-8, but I think (I
don't know exactly) it is kind of the same story (haven't done my
homework today ;)) -- see the UTF-8 note below.

> I was thinking more along the lines of Win32 Unicode, which I do
> believe is nothing but evil, partly from a storage/protocol point of
> view, but mostly from a programming point of view.

Is it? We now support Unicode internally in almost all apps, and things
have become much easier than when dealing with DBCS. At least you now
again have the idea that one "unit" (16 bits in Win32) is one
"character".

Sure, as soon as you hit the outside world, the fun begins ;) For
example, it is especially "helpful" that Win32 emits SJIS (or was it
JIS?) encoding by default when it translates Japanese Unicode, while
the Unix world expects EUC as input. Of course, you can use the Win32
mapping functions, but these rely heavily on Internet Explorer. So it
might be better to roll your own translator (sketches for both worlds
below). Once this is done, it is not so bad...

Is it a better experience on *nix? (Hey, this is an honest question!
I have never worked with Unicode on *nix...)

> I've been forced to deal with Unicode in the past, only to get
> tripped up by such trivial facts as "how the HELL do you store a
> Unicode string in an SQL database? -- Whoops, can't be done, unless
> you store it as a blob, and then you can't search on it".

That indeed is a big issue. You can store it in an on-the-wire format,
e.g. UTF-8 or SJIS. This makes at least the ANSI part searchable. Of
course, it needs conversion when going to/from the database. Ah, and,
yes, Microsoft SQL Server 7.x upwards has the NVARCHAR/NCHAR data
types, which are native Unicode :-). I'd like to see this in more
databases AND in a consistent (preferably standard) way...

> UTF-8 doesn't really have such problems. It can be copied/stored/etc.
> with normal string management routines, as long as you keep the
> string intact and don't truncate it. Is this also the case with DBCS
> encoding?

Yes, same story. Just make sure that you don't trip over byte values
that look like control characters but are actually the trail byte of a
double-byte character (and when you do truncate, truncate on a
character boundary -- see the last sketch below).

Rainer
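PS: the sketches referenced above. First, a minimal sketch of the
lead/trail-byte parsing, assuming the usual Shift-JIS ranges (lead
bytes 0x81-0x9F and 0xE0-0xFC); other DBCS encodings need their own
tables. The point is that a plain memchr() for a terminator byte can
stop inside a double-byte character whenever the trail byte happens to
carry the same value (the classic 0x5C "backslash" problem in
Shift-JIS):

    #include <stddef.h>

    /* Lead bytes of a Shift-JIS double-byte character. */
    static int sjis_is_lead(unsigned char b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    /*
     * Find the first real occurrence of `term` (e.g. '\n') in a
     * Shift-JIS buffer.  A byte equal to `term` sitting in the trail
     * position of a double-byte character is skipped.  Returns the
     * index of the terminator, or -1 if none is found.
     */
    long sjis_find_terminator(const unsigned char *buf, size_t len,
                              unsigned char term)
    {
        size_t i = 0;

        while (i < len) {
            if (sjis_is_lead(buf[i])) {
                i += 2;         /* skip lead byte plus its trail byte */
                continue;
            }
            if (buf[i] == term)
                return (long)i;
            i++;
        }
        return -1;
    }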
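On the UTF-8 side of the question: every byte of a multibyte UTF-8
sequence has the top bit set (lead bytes are 0xC0 and up, continuation
bytes are 0x80-0xBF), so a byte with an ASCII value such as CR or LF
can only ever be a real CR or LF. A naive byte-wise scan therefore
stays correct without any lead/trail tracking:

    #include <string.h>

    /*
     * UTF-8 never reuses ASCII byte values inside a multibyte
     * sequence, so a plain byte search is safe for terminators.
     */
    const char *utf8_find_terminator(const char *buf, size_t len,
                                     char term)
    {
        return memchr(buf, term, len);
    }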
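On the *nix question: for the SJIS-to-EUC translation you would
normally not roll your own there but use iconv(3). A sketch, assuming
the encoding names "SHIFT_JIS" and "EUC-JP" (the exact names vary
between iconv implementations):

    #include <iconv.h>
    #include <stddef.h>

    /* Convert a Shift-JIS buffer to EUC-JP; returns 0 on success,
       -1 on an invalid sequence, a too-small output buffer, or an
       unsupported encoding name. */
    int sjis_to_euc(const char *in, size_t inlen,
                    char *out, size_t outlen)
    {
        iconv_t cd = iconv_open("EUC-JP", "SHIFT_JIS");
        char *inp = (char *)in;   /* iconv's prototype is not const */
        char *outp = out;
        size_t inleft = inlen, outleft = outlen;
        int rc = 0;

        if (cd == (iconv_t)-1)
            return -1;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            rc = -1;
        iconv_close(cd);
        return rc;
    }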
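On the Win32 side, the mapping itself can stay inside kernel32:
MultiByteToWideChar()/WideCharToMultiByte() take a code page number
(932 is the Windows Shift-JIS code page); as far as I know it is the
MLang converters that pull in Internet Explorer. A sketch:

    #include <windows.h>

    /* Shift-JIS (code page 932) to UTF-16.  Returns the number of
       WCHARs written, or 0 on failure (see GetLastError()). */
    int sjis_to_utf16(const char *in, int inlen,
                      WCHAR *out, int outlen)
    {
        return MultiByteToWideChar(932, MB_ERR_INVALID_CHARS,
                                   in, inlen, out, outlen);
    }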
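And for the truncation caveat: this is where UTF-8's design pays off.
Continuation bytes always match the bit pattern 10xxxxxx, so you can
cut a string to length and then just back up to the nearest character
boundary -- no scanning from the start, which is what a DBCS encoding
forces on you (a Shift-JIS trail byte is not distinguishable from a
lead byte or from ASCII without left context). A sketch:

    #include <stddef.h>

    /* Truncate a UTF-8 string to at most `max` bytes without cutting
       a multibyte character in half.  Returns the new length. */
    size_t utf8_truncate(char *s, size_t len, size_t max)
    {
        size_t cut;

        if (len <= max)
            return len;
        cut = max;
        while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
            cut--;              /* step back off continuation bytes */
        s[cut] = '\0';
        return cut;
    }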