RE: [logs] Charset selection (Was: Re: EventLog library)

rgerhardsat_private

> > Well, isn't UTF-8 a kind of DBCS encoding? And have you followed the
> > limited acceptance Unicode receives in Japan? The problem is
> > statements like yours IMHO. If I were Japanese, I wouldn't like to
> > read that the encoding I need to use to make things work is "evil".
> 
> Japanese and Chinese writing systems are evil, too :)  </flamebait>

;-)

> 
> Hrm, I might have gone a bit overboard there. DBCS using lead 
> bytes might still be easy to use (it doesn't insert NULs, does it?).

No, it does not. From a conceptual point of view, it is very close to
UTF-8. And there are no NULs in them <relief />. We have worked with
Japanese encodings, but from the docs I have this is quite the same for
Chinese, Korean and Vietnamese, which belong to the same script family.
However, there ARE control characters (with ANSI values below 0x20) in
the stream. Nice things like CR and LF. You need to parse the lead/trail
bytes to avoid accidentally parsing them as terminators.
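
To illustrate the lead/trail parsing, here is a minimal Python sketch. It
assumes Shift_JIS with the usual lead-byte ranges (0x81-0x9F and
0xE0-0xFC); other DBCS encodings use different ranges, and the function
names are made up for this example. It uses the backslash (0x5C), a
well-known trail-byte collision in Shift_JIS; which byte values can
collide depends on the encoding in use.

```python
# Sketch only: assumes Shift_JIS lead-byte ranges 0x81-0x9F and 0xE0-0xFC.
# The helper names are hypothetical, not taken from any library.

def is_sjis_lead(b: int) -> bool:
    """True if b starts a two-byte Shift_JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_single_byte(data: bytes, target: int) -> list[int]:
    """Offsets where target occurs as a character of its own,
    skipping any occurrence that is really a trail byte."""
    hits = []
    i = 0
    while i < len(data):
        if is_sjis_lead(data[i]):
            i += 2  # skip the lead byte together with its trail byte
            continue
        if data[i] == target:
            hits.append(i)
        i += 1
    return hits


# Katakana "so" encodes as 0x83 0x5C; a real backslash follows it.
sample = "ソ".encode("shift_jis") + b"\\"

naive_hit = sample.find(b"\\")             # 1: lands inside the character
aware_hits = find_single_byte(sample, 0x5C)  # [2]: only the real backslash
```

The naive byte search reports a hit in the middle of a character, while
the lead-byte-aware scan only reports the genuine one.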

Well, and here the standards trouble begins: it is simply *impossible* to
write a (e.g.) RFC3164 compatible daemon supporting Japanese. RFC3164
prohibits CRLF in the message part but we do have these byte values in
there due to DBCS. I am not sure about UTF-8, but I think (I don't know
exactly) it is kind of the same story (not done my homework today ;)).
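
On the open UTF-8 question, a quick check is possible. By design, every
byte of a multi-byte UTF-8 sequence has the high bit set, so CR and LF
byte values should only show up where the text really contains them.
The Python sketch below merely spot-checks that on a sample string; the
function name and the sample are made up for the illustration.

```python
# Sketch: verify that non-ASCII characters never produce bytes < 0x80
# when encoded as UTF-8. Function name and sample text are illustrative.

def multibyte_bytes_are_high(text: str) -> bool:
    """True if every byte of every multi-byte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


sample = "ログ: 日本語のメッセージ"
encoded_sample = sample.encode("utf-8")

all_high = multibyte_bytes_are_high(sample)
has_cr_or_lf = b"\r" in encoded_sample or b"\n" in encoded_sample

print(all_high)       # True
print(has_cr_or_lf)   # False
```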

> 
> I was thinking more along the lines of Win32 Unicode, which I 
> do believe is nothing but evil, partly from a 
> storage/protocol point of view, but mostly from a programming 
> point of view. 

Is it? We now support Unicode internally in almost all apps, and things
have become much easier than when dealing with DBCS. At least, you now
again have the idea that one "unit" (16 bits in Win32) is one
"character". Sure, as soon as you hit the outside world, the fun begins
;) For example, it is especially "helpful" that Win32 emits SJIS (or was
it JIS?) encoding by default when it translates Japanese Unicode while
the Unix world expects EUC inward. Of course, you can use the Win32
mapping functions, but these rely heavily on Internet Explorer. So it
might be better to roll your own translator. Once this is done, it is
not so bad...
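
As a rough idea of what such a translator does, here is a sketch that
goes from SJIS to EUC by way of Unicode. It uses Python's built-in
"shift_jis" and "euc_jp" codecs as a stand-in, not the Win32 mapping
functions, and the function name is invented for the example.

```python
# Sketch: SJIS -> Unicode -> EUC-JP using Python's bundled codecs.
# sjis_to_euc is a hypothetical helper, not a Win32 or library API.

def sjis_to_euc(data: bytes) -> bytes:
    """Re-encode Shift_JIS bytes as EUC-JP via an intermediate str."""
    return data.decode("shift_jis").encode("euc_jp")


text = "日本語"
sjis_bytes = text.encode("shift_jis")
euc_bytes = sjis_to_euc(sjis_bytes)

# Same text, different bytes on the wire.
round_trip_ok = euc_bytes.decode("euc_jp") == text
encodings_differ = sjis_bytes != euc_bytes
```

A real translator would also have to decide what to do with bytes that
do not decode cleanly; the sketch simply lets the codec raise an error.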

Is it a better experience on *nix? (hey, this is an honest question!
Never worked with Unicode on *nix...).

> I've been forced to deal with Unicode in the past, only to get
> tripped up by such trivial facts as "how the HELL do you store a
> Unicode string in an SQL database?  -- Whoops, can't be done,
> unless you store it as a blob, and then you can't search on it".

That indeed is a big issue. You can store it in an on-the-wire format,
e.g. UTF-8 or SJIS. This will make at least the ANSI part searchable. Of
course, it needs conversion when going to/from the database.
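
A small sketch of that approach, using SQLite from Python purely as a
stand-in for whatever database is in use: the message is converted to
UTF-8 bytes on the way in, stored in a byte column, searched by its
ASCII part, and converted back on the way out. The table, column and
function names are invented for the example.

```python
import sqlite3

# Sketch: store messages as UTF-8 bytes and search on the ASCII part.
# SQLite, the "log" table and search_ascii are illustrative assumptions.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, body BLOB)")

messages = [
    "login failed: ユーザー名が無効です",
    "disk full: 空き容量がありません",
]
for message in messages:
    # Conversion on the way in: str -> UTF-8 bytes.
    conn.execute("INSERT INTO log (body) VALUES (?)", (message.encode("utf-8"),))


def search_ascii(db: sqlite3.Connection, needle: str) -> list[str]:
    """Return messages whose stored bytes contain the ASCII needle."""
    rows = db.execute(
        "SELECT body FROM log WHERE instr(body, ?) > 0 ORDER BY id",
        (needle.encode("ascii"),),
    ).fetchall()
    # Conversion on the way out: UTF-8 bytes -> str.
    return [row[0].decode("utf-8") for row in rows]


found = search_ascii(conn, "disk full")
```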

Ah, and, yes, Microsoft SQL Server 7.x upwards has the NVARCHAR/NCHAR
datatype which is native Unicode :-). I'd like to see this in more
databases AND in a consistent (preferably standard) way...

> 
> UTF-8 doesn't really have such problems.  It can be
> copied/stored/etc with normal string management routines, as long
> as you keep the string intact and don't truncate it.  Is this also
> the case with DBCS encoding?

Yes, same story. Just make sure that you don't mess up with the byte
values that look like control characters but are actually inside a
lead/trail byte.
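
On the truncation point from the quoted question: if a string does have
to be cut to a byte limit, the cut should not land inside a character.
The sketch below shows one way to do that for UTF-8, relying on the
fact that continuation bytes have the bit pattern 10xxxxxx. The
function name is made up, and a DBCS version would need a forward scan
over lead bytes instead.

```python
# Sketch: cut UTF-8 bytes to at most `limit` bytes without splitting
# a character. truncate_utf8 is a hypothetical helper.

def truncate_utf8(data: bytes, limit: int) -> bytes:
    if len(data) <= limit:
        return data
    cut = limit
    # If the first dropped byte is a continuation byte (10xxxxxx), the
    # cut is inside a character: back up to that character's first byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


encoded = "日本".encode("utf-8")       # two characters, three bytes each
shortened = truncate_utf8(encoded, 4)  # keeps only the first character
print(shortened.decode("utf-8"))       # 日
```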

Rainer
_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
http://lists.shmoo.com/mailman/listinfo/loganalysis