[Asterisk-Dev] IAX spec: Text formats and character sets
Tzafrir Cohen
tzafrir.cohen at xorcom.com
Sun May 1 06:51:09 MST 2005
Hi
[ It seems my ISP has trouble delivering my mail to this list, hence
the delay :-( ]
On Sun, May 01, 2005 at 09:14:07PM +0800, Steve Underwood wrote:
> >Do you just want to tell if they're equal, or to sort them?
> >
> >Telling if they're equal is basically simple: just compare the raw bytes.
> >One small twist: it may be required to use canonical Unicode strings (I
> >hope I use the right term here), so you first convert them to a
> >canonical form and then compare them. Or simpler: mandate all strings
> >to be in canonical form.
> >
> >Sorting is a more complicated issue if you don't like the literal order.
> >
> >
> To tell if two strings are equal (really equal, rather than just byte
> for byte the same) you must bring both copies to canonical form. Very
> little Unicode is in canonical form, so this is not a small twist. It is
> a big PITA, that must be done in every case. The process of creating
> canonical Unicode is slow, complex, and takes lots of code. I am unclear
> if insisting on canonical form is something you can really do, and if it
> is a complete answer. You need to find a linguistics expert who knows
> Unicode inside out to get a proper answer. There seems to be very little
> software that does proper comparisons at present.
I must admit I don't remember the Unicode standards very well. But I
recall there are actually three levels of canonicalization.
Anyway, if the comparison is simple, and the task of converting to
canonical form can be complicated, why not mandate a certain canonical
form? That is, caller id SHOULD be sent in canonical form. Servers may
assume that it is in that form for the sake of comparison. What would
such a formalization break?
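To make the proposal concrete, here is a rough sketch in Python (not
Asterisk code). It assumes the mandated form is NFC, which is only my
pick for illustration; the thread has not settled on a form. The names
to_wire and same_caller_id are made up for the example. The sender
normalizes once, so a receiver that trusts the mandate can compare raw
bytes; a receiver that does not trust it has to normalize both sides.

```python
import unicodedata

# Assumption: NFC is the mandated canonical form. This is an
# illustrative choice, not something the IAX spec defines.
CANONICAL_FORM = "NFC"


def to_wire(text: str) -> bytes:
    """Sender side (hypothetical helper): normalize, then encode as UTF-8."""
    return unicodedata.normalize(CANONICAL_FORM, text).encode("utf-8")


def same_caller_id(a: bytes, b: bytes) -> bool:
    """Receiver side when the sender is not trusted to have normalized."""
    na = unicodedata.normalize(CANONICAL_FORM, a.decode("utf-8"))
    nb = unicodedata.normalize(CANONICAL_FORM, b.decode("utf-8"))
    return na == nb


# "Café" written with a precomposed e-acute, and with "e" followed by
# U+0301 COMBINING ACUTE ACCENT.
composed = "Caf\u00e9"
decomposed = "Cafe\u0301"

# The raw UTF-8 bytes differ, although the text is canonically equivalent.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# If both senders follow the mandate, plain byte comparison is enough.
wire_equal = to_wire(composed) == to_wire(decomposed)
```

Here raw_equal is False and wire_equal is True, which is the whole
point of the mandate: the expensive step moves to the sender, and the
comparison on the server stays a byte compare.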
>
> Having said this, for most data processing purposes this can be skipped,
> and a byte by byte comparison used. If we just define that all text is
> UTF-8, the only complexity which is unavoidable is ensuring strings do
> not overrun buffers, while also not stopping mid-character. As others
> have shown, this is trivial. UTF-8 can be scanned forwards and backwards
> in a context free manner.
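The buffer point can be sketched in a few lines. This is an
illustration in Python rather than Asterisk code, and truncate_utf8 is
a made-up name. It assumes the input is valid UTF-8 and max_bytes is
not negative. It relies on the property mentioned above: continuation
bytes always look like 10xxxxxx, so a cut point can be checked by
looking at a single byte, with no other context.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Return at most max_bytes of data without cutting a character in half.

    Assumes data is valid UTF-8 and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a multi-byte
    # character, so step back until the cut sits on a character start.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


name = "h\u00e9llo".encode("utf-8")   # b'h\xc3\xa9llo', 6 bytes
short = truncate_utf8(name, 2)        # cut would split the 2-byte e-acute
exact = truncate_utf8(name, 3)        # cut lands on a character boundary
```

With this input, short is b'h' and exact is b'h\xc3\xa9'. Note that
this only protects encoded characters. A base letter followed by a
combining mark can still be separated, which is one more argument for
sending precomposed (canonical) text where possible.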
--
Tzafrir Cohen icq#16849755 +972-50-7952406
tzafrir.cohen at xorcom.com http://www.xorcom.com