[Asterisk-Dev] IAX spec: Text formats and character sets

Tzafrir Cohen tzafrir.cohen at xorcom.com
Sun May 1 06:51:09 MST 2005


Hi

[ It seems my ISP has trouble getting my mail to this list, hence
the delay :-( ]

On Sun, May 01, 2005 at 09:14:07PM +0800, Steve Underwood wrote:

> >Do you just want to tell if they're equal, or to sort them?
> >
> >Telling if they're equal is basically simple: just compare the raw bytes.
> >One small twist: it may be required to use canonical Unicode strings (I
> >hope I use the right term here), so you first convert them to a
> >canonical form and then compare them. Or simpler: mandate all strings
> >to be in canonical form.
> >
> >Sorting is a more complicated issue if you don't like the literal order.
> > 
> >
> To tell if two strings are equal (really equal, rather than just byte 
> for byte the same) you must bring both copies to canonical form. Very 
> little Unicode is in canonical form, so this is not a small twist. It is 
> a big PITA, that must be done in every case. The process of creating 
> canonical unicode is slow, complex, and takes lots of code. I am unclear 
> if insisting on canonical form is something you can really do, and if it 
> is a complete answer. You need to find a linguistics expert who knows 
> Unicode inside out to get a proper answer. There seems to be very little 
> software that does proper comparisons at present.

I must admit I don't remember the Unicode standards very well, but I
recall there are actually three levels of canonicalization.

Anyway, if the comparison is simple and the task of converting to
canonical form can be complicated, why not mandate a certain canonical
form? That is, caller ID SHOULD be sent in canonical form. Servers may
assume that it is in that form for the sake of comparison. What would
such a formalization break?
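
To make the trade-off concrete, here is a rough sketch of "convert to
canonical form, then compare". It is in Python purely for illustration,
since its standard library ships a normalizer; a C implementation would
need a Unicode library for the same job. Picking NFC as the mandated
form is my assumption, and same_text() and is_normalized() are
hypothetical helper names, not anything from the IAX draft.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def is_normalized(s: str, form: str = "NFC") -> bool:
    """True if s is already in the given form, i.e. normalizing changes nothing."""
    return unicodedata.normalize(form, s) == s

composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two spellings look identical on screen but differ byte for byte.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
```

If senders are required to transmit the canonical form, a receiver can
get away with the plain byte comparison, and a check like
is_normalized() would only be needed by whoever wants to police the
rule.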

> 
> Having said this, for most data processing purposes this can be skipped, 
> and a byte by byte comparison used. If we just define that all text is 
> UTF-8, the only complexity which is unavoidable is ensuring strings do 
> not overrun buffers, while also not stopping mid-character. As others 
> have shown, this is trivial. UTF-8 can be scanned forwards and backwards 
> in a context free manner.
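
The truncation point quoted above can be sketched in a few lines. This
is only an illustration under some assumptions: the input is valid
UTF-8, max_len is a non-negative byte count, and truncate_utf8() is a
hypothetical name. It relies on the fact that every continuation byte
in UTF-8 has the bit pattern 10xxxxxx, so a character boundary can be
found by scanning backwards without any other context.

```python
def truncate_utf8(data: bytes, max_len: int) -> bytes:
    """Cut data to at most max_len bytes without splitting a character."""
    if len(data) <= max_len:
        return data
    cut = max_len
    # If the first dropped byte is a continuation byte (10xxxxxx), the cut
    # falls inside a character: step back until it sits on a lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

name = "caf\u00e9".encode("utf-8")   # 5 bytes; the last character takes 2
short = truncate_utf8(name, 4)       # a 4-byte limit would split that character
```

Here the result is shorter than the limit allows, because the whole
two-byte character is dropped rather than half of it.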

-- 
Tzafrir Cohen     icq#16849755  +972-50-7952406
tzafrir.cohen at xorcom.com  http://www.xorcom.com


