[Asterisk-Dev] IAX spec: Text formats and character sets

Sat Apr 30 04:12:59 MST 2005

On Sat, Apr 30, 2005 at 06:07:16PM +0800, Steve Underwood wrote:
> Michael Giagnocavo wrote:
> 
> >>Michael Giagnocavo wrote:
> >>   
> >>
> >>>Hmm, you're right. That doesn't look bad at all.
> >>>
> >>>But... what about for comparisons and other Unicode operations? Do the
> >>>libraries available support some UTF-8 version of strcmp, strchr,
> >>>strcasecmp, etc.?
> >>>
> >>>     
> >>>
> >>Some of them are easy (strcmp, for example). Most of them are harder, 
> >>because they either need to know character boundaries, or need case 
> >>mappings (strcasecmp, for example). Any function that searches for a 
> >>'char' in a string also won't work if the character being searched for 
> >>is a multi-byte one.
> >>   
> >>
> >
> >Not even strcmp works, because you have things like combinations where you
> >can represent in Unicode a character using different code points, but it's
> >still considered the same. Say, a Latin o with an accent mark. Using wide
> >char internally solves these issues, and is most likely faster, depending
> >on the data.
> > 
> >
> Too right. Look at IBM's internationalisation classes for Unicode. It 
> takes megabytes of code to compare two strings.

Do you just want to tell if they're equal, or to sort them?

Telling if they're equal is basically simple: just compare the raw bytes.
One small twist: it may be required to use canonical Unicode strings (I
hope I use the right term here). So you first convert them to a
canonical form and then compare them. Or simpler: mandate that all
strings be in canonical form.
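
A minimal sketch of that, in Python only because its standard library
ships a normalizer. It assumes NFC as the canonical form (nothing in
this thread picks one), and utf8_equal is a made-up helper name, not
anything from the IAX spec or Asterisk:

```python
import unicodedata

def utf8_equal(a: bytes, b: bytes) -> bool:
    """Compare two UTF-8 byte strings after normalizing both to NFC."""
    # Assumption: NFC is the agreed canonical form; NFD would work as well,
    # as long as both sides use the same one.
    na = unicodedata.normalize("NFC", a.decode("utf-8"))
    nb = unicodedata.normalize("NFC", b.decode("utf-8"))
    return na.encode("utf-8") == nb.encode("utf-8")

# The Latin o with an accent mark, written two ways.
precomposed = "\u00f3".encode("utf-8")   # single code point U+00F3
decomposed = "o\u0301".encode("utf-8")   # 'o' followed by combining acute U+0301

raw_equal = precomposed == decomposed                   # raw bytes differ
canonical_equal = utf8_equal(precomposed, decomposed)   # same after NFC
```

If all strings are mandated to be in canonical form on the wire, the
normalize step disappears and the plain byte comparison is enough.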

Sorting is a more complicated issue if you don't like the literal order.
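
To show what "literal order" means here, a small Python sketch with
made-up sample words. Sorting UTF-8 strings by their raw bytes gives
the same order as sorting by code point, which is cheap but puts
uppercase before lowercase and accented letters after 'z':

```python
# Sample words are purely illustrative.
words = ["zebra", "Zebra", "apple", "éclair"]

# Literal order: compare the raw UTF-8 bytes.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Code point order: Python's default string comparison.
by_code_points = sorted(words)

# UTF-8 preserves code point order, so both lists come out the same.
same_order = by_bytes == by_code_points
```

Anything closer to dictionary order needs locale-specific collation
rules, and that is where the large tables come in.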

-- 
Tzafrir Cohen     icq#16849755  +972-50-7952406
tzafrir.cohen at xorcom.com  http://www.xorcom.com