[Asterisk-Dev] IAX spec: Text formats and character sets
Tzafrir Cohen
tzafrir.cohen at xorcom.com
Sat Apr 30 04:12:59 MST 2005
On Sat, Apr 30, 2005 at 06:07:16PM +0800, Steve Underwood wrote:
> Michael Giagnocavo wrote:
>
> >>Michael Giagnocavo wrote:
> >>
> >>
> >>>Hmm, you're right. That doesn't look bad at all.
> >>>
> >>>But... what about for comparisons and other Unicode operations? Do the
> >>>libraries available support some UTF-8 version of strcmp, strchr,
> >>>strcasecmp, etc.?
> >>>
> >>>
> >>>
> >>Some of them are easy (strcmp, for example). Most of them are harder,
> >>because they either need to know character boundaries, or need case
> >>mappings (strcasecmp, for example). Any function that searches for a
> >>'char' in a string also won't work if the character being searched for
> >>is a multi-byte one.
> >>
> >>
> >
> >Not even strcmp works, because you have things like combinations where you
> >can represent in Unicode a character using different code points, but it's
> >still considered the same. Say, a Latin o with an accent mark. Using wide
> >char internally solves these issues, and is most likely faster, depending
> >on the data.
> >
> >
> Too right. Look at IBM's internationalisation classes for Unicode. It
> takes megabytes of code to compare two strings.
Do you just want to tell if they're equal, or to sort them?
Telling if they're equal is basically simple: just compare the raw bytes.
One small twist: it may be required to use canonical Unicode strings (I
hope I use the right term here), so you first convert them to a
canonical form and then compare them. Or simpler: mandate that all strings
be in canonical form.
Sorting is a more complicated issue if you don't like the literal order.
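A rough sketch of the "convert to canonical form, then compare" idea, in
Python only because its standard unicodedata module makes it short. It
assumes NFC as the canonical form (the thread does not name one), and
canonical_equal is a hypothetical helper name, not something from IAX or
Asterisk. The last lines show what "literal order" means when sorting by
raw UTF-8 bytes.

```python
import unicodedata


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC (assumed canonical form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The same visible character, a Latin o with an acute accent, written two ways.
precomposed = "\u00f3"   # single code point: LATIN SMALL LETTER O WITH ACUTE
decomposed = "o\u0301"   # 'o' followed by COMBINING ACUTE ACCENT

# Comparing the raw UTF-8 bytes treats the two spellings as different.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# Normalizing first makes them compare equal.
norm_equal = canonical_equal(precomposed, decomposed)

print(raw_equal)   # False
print(norm_equal)  # True

# "Literal order": sort by the UTF-8 bytes of the normalized strings.
# Any non-ASCII letter sorts after every ASCII letter, which is rarely
# what a human reader expects.
words = ["zebra", "\u00e9clair", "apple"]
literal_order = sorted(
    words, key=lambda s: unicodedata.normalize("NFC", s).encode("utf-8")
)
print(literal_order)  # ['apple', 'zebra', 'éclair']
```

If the protocol mandated canonical form on the wire, the normalize calls
would move to the sender and the receiver could go back to a plain byte
comparison.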
--
Tzafrir Cohen icq#16849755 +972-50-7952406
tzafrir.cohen at xorcom.com http://www.xorcom.com