[Asterisk-Dev] IAX spec: Text formats and character sets
Steve Underwood
steveu at coppice.org
Sun May 1 06:14:07 MST 2005
Tzafrir Cohen wrote:
>On Sat, Apr 30, 2005 at 06:07:16PM +0800, Steve Underwood wrote:
>
>
>>Michael Giagnocavo wrote:
>>
>>>>Michael Giagnocavo wrote:
>>>>
>>>>>Hmm, you're right. That doesn't look bad at all.
>>>>>
>>>>>But... what about for comparisons and other Unicode operations? Do the
>>>>>libraries available support some UTF-8 version of strcmp, strchr,
>>>>>strcasecmp, etc.?
>>>>>
>>>>Some of them are easy (strcmp, for example). Most of them are harder,
>>>>because they either need to know character boundaries, or need case
>>>>mappings (strcasecmp, for example). Any function that searches for a
>>>>'char' in a string also won't work if the character being searched for
>>>>is a multi-byte one.
>>>>
>>>Not even strcmp works, because of things like combining characters: the
>>>same character can be represented in Unicode by different code point
>>>sequences, yet is still considered the same. Say, a Latin o with an
>>>accent mark. Using wide chars internally solves these issues, and is
>>>most likely faster, depending on the data.
>>>
>>Too right. Look at IBM's internationalisation classes for Unicode. It
>>takes megabytes of code to compare two strings.
>>
>>
>
>Do you just want to tell if they're equal, or to sort them?
>
>Telling if they're equal is basically simple: just compare the raw bytes.
>One small twist: it may be required to use canonical Unicode strings (I
>hope I am using the right term here), so you first convert them to a
>canonical form and then compare them. Or simpler: mandate that all strings
>be in canonical form.
>
>Sorting is a more complicated issue if you don't like the literal order.
>
>
To tell if two strings are equal (really equal, rather than just byte
for byte the same), you must bring both copies to canonical form. Very
little Unicode is in canonical form, so this is not a small twist. It is
a big PITA that must be done in every case. The process of creating
canonical Unicode is slow, complex, and takes a lot of code. I am unclear
whether insisting on canonical form is something you can really do, or
whether it is a complete answer. You need to find a linguistics expert who
knows Unicode inside out to get a proper answer. There seems to be very
little software that does proper comparisons at present.
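
To make the precomposed-versus-combining problem concrete, here is a
minimal sketch in plain C (nothing Asterisk- or IAX-specific; it assumes
libunistring is installed for the NFC normalisation step, built with
-lunistring):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <uninorm.h>    /* libunistring: u8_normalize(), UNINORM_NFC */

int main(void)
{
    /* Both strings display as a Latin e with an acute accent, but the
       bytes differ: the first is the precomposed code point U+00E9,
       the second is "e" followed by the combining accent U+0301. */
    const uint8_t pre[] = { 0xC3, 0xA9, 0x00 };
    const uint8_t dec[] = { 0x65, 0xCC, 0x81, 0x00 };
    size_t npre = 0, ndec = 0;
    uint8_t *cpre, *cdec;

    /* A raw byte compare sees two different strings. */
    printf("raw bytes: %s\n",
           strcmp((const char *) pre, (const char *) dec) == 0
               ? "equal" : "different");

    /* Bring both to canonical (NFC) form, then compare bytes again. */
    cpre = u8_normalize(UNINORM_NFC, pre, strlen((const char *) pre),
                        NULL, &npre);
    cdec = u8_normalize(UNINORM_NFC, dec, strlen((const char *) dec),
                        NULL, &ndec);
    printf("after NFC: %s\n",
           (cpre && cdec && npre == ndec && memcmp(cpre, cdec, npre) == 0)
               ? "equal" : "different");

    free(cpre);
    free(cdec);
    return 0;
}

Normalising once, at input time, as Tzafrir suggests, would keep the
comparison itself a plain memcmp().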
Having said this, for most data processing purposes this can be skipped
and a byte-by-byte comparison used. If we just define that all text is
UTF-8, the only unavoidable complexity is ensuring that strings do not
overrun their buffers while also not being cut mid-character. As others
have shown, this is trivial. UTF-8 can be scanned forwards and backwards
in a context-free manner.
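
For reference, the buffer handling meant here looks roughly like this
(the function name is just illustrative, not from the IAX spec or any
existing API):

#include <stddef.h>
#include <string.h>

/* Copy at most (size - 1) bytes of the UTF-8 string src into dst,
   always NUL terminating, and never cutting a multi-byte character
   in half.  Continuation bytes all match the bit pattern 10xxxxxx,
   so stepping backwards from the cut point until we are no longer
   on a continuation byte lands us on a character boundary. */
static void utf8_copy_truncate(char *dst, const char *src, size_t size)
{
    size_t len;

    if (size == 0)
        return;
    len = strlen(src);
    if (len > size - 1)
        len = size - 1;
    while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80)
        len--;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

The same test of the top two bits of a byte is what lets you walk a
UTF-8 string backwards without any surrounding context.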
Unicode is one of the classic botchups. They had the opportunity to
really clean things up. Instead they took all the mess from earlier
codes, and added a whole lot more. :-(
Regards,
Steve