[Asterisk-Dev] IAX spec: Text formats and character sets
Steve Underwood
steveu at coppice.org
Sun May 1 06:14:07 MST 2005
Tzafrir Cohen wrote:
>On Sat, Apr 30, 2005 at 06:07:16PM +0800, Steve Underwood wrote:
>
>
>>Michael Giagnocavo wrote:
>>
>>>>Michael Giagnocavo wrote:
>>>>
>>>>>Hmm, you're right. That doesn't look bad at all.
>>>>>
>>>>>But... what about for comparisons and other Unicode operations? Do the
>>>>>libraries available support some UTF-8 version of strcmp, strchr,
>>>>>strcasecmp, etc.?
>>>>>
>>>>Some of them are easy (strcmp, for example). Most of them are harder,
>>>>because they either need to know character boundaries, or need case
>>>>mappings (strcasecmp, for example). Any function that searches for a
>>>>'char' in a string also won't work if the character being searched for
>>>>is a multi-byte one.
>>>>
>>>Not even strcmp works, because of things like combining characters: the
>>>same character can be represented in Unicode by different code point
>>>sequences, yet is still considered the same. Say, a Latin o with an
>>>accent mark. Using wide chars internally solves these issues, and is
>>>most likely faster, depending on the data.
>>>
>>Too right. Look at IBM's internationalisation classes for Unicode. It
>>takes megabytes of code to compare two strings.
>>
>>
>
>Do you just want to tell if they're equal, or to sort them?
>
>Telling if they're equal is basically simple: just compare the raw bytes.
>One small twist: it may be required to use canonical Unicode strings (I
>hope I am using the right term here), so you first convert them to a
>canonical form and then compare them. Or simpler: mandate that all strings
>be in canonical form.
>
>Sorting is a more complicated issue if you don't like the literal order.
>
>
To tell if two strings are equal (really equal, rather than just byte
for byte the same), you must bring both copies to canonical form. Very
little Unicode is in canonical form, so this is not a small twist. It is
a big PITA that must be done in every case. The process of creating
canonical Unicode is slow, complex, and takes a lot of code. I am unclear
whether insisting on canonical form is something you can really do, or
whether it is a complete answer. You need to find a linguistics expert who
knows Unicode inside out to get a proper answer. There seems to be very
little software that does proper comparisons at present.
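
To make the precomposed-versus-combining problem concrete, here is a
minimal sketch in plain C (nothing Asterisk- or IAX-specific; it assumes
libunistring is installed for the NFC normalisation step, built with
-lunistring):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <uninorm.h>    /* libunistring: u8_normalize(), UNINORM_NFC */

int main(void)
{
    /* Both strings display as a Latin e with an acute accent, but the
       bytes differ: the first is the precomposed code point U+00E9,
       the second is "e" followed by the combining accent U+0301. */
    const uint8_t pre[] = { 0xC3, 0xA9, 0x00 };
    const uint8_t dec[] = { 0x65, 0xCC, 0x81, 0x00 };
    size_t npre = 0, ndec = 0;
    uint8_t *cpre, *cdec;

    /* A raw byte compare sees two different strings. */
    printf("raw bytes: %s\n",
           strcmp((const char *) pre, (const char *) dec) == 0
               ? "equal" : "different");

    /* Bring both to canonical (NFC) form, then compare bytes again. */
    cpre = u8_normalize(UNINORM_NFC, pre, strlen((const char *) pre),
                        NULL, &npre);
    cdec = u8_normalize(UNINORM_NFC, dec, strlen((const char *) dec),
                        NULL, &ndec);
    printf("after NFC: %s\n",
           (cpre && cdec && npre == ndec && memcmp(cpre, cdec, npre) == 0)
               ? "equal" : "different");

    free(cpre);
    free(cdec);
    return 0;
}

Normalising once, at input time, as Tzafrir suggests, would keep the
comparison itself a plain memcmp().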
Having said this, for most data processing purposes this can be skipped
and a byte-by-byte comparison used. If we just define that all text is
UTF-8, the only unavoidable complexity is ensuring that strings do not
overrun their buffers while also not being cut mid-character. As others
have shown, this is trivial. UTF-8 can be scanned forwards and backwards
in a context-free manner.
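
For reference, the buffer handling meant here looks roughly like this
(the function name is just illustrative, not from the IAX spec or any
existing API):

#include <stddef.h>
#include <string.h>

/* Copy at most (size - 1) bytes of the UTF-8 string src into dst,
   always NUL terminating, and never cutting a multi-byte character
   in half.  Continuation bytes all match the bit pattern 10xxxxxx,
   so stepping backwards from the cut point until we are no longer
   on a continuation byte lands us on a character boundary. */
static void utf8_copy_truncate(char *dst, const char *src, size_t size)
{
    size_t len;

    if (size == 0)
        return;
    len = strlen(src);
    if (len > size - 1)
        len = size - 1;
    while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80)
        len--;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

The same test of the top two bits of a byte is what lets you walk a
UTF-8 string backwards without any surrounding context.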
Unicode is one of the classic botchups. They had the opportunity to
really clean things up. Instead they took all the mess from earlier
codes, and added a whole lot more. :-(
Regards,
Steve