[Asterisk-Dev] IAX spec: Text formats and character sets

Steve Underwood steveu at coppice.org
Sun May 1 19:19:00 MST 2005


Peter Svensson wrote:

>On Sun, 1 May 2005, Steve Underwood wrote:
>
>  
>
>>To tell if two strings are equal (really equal, rather than just byte
>>for byte the same) you must bring both copies to canonical form. Very
>>little Unicode is in canonical form, so this is not a small twist. It is
>>a big PITA that must be done in every case. The process of creating
>>canonical Unicode is slow, complex, and takes lots of code. I am unclear
>>whether insisting on canonical form is something you can really do, and
>>whether it is a complete answer. You need to find a linguistics expert
>>who knows Unicode inside out to get a proper answer. There seems to be
>>very little software that does proper comparisons at present.
>>    
>>
>
>Careful, there are a few processes that are separate and need to be kept
>separate to avoid confusion.
>
>1) Each unicode codepoint (one 32-bit entity when expressed as UCS-4) can 
>   be encoded in several forms in UTF-8. This mapping is well defined.
>  
>
This is one of the few well-defined things in Unicode. A 32-bit value
maps to a very well-defined sequence of bytes in UTF-8. There is only
one valid form for each code point.
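
A minimal sketch of that mapping, just to illustrate the point (this is
not proposed spec text or Asterisk code, only the standard bit-shuffling
for the shortest-form encoding):

#include <stdint.h>
#include <stddef.h>

/* Encode one code point (a UCS-4 value) as shortest-form UTF-8.
   Returns the number of bytes written (1-4), or 0 if the value is
   not a valid Unicode scalar value.  Illustrative only. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp >= 0xD800 && cp < 0xE000)
        return 0;                   /* UTF-16 surrogates are not characters */
    if (cp < 0x80) {
        out[0] = (uint8_t) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t) (0xC0 | (cp >> 6));
        out[1] = (uint8_t) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t) (0xE0 | (cp >> 12));
        out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (uint8_t) (0xF0 | (cp >> 18));
        out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                       /* Beyond U+10FFFF */
}

Anything a decoder sees that is not the shortest form (an "over-long"
sequence) is simply invalid, which is what keeps the mapping one-to-one.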

>2) A single glyph is sometimes represented by several code points. Think 
>   dead keys on a keyboard and you are not far off. Sometimes a character 
>   exists both as a single combined mark and as a series of combining 
>   code points, e.g. 'Å' and 'A'+ring. 
>
>   To make comparisons easy both strings need to be in the same normal 
>   form. There are two main ones, NFC in which precomposed single code
>   points are used as much as possible and NFD in which characters are 
>   decomposed as far as possible. NFC is the more popular and is the W3C 
>   designated standard for the www. NFD is used on Macs.
>
>   Normalization is a well-defined process that can be performed fast by
>   existing libraries. Round-tripping between NFC and NFD is a lossless
>   process.
>
>3) Some characters are defined as ligatures, e.g. the run-together "fi".
>   Under NFKC and NFKD these are replaced by their components, "f"+"i".
>   This may make comparisons behave more in line with what is expected,
>   but it is a lossy process that is best left until the actual
>   comparison.
>  
>
A messier problem is the Indic languages, where the characters within a
word can be completely shuffled and need reordering for matching purposes.
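
To make the normalisation point above concrete: the same visible
character can arrive as two different byte sequences, and a naive byte
comparison will not see them as equal. A tiny illustration (the escapes
are just the UTF-8 encodings of U+00C5, and of U+0041 followed by
U+030A):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Å" in NFC: the precomposed code point U+00C5 */
    const char nfc[] = "\xC3\x85";
    /* "Å" in NFD: 'A' (U+0041) + COMBINING RING ABOVE (U+030A) */
    const char nfd[] = "A\xCC\x8A";

    /* Both render as the same glyph, but a byte comparison fails,
       which is why both sides have to agree on a normal form (or
       live with the occasional mismatch). */
    printf("byte-for-byte equal: %s\n",
           strcmp(nfc, nfd) == 0 ? "yes" : "no");
    return 0;
}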

>4) A lot of nearly identical glyphs can be created. A programmer has to
>   decide what is close enough to count as a match. Unless the "streams
>   of normalized code points are equal" notion is used, a lot of
>   locale-specific information is needed.
>  
>
The East Asian languages have a lot of these.

>5) Sorting is a big can of worms since no two languages agree on the
>   sorting order of glyphs.
>  
>
True, but sorting is probably beyond what anyone cares about here. 
Looking for matches is the important issue in telephony.

>>Having said this, for most data processing purposes this can be skipped,
>>and a byte-by-byte comparison used. If we just define that all text is
>>UTF-8, the only unavoidable complexity is ensuring strings do not
>>overrun buffers, while also not stopping mid-character. As others have
>>shown, this is trivial. UTF-8 can be scanned forwards and backwards in
>>a context-free manner.
>>    
>>
>
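
The "do not stop mid-character" check quoted above really is trivial:
continuation bytes all have the bit pattern 10xxxxxx, so a truncation
point only needs to back up until it is no longer sitting on one of
them. A minimal sketch, assuming valid UTF-8 input (the function name
is made up; this is not Asterisk code):

#include <stddef.h>
#include <string.h>

/* Copy at most (size - 1) bytes of a UTF-8 string into dst and NUL
   terminate it, making sure the copy does not end in the middle of a
   multi-byte sequence.  Illustrative only. */
static void utf8_truncate_copy(char *dst, const char *src, size_t size)
{
    size_t len;

    if (size == 0)
        return;
    len = strlen(src);
    if (len >= size)
        len = size - 1;
    /* Continuation bytes look like 10xxxxxx (0x80-0xBF).  If the cut
       point lands on one, back up to the start of that character and
       drop the now incomplete character entirely. */
    while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80)
        len--;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

The same back-up-over-continuation-bytes trick is what lets UTF-8 be
walked backwards as easily as forwards, which is the context-free
property mentioned above.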
>Well, you can find the code point boundaries easily enough. However, given
>the presence of combining characters, things are a bit more difficult.
>That said, tables and fast libraries for these functions already exist.
>  
>
I think the main issue in telephony is recognising callers. The same
caller ID will appear each time the same person calls. Therefore, I
think a pure byte-by-byte match of the strings will be sufficient for
90+% of comparisons.

>>Unicode is one of the classic botchups. They had the opportunity to 
>>really clean things up. Instead they took all the mess from earlier 
>>codes, and added a whole lot more. :-(
>>    
>>
>
>The main problem is that we in the west are used to manipulating a small
>set of characters, ignoring their higher meaning (words and sentences). 
>This became impossible with languages where the distinction between words 
>and glyphs is smaller. 
>  
>
You mean "I in the west". I'm not in the west, and I'm used to 
manipulating Chinese on computers. :-) I think most non-Romance 
languages were botched in Unicode, though. Each time I hear about the 
needs of a new language, it seems Unicode does a poor job of handling 
it. Deciding that UCS-2 made sense was a western imposition on a world 
with rather more than 64K characters, and it eventually backfired.

I think you have a very relaxed view of what the word "fast" means. :-)

Regards,
Steve



