[Asterisk-Dev] IAX spec: Text formats and character sets
Peter Svensson
psvasterisk at psv.nu
Sun May 1 11:29:48 MST 2005
On Sun, 1 May 2005, Steve Underwood wrote:
> To tell if two strings are equal (really equal, rather than just byte
> for byte the same) you must bring both copies to canonical form. Very
> little Unicode is in canonical form, so this is not a small twist. It is
> a big PITA, that must be done in every case. The process of creating
> canonical unicode is slow, complex, and takes lots of code. I am unclear
> if insisting on canonical form is something you can really do, and if it
> is a complete answer. You need to find a linguistics expert who knows
> Unicode inside out to get a proper answer. There seems to be very little
> software that does proper comparisons at present.
Careful, there are a few separate processes here that need to be kept
apart to avoid confusion.
1) Each Unicode code point (one 32-bit entity when expressed as UCS-4)
can in principle be represented by several different byte sequences in
UTF-8, but only the shortest form is valid; the mapping is well defined.
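As a quick illustration (Python here just for brevity; the same check is
straightforward in C), a decoder has to reject the non-shortest
"overlong" forms:

    # U+00C5 has exactly one valid UTF-8 encoding; an overlong
    # sequence such as 0xC0 0xAF (an overlong '/') must be rejected.
    s = "\u00c5"
    print(s.encode("utf-8"))          # b'\xc3\x85'
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected:", err)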
2) A single glyph is sometimes represented by several code points. Think
of dead keys on a keyboard and you are not far off. Some characters
exist both as a single precomposed code point and as a base character
followed by combining code points, e.g. 'Å' and 'A'+ring.
To make comparisons easy, both strings need to be in the same normal
form. There are two main ones: NFC, in which precomposed single code
points are used as much as possible, and NFD, in which characters are
decomposed as far as possible. NFC is the more popular and is the form
W3C recommends for the web; NFD is used on Macs.
Normalization is a well-defined process that existing libraries can
perform quickly, and round-tripping between NFC and NFD is lossless.
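For illustration, here is the 'Å' example in Python (its unicodedata
module implements the standard normalization forms; an ICU-based
library gives the same results):

    import unicodedata

    composed   = "\u00c5"    # precomposed 'Å'
    decomposed = "A\u030a"   # 'A' + COMBINING RING ABOVE
    print(composed == decomposed)                                 # False
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True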
3) Some characters are defined as ligatures, e.g. the run-together "fi".
Under NFKC and NFKD these are replaced by their components, "f"+"i".
This may make comparisons behave more in line with what users expect,
but it is a lossy process that is best deferred until the actual
comparison.
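Again just as a sketch, the "fi" ligature in Python:

    import unicodedata

    lig = "\ufb01"                                     # LATIN SMALL LIGATURE FI
    print(unicodedata.normalize("NFC", lig) == lig)    # True - NFC leaves it alone
    print(unicodedata.normalize("NFKC", lig))          # 'fi' - compatibility mapping
    print(unicodedata.normalize("NFKC", lig) == "fi")  # True, but the ligature is lost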
4) A lot of nearly identical glyphs can be created, and a programmer has
to decide what is close enough to count as a match. Unless the simple
notion "strings are equal if their streams of normalized code points
are equal" is used, a lot of locale-specific information is needed.
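A small example of what "nearly identical" means in practice: Latin,
Greek and Cyrillic capital A render more or less the same, yet no
normalization form maps one onto another:

    import unicodedata

    for ch in ("A", "\u0391", "\u0410"):
        print(hex(ord(ch)), unicodedata.name(ch))
    # LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA,
    # CYRILLIC CAPITAL LETTER A - visually alike, never equal by code point.
    print(unicodedata.normalize("NFC", "A") ==
          unicodedata.normalize("NFC", "\u0391"))      # False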
5) Sorting is a big can of worms, since no two languages agree on the
sorting order of glyphs.
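For example (a Python sketch; treat the locale name as an assumption
about what is installed on the system):

    import locale

    words = ["zebra", "\u00e4pple", "apple"]    # 'äpple'
    print(sorted(words))            # code-point order puts 'ä' after 'z'
    # A German collation treats 'ä' like 'a'; Swedish keeps it after 'z'.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))    # 'äpple' lands next to 'apple'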
> Having said this, for most data processing purposes this can be skipped,
> and a byte by byte comparison used. If we just define that all text is
> UTF-8, the only complexity which is unavoidable is ensuring strings do
> not overrun buffers, while also do not stop mid-character. As others
> have shown, this is trivial. UTF-8 can be scanned forwards and backwards
> in a context free manner.
Well, you can find the code point boundaries easily enough. Given the
presence of combining characters, though, finding whole-character
(grapheme) boundaries is a bit more difficult. Tables and fast libraries
for these functions already exist.
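To make the "don't stop mid-character" part concrete, here is a minimal
Python sketch (the function name is made up for the example) that trims
a UTF-8 byte string to a byte limit by scanning backwards past
continuation bytes. It keeps code points intact, but as noted above it
can still separate a base character from a following combining mark:

    def truncate_utf8(data, limit):
        # Cut to at most `limit` bytes without splitting a code point;
        # UTF-8 continuation bytes all match the bit pattern 10xxxxxx.
        if len(data) <= limit:
            return data
        cut = limit
        while cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        return data[:cut]

    s = "na\u00efve caf\u00e9".encode("utf-8")   # 'naïve café', 12 bytes
    print(truncate_utf8(s, 3).decode("utf-8"))   # 'na'  - cutting at 3 would split 'ï'
    print(truncate_utf8(s, 4).decode("utf-8"))   # 'naï' - 4 lands on a boundary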
> Unicode is one of the classic botchups. They had the opportunity to
> really clean things up. Instead they took all the mess from earlier
> codes, and added a whole lot more. :-(
The main problem is that we in the West are used to manipulating a small
set of characters while ignoring their higher-level meaning (words and
sentences). That breaks down for languages where the distinction between
words and glyphs is smaller.
Peter