[Asterisk-Dev] IAX spec: Text formats and character sets
Peter Svensson
psvasterisk at psv.nu
Sun May 1 11:29:48 MST 2005
On Sun, 1 May 2005, Steve Underwood wrote:
> To tell if two strings are equal (really equal, rather than just byte
> for byte the same) you must bring both copies to canonical form. Very
> little Unicode is in canonical form, so this is not a small twist. It is
> a big PITA, that must be done in every case. The process of creating
> canonical unicode is slow, complex, and takes lots of code. I am unclear
> if insisting on canonical form is something you can really do, and if it
> is a complete answer. You need to find a linguistics expert who knows
> Unicode inside out to get a proper answer. There seems to be very little
> software that does proper comparisons at present.
Careful, there are a few separate processes here that need to be kept
apart to avoid confusion.
1) Each Unicode code point (one 32-bit entity when expressed as UCS-4)
can in principle be represented by several different byte sequences in
UTF-8, but only the shortest form is valid; the mapping is well defined.
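As a quick illustration (Python here just for brevity; the same check is
straightforward in C), a decoder has to reject the non-shortest
"overlong" forms:

    # U+00C5 has exactly one valid UTF-8 encoding; an overlong
    # sequence such as 0xC0 0xAF (an overlong '/') must be rejected.
    s = "\u00c5"
    print(s.encode("utf-8"))          # b'\xc3\x85'
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected:", err)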
2) A single glyph is sometimes represented by several code points. Think
of dead keys on a keyboard and you are not far off. Some characters
exist both as a single precomposed code point and as a base character
followed by combining code points, e.g. 'Å' and 'A'+ring.
To make comparisons easy, both strings need to be in the same normal
form. There are two main ones: NFC, in which precomposed single code
points are used as much as possible, and NFD, in which characters are
decomposed as far as possible. NFC is the more popular and is the form
W3C recommends for the web; NFD is used on Macs.
Normalization is a well-defined process that existing libraries can
perform quickly, and round-tripping between NFC and NFD is lossless.
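For illustration, here is the 'Å' example in Python (its unicodedata
module implements the standard normalization forms; an ICU-based
library gives the same results):

    import unicodedata

    composed   = "\u00c5"    # precomposed 'Å'
    decomposed = "A\u030a"   # 'A' + COMBINING RING ABOVE
    print(composed == decomposed)                                 # False
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True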
3) Some characters are defined as ligatures, e.g. the run-together "fi".
Under NFKC and NFKD these are replaced by their components, "f"+"i".
This may make comparisons behave more in line with what users expect,
but it is a lossy process that is best deferred until the actual
comparison.
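Again just as a sketch, the "fi" ligature in Python:

    import unicodedata

    lig = "\ufb01"                                     # LATIN SMALL LIGATURE FI
    print(unicodedata.normalize("NFC", lig) == lig)    # True - NFC leaves it alone
    print(unicodedata.normalize("NFKC", lig))          # 'fi' - compatibility mapping
    print(unicodedata.normalize("NFKC", lig) == "fi")  # True, but the ligature is lost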
4) A lot of nearly identical glyphs can be created, and a programmer has
to decide what is close enough to count as a match. Unless the simple
notion "strings are equal if their streams of normalized code points
are equal" is used, a lot of locale-specific information is needed.
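A small example of what "nearly identical" means in practice: Latin,
Greek and Cyrillic capital A render more or less the same, yet no
normalization form maps one onto another:

    import unicodedata

    for ch in ("A", "\u0391", "\u0410"):
        print(hex(ord(ch)), unicodedata.name(ch))
    # LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA,
    # CYRILLIC CAPITAL LETTER A - visually alike, never equal by code point.
    print(unicodedata.normalize("NFC", "A") ==
          unicodedata.normalize("NFC", "\u0391"))      # False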
5) Sorting is a big can of worms, since no two languages agree on the
sorting order of glyphs.
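For example (a Python sketch; treat the locale name as an assumption
about what is installed on the system):

    import locale

    words = ["zebra", "\u00e4pple", "apple"]    # 'äpple'
    print(sorted(words))            # code-point order puts 'ä' after 'z'
    # A German collation treats 'ä' like 'a'; Swedish keeps it after 'z'.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))    # 'äpple' lands next to 'apple'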
> Having said this, for most data processing purposes this can be skipped,
> and a byte by byte comparison used. If we just define that all text is
> UTF-8, the only complexity which is unavoidable is ensuring strings do
> not overrun buffers, while also do not stop mid-character. As others
> have shown, this is trivial. UTF-8 can be scanned forwards and backwards
> in a context free manner.
Well, you can find the code point boundaries easily enough. Given the
presence of combining characters, though, finding whole-character
(grapheme) boundaries is a bit more difficult. Tables and fast libraries
for these functions already exist.
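To make the "don't stop mid-character" part concrete, here is a minimal
Python sketch (the function name is made up for the example) that trims
a UTF-8 byte string to a byte limit by scanning backwards past
continuation bytes. It keeps code points intact, but as noted above it
can still separate a base character from a following combining mark:

    def truncate_utf8(data, limit):
        # Cut to at most `limit` bytes without splitting a code point;
        # UTF-8 continuation bytes all match the bit pattern 10xxxxxx.
        if len(data) <= limit:
            return data
        cut = limit
        while cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        return data[:cut]

    s = "na\u00efve caf\u00e9".encode("utf-8")   # 'naïve café', 12 bytes
    print(truncate_utf8(s, 3).decode("utf-8"))   # 'na'  - cutting at 3 would split 'ï'
    print(truncate_utf8(s, 4).decode("utf-8"))   # 'naï' - 4 lands on a boundary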
> Unicode is one of the classic botchups. They had the opportunity to
> really clean things up. Instead they took all the mess from earlier
> codes, and added a whole lot more. :-(
The main problem is that we in the West are used to manipulating a small
set of characters while ignoring their higher-level meaning (words and
sentences). That breaks down for languages where the distinction between
words and glyphs is smaller.
Peter