[Asterisk-Dev] IAX spec: Text formats and character sets

Michael Giagnocavo mgg-digium at atrevido.net
Fri Apr 29 09:41:45 MST 2005


>> Well, what if you use wide chars? UTF-8 is great for a common-denominator,
>> on-the-wire format, but it's less than ideal for manipulation. With wide
>> chars you can use wcsncpy and the rest of the wc* functions.
>
>That also opens another can of worms, as they say...
>
>When we are splitting a string on '@', for example, there may be
>multiple perfectly valid UTF-8 representations of that glyph, so we have
>to determine which ones we consider to be acceptable, if not all of them.

Well, consider this:
М - U+041C Cyrillic Capital Letter Em
M - U+004D Latin Capital M

Are they "the same"? They look similar but aren't. The same goes for the at
sign. You have:
@ - U+0040 Commercial At
﹫ - U+FE6B Small Commercial At
＠ - U+FF20 Fullwidth Commercial At

I'd say that a protocol like SIP would accept only the Commercial At
(U+0040) as the separator, and that's that.
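
To make the distinction concrete, here is a small Python sketch. It is my own
illustration, not anything taken from the IAX or SIP specs, and the helper
name split_user_host is hypothetical. It splits only on U+0040 and treats the
look-alike characters as ordinary text:

```python
import unicodedata

def split_user_host(address: str):
    """Split on the first U+0040 only; look-alike at signs are not separators."""
    user, sep, host = address.partition("\u0040")
    if not sep:
        raise ValueError("no U+0040 COMMERCIAL AT in address")
    return user, host

# The three "at" characters are distinct code points with distinct names.
for ch in ("\u0040", "\uFE6B", "\uFF20"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(split_user_host("alice@example.com"))
```

An address written with U+FF20 instead of U+0040 raises ValueError here,
which is the "that's that" behaviour described above.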

Then there's stuff like combining characters, so some accented characters
really do have more than one way of producing the same character. Either it
is a single precomposed code point (like a vowel with an accent mark) or it
is a sequence of two code points: the letter, followed separately by the
combining accent. I guess you just have to hope that your library handles
that correctly.
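
As a rough illustration of the two forms, here is a Python sketch using the
standard unicodedata module. The choice of NFC as the comparison form and the
helper name same_text are my assumptions, not something any spec in this
thread mandates:

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, two code points

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # raw code point comparison
print(same_text(precomposed, decomposed))   # comparison after normalization
```

The raw comparison sees two different strings; only the normalized
comparison treats them as the same text.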

Unicode is terribly complex and I highly doubt glibc gets it right all over
the place. (Heck, even Windows has a few issues with some APIs.) But I don't
think it'd hurt us in anything SIP related. If someone is sending usernames
as Fullwidth characters, or wants, say, Kana-insensitive matching on
usernames, I think they are out of luck :).

Now, if you were referring to the idea that UTF-8 could produce U+0040
Commercial At via some byte sequence other than 0x40 (an "overlong" form
such as 0xC1 0x80), your UTF-8 library should handle that correctly and
reject such sequences as invalid.
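
For example, Python's built-in strict UTF-8 decoder refuses the two-byte
overlong form of U+0040. This is only a sketch of the behaviour you'd want
from whatever library is used; decode_strict is a hypothetical wrapper name:

```python
def decode_strict(raw: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on malformed input."""
    return raw.decode("utf-8", errors="strict")

overlong_at = b"\xc1\x80"   # two-byte overlong form of U+0040, invalid UTF-8

try:
    decode_strict(overlong_at)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

print(decode_strict(b"\x40"))   # the only valid UTF-8 encoding of U+0040
```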

See this for lots of good info:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

-Michael




