[Asterisk-Dev] IAX spec: Text formats and character sets
Michael Giagnocavo
mgg-digium at atrevido.net
Fri Apr 29 09:41:45 MST 2005
>> Well, what if you use wide chars? UTF-8 is great for a
>> common-denominator, on-the-wire format, but it's less than ideal for
>> manipulation. With wide chars you can use wcsncpy and the rest of the
>> wc* functions.
>
>That also opens another can of worms, as they say...
>
>When we are splitting a string on '@', for example, there may be
>multiple perfectly valid UTF-8 representations of that glyph, so we have
>to determine which ones we consider to be acceptable, if not all of them.
Well, consider this:
М - U+041C Cyrillic Capital Letter Em
M - U+004D Latin Capital Letter M
Are they "the same"? They look similar but aren't. Same for the at sign.
You have:
@ - U+0040 Commercial At
﹫ - U+FE6B Small Commercial At
＠ - U+FF20 Fullwidth Commercial At
I'd say that a protocol like SIP would accept the Commercial At (U+0040)
as the separator, and that's that.
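To make that concrete, here is a rough sketch of splitting on U+0040 only.
The function name find_commercial_at is made up for this example; it is not
from any SIP or IAX library. It relies on a property of UTF-8: bytes below
0x80 only ever encode ASCII characters, so a plain byte search for 0x40
cannot land in the middle of a multi-byte character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): find the first U+0040
 * COMMERCIAL AT in a NUL-terminated, well-formed UTF-8 string.
 * Returns the byte offset of the '@', or -1 if there is none.
 *
 * U+FE6B (bytes EF B9 AB) and U+FF20 (bytes EF BC A0) contain no 0x40
 * byte, so they are deliberately NOT treated as separators. */
long find_commercial_at(const char *utf8)
{
    const char *p = strchr(utf8, 0x40);
    return p ? (long)(p - utf8) : -1;
}
```

With this, "user@example.org" splits at offset 4, while a name written with
the Fullwidth or Small variant is treated as having no separator at all.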
Then there's stuff like combining characters, so some of your accented
characters really do have more than one way of producing the same
character. Either it's a single precomposed code point (some vowel with an
accent mark) or it's a sequence of two code points: the letter and,
separately, the accent. I guess you just have to hope that your library
handles it correctly.
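As a small illustration of the two forms, take "e with acute accent". The
constant and function names below are invented for this example. The point
is only that a byte-wise compare such as strcmp sees two different strings
unless something normalizes them first.

```c
#include <string.h>

/* U+00E9 LATIN SMALL LETTER E WITH ACUTE as one precomposed code point. */
const char E_ACUTE_PRECOMPOSED[] = "\xC3\xA9";

/* U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT. */
const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Byte-wise equality, which is all strcmp offers; no normalization
 * is performed here. Returns 1 if the byte sequences match, else 0. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_bytes(E_ACUTE_PRECOMPOSED, E_ACUTE_DECOMPOSED) returns 0 even
though both render as the same character, so matching them needs a
normalization step from whatever Unicode library is in use.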
Unicode is terribly complex and I highly doubt glibc gets it right all
over the place. (Heck, even Windows has a few issues with some APIs.) But I
don't think it'd hurt us in anything SIP related. If someone is sending
usernames as Fullwidth characters, or wants, say, Kana-insensitive matching
on usernames, I think they are out of luck :).
Now, if you meant that UTF-8 could produce U+0040 Commercial At via some
byte sequence other than 0x40 (an "overlong" encoding), your UTF-8 library
should handle that correctly and reject such sequences.
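For reference, the overlong two-byte form of U+0040 would be C1 80. Below is
a minimal sketch of the kind of check a decoder is expected to make. The
name utf8_is_valid is made up for this example and is not taken from glibc
or any other library; a real library should be preferred where one is
available.

```c
#include <stddef.h>

/* Hypothetical validator (illustration only).
 * Returns 1 if s[0..len) is well-formed UTF-8, else 0.
 * Rejects overlong forms, surrogates, and values above U+10FFFF. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0, k, need;
    unsigned long cp, min;

    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) {                 /* plain ASCII, one byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or bad lead */
        }

        if (len - i - 1 < need)
            return 0;                   /* truncated sequence */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                   /* overlong, e.g. C1 80 for '@' */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                   /* out of range or surrogate */

        i += need + 1;
    }
    return 1;
}
```

With this check in front of the parser, the only byte sequence that can
ever decode to U+0040 is the single byte 0x40.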
See this for lots of good info:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
-Michael