[Asterisk-Dev] IAX spec: Text formats and character sets

Kristian Nielsen kn at sifira.dk
Fri Apr 29 08:15:08 MST 2005


"Kevin P. Fleming" <kpfleming at digium.com> writes:

> Kristian Nielsen wrote:
> 
> > Well, it is easy to implement our own strncpy_utf8() that copies only up
> > to and including the last utf-8 character not going over the maximum
> > specified byte length. Then we could also fix it to actually
> > zero-terminate the copy (strncpy() doesn't always zero-terminate the
> > destination as I am _sure_ everyone remembers :-).
> 
> I think 'easy' is an overstatement here. Any function that does this
> needs to understand the _entire_ UTF-8 space to know which characters
> are multibyte, and how many bytes they take up. This is not trivial,
> although it's also not very complicated... just some tables and
> keeping track of where you are so you can backtrack if needed.

It is much easier than that because of some of the nice properties of
utf-8. All single byte characters are in the interval 0x00-0x7F, and all
multibyte characters start with a byte in the interval 0xC0-0xFF and
continue with bytes in the interval 0x80-0xBF.
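
A minimal sketch of those three byte classes in C. The helper names are
made up for illustration and are not existing Asterisk functions. Note
that the lead-byte interval is the broad one given above; a few values
in it (0xC0, 0xC1 and 0xF5-0xFF) never occur in valid utf-8, which does
not matter for finding character boundaries.

```c
#include <stdbool.h>

/* Hypothetical helpers: classify one byte of a utf-8 string. */

/* Single-byte (ASCII) character: 0x00-0x7F. */
static bool utf8_is_single(unsigned char b)
{
    return b <= 0x7F;
}

/* First byte of a multibyte character: 0xC0-0xFF. */
static bool utf8_is_lead(unsigned char b)
{
    return b >= 0xC0;
}

/* Second or later byte of a multibyte character: 0x80-0xBF,
 * i.e. the top two bits are 10. */
static bool utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Every byte falls into exactly one of the three classes, so a single
mask-and-compare is enough to tell whether a given position is the start
of a character.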

Thus if you are about to truncate a UTF-8 string, you should check the
first byte that is dropped. If this is in the interval 0x80-0xBF, you
are in the middle of a multibyte character and must backtrack until you
have also dropped its first byte (the one in the interval 0xC0-0xFF).

Thus if we need to truncate this after 7 bytes:

  41 42 43 D0 9B F0 A3 A5
                      ^
we note that the next byte is A5, so we backtrack until we get past F0,
and the result is:

  41 42 43 D0 9B

This will ensure correct utf-8, provided the original was valid utf-8.
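
The backtracking rule above could be sketched in C roughly like this.
The name strncpy_utf8() and its signature are only a suggestion, the
size argument is assumed to count the whole destination buffer including
the terminating zero, and the source is assumed to be valid,
zero-terminated utf-8.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical strncpy_utf8(): copy src into dst, which has room for
 * dstsize bytes including the terminating zero. If src does not fit,
 * cut it at a character boundary. Always zero-terminates (when
 * dstsize > 0) and never pads. Returns the number of bytes copied,
 * not counting the terminator.
 */
static size_t strncpy_utf8(char *dst, const char *src, size_t dstsize)
{
    size_t n = 0;

    if (dstsize == 0)
        return 0;

    /* Find how many bytes fit, reading at most dstsize bytes of src. */
    while (n < dstsize - 1 && src[n] != '\0')
        n++;

    /*
     * src[n] is now either the terminator or the first dropped byte.
     * If it is a continuation byte (0x80-0xBF), back up until the
     * first dropped byte is the lead byte of that character.
     */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With the example bytes above and an 8-byte destination (7 bytes plus the
terminator), the loop steps back over A5 and A3, stops at F0, and copies
the five bytes 41 42 43 D0 9B.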

> The bigger issue is the performance hit this function will cause... if
> we do it at all, it will have to be compile-time selectable as to
> whether it uses raw strncpy() or utf8strncpy().

There should be little performance hit: just a loop over a few bytes in
the rare case where the string is truncated. In fact I would think we
would see a big performance gain from using our own function instead of
strncpy(), since we could use something that only copies as many bytes
as are in the source string. strncpy() pads the destination with zeros,
so strncpy(dest, "foo", 100) will write 100 bytes (three characters and
97 zeros), not just the three characters in "foo".
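
A small way to see both strncpy() behaviours for yourself, as a sketch.
The helper name and the '#' marker are arbitrary choices for this
illustration, and the source string is assumed not to contain '#'.

```c
#include <stddef.h>
#include <string.h>

#define MARK '#'   /* arbitrary fill byte, assumed absent from src */

/*
 * Hypothetical helper: pre-fill a buffer with MARK, run
 * strncpy(dest, src, n) and count how many bytes were overwritten.
 */
static size_t bytes_written_by_strncpy(const char *src, size_t n)
{
    char dest[256];
    size_t i, written = 0;

    if (n > sizeof dest)
        n = sizeof dest;

    memset(dest, MARK, sizeof dest);
    strncpy(dest, src, n);

    for (i = 0; i < sizeof dest; i++)
        if (dest[i] != MARK)
            written++;
    return written;
}
```

bytes_written_by_strncpy("foo", 100) returns 100, which shows the
padding. bytes_written_by_strncpy("foobar", 3) returns 3: only "foo" is
written and no terminating zero follows it.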

 - Kristian.

-- 
Kristian Nielsen   kn at sifira.dk
Development Manager, Sifira A/S



