Though the distinction between graphemes and glyphs is relatively easy to make for
English, it can be very difficult and occasionally political for Han characters (the
ideographs common to CJKV languages).†
Unicode Transformation Formats
One of the key factors driving the adoption of Unicode is UTF-8 (8-bit Unicode
Transformation Format). UTF-8 has several clever features (some would call them
compromises) that make it attractive to those who are used to working with ASCII or
Latin-1 text:
* In this chapter, I use grapheme and character synonymously.
Figure 8-1. Alternative glyphs representing the "a" grapheme
† See http://en.wikipedia.org/wiki/Han_unification for one aspect of this situation.
240 | Chapter 8: i18n and L10n
• In UTF-8, text that uses only standard ASCII characters is byte-for-byte identical
to its ASCII encoding. Every byte in the encoding of a code point above U+007F
is a high-ASCII byte (one whose most significant bit is 1).
• Because of this, a UTF-8 encoded string never contains the null byte (0x00),
except as the encoding of the code point U+0000.
• UTF-8 is somewhat self-synchronizing, which makes it resilient to error. Each
type of byte in UTF-8 (single-byte character, first byte of a multibyte character,
and subsequent bytes of a multibyte character) can be distinguished by the prefix
formed by its high-order bits. Therefore, you can start at any byte offset in a
string and find the start of the next character without working backward.
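
The properties listed above can be checked with a short script. The following is a minimal Python sketch, not code from this chapter: the helper names byte_kind and next_char_start and the sample string are hypothetical, chosen only for illustration, and the classifier assumes well-formed UTF-8 input.

```python
def byte_kind(b: int) -> str:
    """Classify one UTF-8 byte by its high-order bits (hypothetical helper).

    Assumes well-formed UTF-8; bytes that can never occur in valid UTF-8
    (such as 0xFF) are not detected here.
    """
    if b & 0x80 == 0x00:   # 0xxxxxxx: single-byte (ASCII) character
        return "single"
    if b & 0xC0 == 0x80:   # 10xxxxxx: subsequent byte of a multibyte character
        return "continuation"
    return "lead"          # 11xxxxxx: first byte of a multibyte character


def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos.

    Only forward scanning is needed: continuation bytes are skipped until a
    single-byte character, a lead byte, or the end of the data is reached.
    """
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos


# Sample text: "na\u00efve" plus two Han characters (U+6F22, U+5B57),
# written with escapes so the source file itself stays pure ASCII.
text = "na\u00efve \u6f22\u5b57"
encoded = text.encode("utf-8")

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# Every byte encoding a code point above U+007F has its high bit set.
assert all(b & 0x80
           for ch in text if ord(ch) > 0x7F
           for b in ch.encode("utf-8"))

# The null byte appears only as the encoding of U+0000.
assert 0x00 not in encoded
assert "\u0000".encode("utf-8") == b"\x00"

# Starting in the middle of the first Han character, scan forward to the
# next character boundary and decode from there.
start = next_char_start(encoded, 8)
print(start, encoded[start:].decode("utf-8"))
```

Offset 8 in the sample falls on the second byte of a three-byte character, so the scan skips two continuation bytes and resumes at the following lead byte. A production decoder would also need to reject malformed sequences, which this sketch does not attempt.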