
Brad Ediger

"Advanced Rails"


Though the distinction between graphemes and glyphs is relatively easy to make for
English, it can be very difficult and occasionally political for Han characters (the
ideographs common to CJKV languages).†
Unicode Transformation Formats
One of the key factors driving the adoption of Unicode is UTF-8 (8-bit Unicode
Transformation Format). UTF-8 has several clever features (some would call them
compromises) that make it attractive to those who are used to working with ASCII or
Latin-1 text:
* In this chapter, I use grapheme and character synonymously.
Figure 8-1. Alternative glyphs representing the "a" grapheme
† See http://en.wikipedia.org/wiki/Han_unification for one aspect of this situation.
240 | Chapter 8: i18n and L10n
• In UTF-8, text that uses only standard ASCII characters is byte-for-byte identical
to its ASCII encoding. For every code point above U+007F, each byte of the
encoding has its most significant bit set to 1, so none of those bytes can be
mistaken for an ASCII character.
• Because of this, a UTF-8 encoded string will never contain the null byte (0x00),
except as the encoding of the code point U+0000.
• UTF-8 is somewhat self-synchronizing, which makes it resilient to error. Each
type of byte in UTF-8 (single-byte character, first byte of a multibyte character,
and subsequent bytes of a multibyte character) can be distinguished by its prefix.
Therefore, you can start at any byte position in a string and find the next character
without working backward.
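
The three properties above can be checked with a few lines of Ruby. This is a minimal sketch rather than anything from Rails itself: it assumes Ruby 1.9 or later (for String#encode and String#bytes) and well-formed UTF-8 input, and the helper names utf8_byte_kind and next_char_boundary are hypothetical, introduced only for illustration.

```ruby
# Classify a byte by its prefix bits (assumes well-formed UTF-8 input).
def utf8_byte_kind(byte)
  if (byte & 0x80) == 0x00
    :single        # 0xxxxxxx: an ASCII character
  elsif (byte & 0xC0) == 0x80
    :continuation  # 10xxxxxx: second or later byte of a multibyte character
  else
    :lead          # 11xxxxxx: first byte of a multibyte character
  end
end

# Starting at an arbitrary byte offset, scan forward (never backward)
# to the offset where the next character begins.
def next_char_boundary(bytes, offset)
  offset += 1 while offset < bytes.length &&
                    utf8_byte_kind(bytes[offset]) == :continuation
  offset
end

# Property 1: ASCII-only text has the same bytes in both encodings.
ascii_same = "Rails".encode("US-ASCII").bytes.to_a ==
             "Rails".encode("UTF-8").bytes.to_a

# Property 2: every byte of a non-ASCII code point has its high bit set,
# so 0x00 cannot appear inside a multibyte sequence.
e_acute_bytes = "\u00E9".encode("UTF-8").bytes.to_a  # U+00E9 is 0xC3 0xA9
high_bits_set = e_acute_bytes.all? { |b| (b & 0x80) != 0 }

# Property 3: offset 3 lands in the middle of the two-byte "ï" (0xC3 0xAF);
# scanning forward finds "v" at offset 4.
bytes = "na\u00EFve".encode("UTF-8").bytes.to_a
resync = next_char_boundary(bytes, 3)

puts ascii_same, high_bits_set, resync
```

The classifier treats any byte of the form 11xxxxxx as a lead byte. A stricter validator would also reject bytes that can never occur in UTF-8 (such as 0xFE and 0xFF), which this sketch does not attempt.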

