
Brad Ediger

"Advanced Rails"


Figure 8-2. The "fi" sequence shown without a ligature and with a ligature
244 | Chapter 8: i18n and L10n
To canonicalize sequences of code points, we must first determine what our notion
of equivalence is. Unicode defines two types of equivalence: the narrow canonical
equivalence and the broader compatibility equivalence. Canonical equivalence is limited
to characters that are equal in both form and function. The standard example
is the decomposed ö (the two code points o and ¨, the combining diaeresis) versus the precomposed character
ö (one code point). Two sequences of code points, such as those, that are
canonically equivalent are identical in appearance and usage, and can in nearly all
cases be substituted for each other.
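As a minimal sketch of canonical equivalence, the following compares the two spellings of ö. It assumes Ruby 2.2 or later, where String#unicode_normalize is part of the core library; that method is an assumption of this example and is not the API discussed in this chapter. The variable names are illustrative only.

```ruby
# Assumption: Ruby 2.2+ (String#unicode_normalize is built in).
decomposed  = "o\u0308"  # U+006F followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00F6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

# The raw strings differ, because they hold different code points.
raw_equal = (decomposed == precomposed)

# After normalizing both to the same form, they compare as equal.
nfc_equal = (decomposed.unicode_normalize(:nfc) == precomposed.unicode_normalize(:nfc))
nfd_equal = (decomposed.unicode_normalize(:nfd) == precomposed.unicode_normalize(:nfd))

puts "raw: #{raw_equal}, NFC: #{nfc_equal}, NFD: #{nfd_equal}"
```

Comparing strings only after normalizing both sides to one form is what makes the two spellings interchangeable in practice.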
Compatibility equivalence is a broader concept. Compatibility equivalence includes
all canonically equivalent characters, plus characters that may have different semantics
but are rendered similarly. Examples include the characters f and i versus the ﬁ
ligature, or the superscript ² versus the ordinary numeral 2.
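The difference between the two kinds of equivalence can be sketched with the same two examples. This again assumes Ruby 2.2 or later and its built-in String#unicode_normalize, which is an assumption of this example and not something the text prescribes. The canonical form leaves the ligature and the superscript alone, while the compatibility form folds them into ordinary characters.

```ruby
# Assumption: Ruby 2.2+ (String#unicode_normalize is built in).
ligature    = "\uFB01"  # U+FB01 LATIN SMALL LIGATURE FI
superscript = "\u00B2"  # U+00B2 SUPERSCRIPT TWO

# Canonical normalization (NFC) does not touch these characters,
# because they are not canonically equivalent to "fi" or "2".
ligature_nfc    = ligature.unicode_normalize(:nfc)
superscript_nfc = superscript.unicode_normalize(:nfc)

# Compatibility normalization (NFKC) replaces them with their
# similar-looking plain counterparts.
ligature_nfkc    = ligature.unicode_normalize(:nfkc)
superscript_nfkc = superscript.unicode_normalize(:nfkc)

puts ligature_nfc == ligature      # canonical form keeps the ligature
puts ligature_nfkc == "fi"         # compatibility form splits it into f and i
puts superscript_nfkc == "2"       # compatibility form drops the superscript
```

Because the compatibility forms discard distinctions such as superscripting, they suit searching and comparison better than storing text that must round-trip unchanged.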
There are four methods of Unicode normalization: D, C, KD, and KC. (They are also
referred to as NFD, NFC, NFKD, and NFKC, with NF standing for Normalization
Form.) The D forms leave the string in a decomposed form, while the C forms leave
the string canonically composed (by first decomposing, and then recomposing by
canonical equivalence).
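To see the four forms side by side, the sketch below normalizes one sample string (the fi ligature followed by a precomposed ö) and lists the resulting code points. It assumes Ruby 2.2 or later with the built-in String#unicode_normalize; the helper code_points and the sample string are hypothetical names introduced only for illustration.

```ruby
# Assumption: Ruby 2.2+ (String#unicode_normalize is built in).
sample = "\uFB01\u00F6"  # fi ligature + precomposed o-with-diaeresis

# Hypothetical helper: render a string as a list of U+XXXX code points.
def code_points(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

# Normalize the sample under each of the four forms.
forms = [:nfd, :nfc, :nfkd, :nfkc].each_with_object({}) do |form, result|
  result[form] = code_points(sample.unicode_normalize(form))
end

forms.each do |form, cps|
  puts "#{form.to_s.upcase.ljust(4)} #{cps.join(' ')}"
end
# NFD  U+FB01 U+006F U+0308
# NFC  U+FB01 U+00F6
# NFKD U+0066 U+0069 U+006F U+0308
# NFKC U+0066 U+0069 U+00F6
```

The D forms leave ö as two code points and the C forms recompose it into one; only the K forms replace the ligature with f and i.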

