Figure 8-2. The "fi" sequence shown without a ligature and with a ligature
To canonicalize sequences of code points, we must first determine what our notion
of equivalence is. Unicode defines two types of equivalence: the narrow canonical
equivalence and the broader compatibility equivalence. Canonical equivalence is limited
to characters that are equal in both form and function. The standard example is
the decomposed ö (the two code points o and the combining diaeresis, U+0308) versus the
precomposed character ö (one code point, U+00F6). Two sequences of code points, such as those, that are
canonically equivalent are identical in appearance and usage, and can in nearly all
cases be substituted for each other.
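As a minimal sketch of canonical equivalence, the following uses Python's standard `unicodedata` module. The choice of Python is an assumption made for illustration; the text does not prescribe a language. The snippet shows that the two spellings of ö differ as raw code point sequences but compare equal once both are normalized to a common form.

```python
import unicodedata

decomposed = "o\u0308"   # o followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00f6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

# The raw code point sequences differ, so a plain comparison fails.
raw_equal = decomposed == precomposed

# Normalizing both strings to the same canonical form makes them comparable.
composed_equal = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)
decomposed_equal = (
    unicodedata.normalize("NFD", decomposed)
    == unicodedata.normalize("NFD", precomposed)
)

print(raw_equal, composed_equal, decomposed_equal)
```

Which form to normalize to matters less than applying the same one to both sides of the comparison.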
Compatibility equivalence is a broader concept. Compatibility equivalence includes
all canonically equivalent characters, plus characters that may have different semantics
but are rendered similarly. Examples include the characters f and i versus the ﬁ
ligature, or the superscript 2 versus the ordinary numeral 2.
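The difference between the two kinds of equivalence can be sketched with Python's standard `unicodedata` module (again an illustrative assumption, not something the text specifies). Canonical normalization leaves the ligature and the superscript untouched, while compatibility normalization folds them into their ordinary counterparts.

```python
import unicodedata

ligature = "\ufb01"         # U+FB01 LATIN SMALL LIGATURE FI
superscript_two = "\u00b2"  # U+00B2 SUPERSCRIPT TWO

# Canonical normalization: neither character has a canonical decomposition,
# so both come back unchanged.
canonical_ligature = unicodedata.normalize("NFC", ligature)
canonical_two = unicodedata.normalize("NFC", superscript_two)

# Compatibility normalization: both are replaced by their plain equivalents.
compat_ligature = unicodedata.normalize("NFKC", ligature)
compat_two = unicodedata.normalize("NFKC", superscript_two)

print(repr(canonical_ligature), repr(compat_ligature))
print(repr(canonical_two), repr(compat_two))
```

Because compatibility normalization discards distinctions such as "this 2 is a superscript", it is better suited to searching and comparison than to text that will be displayed back to the user.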
There are four methods of Unicode normalization: D, C, KD, and KC. (They are also
referred to as NFD, NFC, NFKD, and NFKC, with NF standing for Normalization
Form.) The D forms leave the string in a decomposed form, while the C forms leave
the string canonically composed (by first decomposing, and then recomposing by
canonical equivalence).
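To make the four forms concrete, here is a small sketch that applies each of them to the same two-character string, using Python's standard `unicodedata.normalize`. The sample string and the `code_points` helper are hypothetical, chosen only for illustration.

```python
import unicodedata

def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Precomposed ö (U+00F6) followed by the fi ligature (U+FB01).
sample = "\u00f6\ufb01"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFD", "NFC", "NFKD", "NFKC")
}

for name, value in forms.items():
    print(f"{name:5} {code_points(value)}")
```

The D forms split ö into o plus the combining diaeresis, the C forms keep it (or put it back) as a single code point, and only the K forms replace the ligature with the separate letters f and i.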