The K forms decompose by compatibility equivalence, while
those without a K decompose by canonical equivalence. (All composition is done
under canonical equivalence to ensure a consistent composition.)
ActiveSupport provides methods on the UTF-8 handler for Unicode normalization,
supporting all four forms. The following code shows the differences between the four
forms as applied to the string ﬁnal piñata. The first word includes the ﬁ ligature,
which is compatibility equivalent (but not canonically equivalent) to the separated
characters fi. The second word includes the character ñ, which is both compatibility
equivalent and canonically equivalent to the code points n and U+0303 (combining tilde).
$KCODE = 'u'
str = "ﬁnal piñata".chars
str.normalize(:d).to_s  # => "ﬁnal piñata" (ñ decomposed to n + U+0303)
str.normalize(:c).to_s  # => "ﬁnal piñata"
str.normalize(:kd).to_s # => "final piñata" (ñ decomposed to n + U+0303)
str.normalize(:kc).to_s # => "final piñata"
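
Because the decomposed and precomposed forms usually render identically, it can help to look at the code points. The following is a minimal sketch that assumes a current Ruby (2.2 or later), where String#unicode_normalize is built in. It does not use the ActiveSupport handler shown above, and normalization_forms is a hypothetical helper name introduced only for this illustration.

```ruby
# Sketch with Ruby's built-in String#unicode_normalize (Ruby 2.2+).
# normalization_forms is a hypothetical helper, not an ActiveSupport method.
def normalization_forms(str)
  {
    d:  str.unicode_normalize(:nfd),
    c:  str.unicode_normalize(:nfc),
    kd: str.unicode_normalize(:nfkd),
    kc: str.unicode_normalize(:nfkc)
  }
end

# The sample string written with escapes:
# U+FB01 is the fi ligature, U+00F1 is the precomposed n-with-tilde.
forms = normalization_forms("\uFB01nal pi\u00F1ata")

forms.each do |name, s|
  # Print length and code points so the otherwise invisible differences show up.
  hex = s.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts format("%-2s %2d chars: %s", name, s.length, hex)
end
```

Only the K forms replace the ligature U+FB01 with the two letters f and i, and only the D forms split U+00F1 into n followed by U+0303, so the four results have 12, 11, 13, and 12 characters respectively.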
Filtering UTF-8 Input
Although you may be UTF-8 clean through your entire system (UTF-8 text can be
entered anywhere and is displayed identically upon output), you are still at risk of
problems if you just accept user-provided strings as UTF-8. Users can provide invalid
UTF-8 text (not all byte sequences correspond to valid sequences of UTF-8 code
points). Users will even provide maliciously malformed UTF-8 text in an attempt to
crash or exploit your string-processing functions.
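
As a rough illustration of checking input before trusting it, the sketch below assumes a current Ruby (2.1 or later), where String#valid_encoding? and String#scrub are available. The helper names valid_utf8? and scrub_utf8 are hypothetical, and whether to reject or repair bad input is a policy decision this sketch does not settle.

```ruby
# Hypothetical helpers for vetting user-supplied bytes as UTF-8.

# Returns true only if the bytes form a well-formed UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Returns a UTF-8 string with each malformed sequence replaced by
# U+FFFD (the Unicode replacement character) instead of raising.
def scrub_utf8(bytes, replacement = "\uFFFD")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

good = "pi\u00F1ata"   # well-formed UTF-8
bad  = "abc\xFF".b     # 0xFF can never appear in UTF-8
long = "\xC0\xAF".b    # overlong encoding of "/", a classic filter-evasion trick

puts valid_utf8?(good)         # well-formed input passes
puts valid_utf8?(bad)          # a stray 0xFF byte fails
puts valid_utf8?(long)         # overlong forms fail as well
puts scrub_utf8(bad).inspect   # bad byte replaced by U+FFFD
```

Rejecting invalid input outright is the stricter choice. Scrubbing keeps the request alive, but it alters what the user sent, so it is best applied once at the boundary rather than in every string-processing function.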