
Brad Ediger

"Advanced Rails"

The K forms decompose by compatibility equivalence, while
those without a K decompose by canonical equivalence. (All composition is done
under canonical equivalence to ensure a consistent composition.)
ActiveSupport provides methods on the UTF-8 handler for Unicode normalization,
supporting all four forms. The following code shows the differences between the four
forms as applied to the string ﬁnal piñata. The first word includes the ﬁ ligature (U+FB01),
which is compatibility equivalent (but not canonically equivalent) to the separated
characters fi. The second word includes the character ñ (U+00F1), which is both compatibility
equivalent and canonically equivalent to the code points n and the combining tilde (U+0303).
$KCODE = 'u'
str = "ﬁnal piñata".chars
str.normalize(:d).to_s  # => "ﬁnal piñata" (ñ decomposed to n + U+0303)
str.normalize(:c).to_s  # => "ﬁnal piñata" (ñ composed, ligature kept)
str.normalize(:kd).to_s # => "final piñata" (ligature split, ñ decomposed)
str.normalize(:kc).to_s # => "final piñata" (ligature split, ñ composed)
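
Because the composed and decomposed forms render identically, the differences are easier to see by counting code points. The following is a minimal sketch, assuming Ruby 2.2 or later, where the core String#unicode_normalize method accepts the four forms as :nfd, :nfc, :nfkd, and :nfkc. It is not the ActiveSupport chars API used in this section, and the variable names are illustrative only.

```ruby
# Assumes Ruby 2.2+ (String#unicode_normalize is part of core there).
# Escapes are used so the exact code points are unambiguous.
str = "\uFB01nal pi\u00F1ata"  # fi ligature (U+FB01) and precomposed n-tilde (U+00F1)

nfd  = str.unicode_normalize(:nfd)   # canonical decomposition
nfc  = str.unicode_normalize(:nfc)   # canonical decomposition, then composition
nfkd = str.unicode_normalize(:nfkd)  # compatibility decomposition
nfkc = str.unicode_normalize(:nfkc)  # compatibility decomposition, then composition

# Count code points per form; the rendered text looks the same in each case.
counts = {
  :nfd  => nfd.codepoints.length,
  :nfc  => nfc.codepoints.length,
  :nfkd => nfkd.codepoints.length,
  :nfkc => nfkc.codepoints.length
}
```

The K forms are the only ones that replace the ligature with the plain letters f and i, which is why they are often a reasonable choice before searching or comparing user-visible text.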
Filtering UTF-8 Input
Although you may be UTF-8 clean through your entire system (UTF-8 text can be
entered anywhere and is displayed identically upon output), you are still at risk of
problems if you just accept user-provided strings as UTF-8. Users can provide invalid
UTF-8 text (not all byte sequences correspond to valid sequences of UTF-8 code
points). Users will even provide maliciously malformed UTF-8 text in an attempt to
crash or exploit your string-processing functions.
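
As a rough illustration of rejecting or repairing such input, here is a minimal sketch that assumes Ruby 2.1 or later, where String#valid_encoding? and String#scrub are available in core. The helper names strict_utf8 and lenient_utf8 are hypothetical and are not part of Rails or ActiveSupport.

```ruby
# Hypothetical helpers, assuming Ruby 2.1+ (String#scrub) and
# encoding-aware strings (Ruby 1.9+).

# Returns a UTF-8 tagged copy of the input, or nil if the bytes
# do not form valid UTF-8.
def strict_utf8(input)
  s = input.dup.force_encoding(Encoding::UTF_8)
  s.valid_encoding? ? s : nil
end

# Returns a UTF-8 copy with each invalid byte sequence replaced by
# U+FFFD (the Unicode replacement character).
def lenient_utf8(input)
  input.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

good = "caf\xC3\xA9".b   # valid UTF-8 bytes for "cafe" with an acute accent
bad  = "abc\xFFdef".b    # 0xFF can never appear in well-formed UTF-8

accepted = strict_utf8(good)
rejected = strict_utf8(bad)
repaired = lenient_utf8(bad)
```

Rejecting outright is the stricter policy; replacing invalid bytes keeps the request usable but changes the data, so which one fits depends on the application.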

