Rails and Unicode | 245
Paul Battley wrote an article addressing the issue of filtering untrusted UTF-8 strings.*
As with most other hard problems in Rails, we cheat. In this case, the iconv library
can clean up UTF-8 strings for us:
require 'iconv'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
The Iconv.new line creates a new Iconv object to translate potentially invalid UTF-8
data into UTF-8 data with invalid characters ignored. The next line works around an
Iconv bug: it will not detect an invalid byte at the end of a string. Therefore, we add a
space (a known-valid byte) and chop it off after performing the conversion.
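Iconv was deprecated in Ruby 1.9.3 and dropped from the standard library in Ruby 2.0, so the snippet above needs the separate iconv gem on newer interpreters. As a rough equivalent, and not the approach from the article cited above, the following sketch assumes Ruby 2.1 or later, where String#scrub is built in. The helper name clean_utf8 is hypothetical.

```ruby
# Hypothetical helper; assumes Ruby 2.1+, where String#scrub is available.
def clean_utf8(untrusted_string)
  # Copy the bytes and label them as UTF-8 without transcoding anything.
  candidate = untrusted_string.dup.force_encoding(Encoding::UTF_8)
  # Drop every invalid byte sequence, including one at the very end,
  # so the trailing-space workaround is not needed here.
  candidate.scrub('')
end

# Raw bytes containing a stray 0xFF and a truncated sequence at the end.
untrusted_string = "caf\xC3\xA9 \xFF!\xC3".b
valid_string = clean_utf8(untrusted_string)
# valid_string => "café !"
```

Passing an empty replacement to scrub mirrors the //IGNORE behavior; calling scrub with no argument would instead substitute U+FFFD for each invalid sequence.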
Ilya Grigorik shows how to use the Oniguruma regular expression engine to filter out
control characters (of the Cx classes).† Note that the Oniguruma engine is standard
in Ruby 1.9, but is also available for Ruby 1.8 (gem install oniguruma).
require 'oniguruma'
# Find all Cx category characters (single quotes keep the backslash in \p intact)
reg = Oniguruma::ORegexp.new('\p{C}', {:encoding => Oniguruma::ENCODING_UTF8})
# Erase the Cx characters from our validated string
filtered_string = reg.gsub(valid_string, '')
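On Ruby 1.9 and later, the built-in Regexp class understands Unicode property classes, so the same filter can be sketched without the oniguruma gem. This is a minimal illustration that assumes the input is already valid UTF-8; the helper name strip_control_chars and the sample string are hypothetical.

```ruby
# Hypothetical helper; assumes Ruby 1.9+ and an already-validated UTF-8 string.
def strip_control_chars(valid_string)
  # \p{C} matches the Unicode "Other" categories (Cc, Cf, and so on),
  # which includes tabs and newlines as well as invisible format characters.
  valid_string.gsub(/\p{C}/, '')
end

# NUL (Cc), tab (Cc), and zero-width space (Cf) are all removed.
sample = "a\u0000b\tc\u200Bd"
filtered_string = strip_control_chars(sample)
# filtered_string => "abcd"
```

Because \p{C} also removes newlines and tabs, text that should keep its line breaks would need a narrower pattern.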
Storing UTF-8
Proper i18n requires that your character set be correctly processed in the application
and correctly stored in the database. For most Rails applications, this means setting up
the database and connection to be UTF-8 clean.