
Brad Ediger

"Advanced Rails"


Rails and Unicode | 245
Paul Battley wrote an article addressing the issue of filtering untrusted UTF-8 strings.*
As with most other hard problems in Rails, we cheat. In this case, the iconv library
can clean up UTF-8 strings for us:
require 'iconv'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
The Iconv.new line creates a new Iconv object to translate potentially invalid UTF-8
data into UTF-8 data with invalid characters ignored. The next line works around an
Iconv bug: it will not detect an invalid byte at the end of a string. Therefore, we add a
space (a known-valid byte) and chop it off after performing the conversion.
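Iconv was removed from Ruby's standard library in Ruby 2.0, so the snippet above needs the separate iconv gem on newer interpreters. As a rough sketch of the same cleanup on Ruby 2.1 or later, the built-in String#scrub can be swapped in for Iconv. The helper name clean_utf8 is hypothetical and is not taken from Battley's article.

```ruby
# Hypothetical helper (not from the original text): drop invalid UTF-8 byte
# sequences from untrusted input using String#scrub instead of Iconv.
def clean_utf8(untrusted_string)
  # Work on a copy, and tag the bytes as UTF-8 so that scrub checks them
  # against UTF-8 rules regardless of the encoding they arrived with.
  candidate = untrusted_string.dup.force_encoding(Encoding::UTF_8)
  # Replace each invalid byte sequence with the empty string.
  candidate.scrub('')
end

clean_utf8("abc\xFFdef")  # => "abcdef"
clean_utf8("caf\xC3")     # => "caf" (truncated multibyte sequence at the end)
```

Unlike the Iconv version, this approach needs no trailing-space workaround, because scrub also catches an invalid byte at the end of the string.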
Ilya Grigorik shows how to use the Oniguruma regular expression engine to filter out
control characters (of the Cx classes).† Note that the Oniguruma engine is standard
in Ruby 1.9, but is also available for Ruby 1.8 (gem install oniguruma).
require 'oniguruma'
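# Find all Cx category characters. The pattern is single-quoted so that Ruby
# keeps the backslash in \p{C}; in a double-quoted string "\p" collapses to "p".
reg = Oniguruma::ORegexp.new('\p{C}', {:encoding => Oniguruma::ENCODING_UTF8})
# Erase the Cx characters from our validated string
filtered_string = reg.gsub(valid_string, '')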
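On Ruby 1.9 and later, the built-in Regexp class understands the same \p{C} property, so the gem is not required there. A minimal sketch, assuming the input has already been cleaned to valid UTF-8 (the helper name strip_control_characters is hypothetical):

```ruby
# Hypothetical helper (not from the original text): remove every character in
# Unicode general category C (control, format, surrogate, private use,
# unassigned) from a string that is already valid UTF-8.
def strip_control_characters(valid_string)
  # gsub raises ArgumentError on invalid byte sequences, so validate first.
  valid_string.gsub(/\p{C}/, '')
end

# U+0000 is a control character (Cc); U+200B is a format character (Cf).
strip_control_characters("hello\u0000 world\u200B")  # => "hello world"
```

Note that category C includes tab and newline, so a filter this broad is best suited to single-line fields; multiline text would need a narrower pattern.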
Storing UTF-8
Proper i18n requires that your character set be correctly processed in the application
and correctly stored in the database. For most Rails applications, this means setting up
the database and connection to be UTF-8 clean.

