Similarly, you can find the previous character
by only working backward.
• Because of these unique prefixes, no encoding of a character is a substring of
another character's encoding. For example, the ASCII character "a" is represented
by 0x61 in UTF-8. No other character's encoding will contain the byte
0x61, so if you see that byte, you know that it represents the character "a." This
ingenious design decision means that string searching works with standard,
non-UTF-8-aware algorithms.
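
As a rough illustration of that last point, the sketch below searches for one
UTF-8 string inside another by comparing raw bytes only. The helper name
byte_index is hypothetical, not part of Ruby or Rails; it relies on String#b
to get a binary copy of each string, so the search itself has no knowledge
of characters.

```ruby
# Hypothetical helper: find needle in haystack by comparing raw bytes only.
# Returns the byte offset of the first match, or nil if there is none.
def byte_index(haystack, needle)
  h = haystack.b  # binary (ASCII-8BIT) copy; the search sees bytes, not characters
  n = needle.b
  h.index(n)
end

haystack = "na\u00EFve caf\u00E9"   # "naive cafe" with two accented letters
needle   = "caf\u00E9"

offset = byte_index(haystack, needle)

# The match found by the byte-level search begins on a character boundary,
# so slicing the original string at that offset yields the needle intact.
match = haystack.byteslice(offset, needle.bytesize)
puts match == needle
```

Note that the offset returned here is a byte offset, which differs from the
character offset whenever a multibyte character precedes the match.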
However, UTF-8's similarity to previous encodings can lead to confusion. When
working with UTF-8 text, there are more things to think about:
• The number of code points in a string cannot be determined from the number of
bytes. The entire string must be read and processed to determine the number
of characters.
• Even when the number of code points is known, features such as ligatures, combining
characters, bidi text, and control characters make it impossible to determine
how much space is needed to display a string without parsing every byte.
• UTF-8 strings cannot be cut at arbitrary byte boundaries; they must be cut on
character boundaries. Due to the design of UTF-8, it is easy to find character
boundaries with simple bit operations, but this must still be taken into account.
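
The first and last points above can be sketched with a few bit operations.
This is a minimal sketch that assumes the input is valid UTF-8; the method
names continuation_byte?, count_code_points, and truncate_bytes are
hypothetical and chosen only for illustration. It relies on the fact that
every continuation byte in UTF-8 has the bit pattern 10xxxxxx.

```ruby
# True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# Count code points by counting every byte that is not a continuation byte.
# The whole string has to be scanned; the byte length alone is not enough.
def count_code_points(str)
  str.each_byte.count { |b| !continuation_byte?(b) }
end

# Return the longest prefix of str that is at most max_bytes long and
# ends on a character boundary.
def truncate_bytes(str, max_bytes)
  return str if str.bytesize <= max_bytes
  cut = max_bytes
  # Work backward until the byte at the cut position starts a character.
  cut -= 1 while cut > 0 && continuation_byte?(str.getbyte(cut))
  str.byteslice(0, cut)
end

s = "caf\u00E9s"                 # 5 code points stored in 6 bytes
puts s.bytesize
puts count_code_points(s)
puts truncate_bytes(s, 4)        # a cut at byte 4 would split the accented letter
```

A naive s.byteslice(0, 4) would keep only the first byte of the two-byte
accented letter and produce an invalid string; stepping backward to the
nearest leading byte avoids that.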
UTF-8 has largely won out over other encodings, especially on the Internet. Later in
this chapter, we will examine the problems encountered when working with UTF-8
text in Rails, and we will look at the solutions we have available.