Random access in strings is a myth
Nov 20, 2012 · 3 minute read
Whenever a discussion of Unicode encodings comes up, someone eventually claims that UTF-32 has an advantage of random character access and character-by-character manipulation.
Guys, there is no such thing as random access to characters or character-by-character manipulation in real-life Unicode text.
UTF-32 gives random access to code points, which is an entirely different thing than characters. Here are things that you might think are easier in UTF-32 than in UTF-8, but in fact they are not:
Manipulating case. You can not take `str[i]`, change its case and put it back into `str[i]`. Some code points capitalize to several ones (the German letter ß is capitalized as “SS”; some ligatures like ﬀ or ﬁ do not have uppercase forms and become “FF” and “FI”). The Greek letter Σ has two different lowercase forms, “ς” and “σ”, depending on context.
Counting characters. Several code points may produce only one character (diacritics, Devanagari script), and one code point may need to be treated as two (ligatures). One string can have twice as many code points as another yet be shorter on screen. Unicode is ultimately a variable-length thing.
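For example, the same on-screen character can be one code point or two (a Python sketch; the same holds in any language that exposes code points):

```python
import unicodedata

precomposed = "\u00e9"       # é as a single code point
decomposed  = "e\u0301"      # é as 'e' + combining acute accent

# One user-perceived character, but different code point counts:
assert len(precomposed) == 1
assert len(decomposed) == 2

# Both render identically and normalize to the same form:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```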
Drawing or interaction. You can not draw code points individually for the same reasons. Mouse clicks on a displayed glyph may pick a range of code points, not just one index. Advancing the text cursor in response to the right arrow key (→) is much more complicated than incrementing a position number.
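To get a feel for why, here is a deliberately simplified Python sketch of cursor advancement that merely skips combining marks (the function name is mine; real grapheme-cluster segmentation is defined by UAX #29 and handles many more cases, such as Hangul jamo and emoji sequences):

```python
import unicodedata

def next_cursor_pos(s, i):
    """Advance one user-perceived character: step past the base
    code point, then past any combining marks attached to it.
    A rough approximation of UAX #29 grapheme boundaries."""
    i += 1
    while i < len(s) and unicodedata.combining(s[i]):
        i += 1
    return i

s = "e\u0301tude"   # the user sees "étude"
# One arrow-key press must skip two code points, not one:
assert next_cursor_pos(s, 0) == 2
```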
Searching. Searching for even an ASCII character like `'e'` by looking for its code point may falsely succeed when it is followed by a combining diacritic. And it will fail to find `'f'` when it is composed into a ligature.
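Both failure modes are easy to reproduce (a Python sketch):

```python
haystack = "e\u0301clair"   # the user sees "éclair"
# Code-point search falsely succeeds on the base letter of 'é':
assert haystack.find("e") == 0

ligature = "\ufb01nd"       # the user sees "ﬁnd" (fi ligature, U+FB01)
# ...and fails to find the 'f' the user plainly sees:
assert ligature.find("f") == -1
```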
Comparison. The only comparison you can do by looking at individually accessed code points is equivalent to `memcmp`. For example, in many languages users expect characters with diacritics to match their precomposed equivalents.
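A `memcmp`-style comparison declares two renderings of the same word different; normalization is the minimum needed to fix it (a Python sketch):

```python
import unicodedata

a = "caf\u00e9"     # "café" with precomposed é
b = "cafe\u0301"    # "café" with decomposed é

assert a != b       # code-point-by-code-point comparison says "different"
# After canonical normalization, they compare equal:
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```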
Ordering and sorting. Even “plain” U.S. English text that you would hope to be restricted to ASCII will contain fancy characters (think Beyoncé), and therefore comparing code points will not give you proper alphabetical ordering.
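Here is the failure in miniature (a Python sketch):

```python
# By code point, é (U+00E9) sorts after every ASCII letter, so "café"
# lands after "caff" even though e < f alphabetically:
assert sorted(["caff", "caf\u00e9"]) == ["caff", "caf\u00e9"]
# Correct ordering needs locale-aware collation (ICU, locale.strxfrm, ...).
```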
Random access to code points, the one thing that sets UTF-32 apart from UTF-8 and UTF-16, is rarely (if ever) useful. Random access to characters does not exist. Additionally, UTF-16 and UTF-32 have the disadvantages of wasted memory and endianness conversions.
I believe the only reason `wchar`s are still alive is that Windows NT, Java, and the ICU library made the wrong call of picking UCS-2 once, and now they are stuck with backward compatibility.
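The cost of that call shows up as soon as a code point falls outside the Basic Multilingual Plane: it takes a surrogate pair, so even UTF-16 "indices" stop being character indices. A Python sketch of the code-unit count:

```python
emoji = "\U0001F600"   # 😀, a single code point (U+1F600)

# In UTF-16 it is encoded as a surrogate pair: two 16-bit code units.
utf16_units = len(emoji.encode("utf-16-le")) // 2
assert utf16_units == 2
# This is why UTF-16-based string APIs report length 2 for this one character.
```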
Thinking more of it, today a freshly designed system should have:
- no random-access `[]` operator for objects of `String` in the first place
- `String` as a thin wrapper over a u8/u16/u32 array, to be able to take over any data without re-encoding
- UTF-8 as the default choice