Random access in strings is a myth

Whenever a discussion of Unicode encodings comes up, someone eventually claims that UTF-32 has the advantage of random character access and character-by-character manipulation.

Guys, there is no such thing as random access to characters or character-by-character manipulation in real-life Unicode text.

UTF-32 gives random access to code points, which are an entirely different thing from characters. Here are things that you might think are easier in UTF-32 than in UTF-8, but in fact are not:

Random access to code points, which is what sets UTF-32 apart from UTF-8 and UTF-16, is rarely (if ever) useful. Random access to characters does not exist. Additionally, UTF-16 and UTF-32 have the disadvantages of wasted memory and endianness conversions.
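To make the code point versus character distinction concrete, here is a minimal Python sketch. It relies on the fact that Python's `str` is indexed by code point, so it behaves like UTF-32 for this purpose. The sample strings and variable names are my own illustrations, not part of any API.

```python
import unicodedata

# One user-perceived character: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
code_points = len(decomposed)   # 2 code points for one visible character
first = decomposed[0]           # "e" with the accent silently dropped

# NFC normalization helps only when a precomposed code point exists.
composed = unicodedata.normalize("NFC", decomposed)   # U+00E9, 1 code point
still_two = unicodedata.normalize("NFC", "x\u0301")   # no precomposed form, stays 2

# A family emoji: MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL.
# Rendered as a single glyph where supported, yet it is five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

def encoded_sizes(text):
    """Return the byte length of text in each encoding (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

family_sizes = encoded_sizes(family)   # even UTF-32 does not make this one unit
ascii_sizes = encoded_sizes("hello")   # mostly-ASCII text pays 2x or 4x

# Endianness: the same code point has two possible byte orders in UTF-16.
little = "A".encode("utf-16-le")   # b"A\x00"
big = "A".encode("utf-16-be")      # b"\x00A"
```

Indexing `decomposed[0]` is exactly the kind of "random access" UTF-32 offers, and it returns half a character. Getting whole characters means walking grapheme cluster boundaries, which needs a scan in any encoding.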

I believe the only reason wchars are still alive is that Windows NT, Java, and the ICU library made the wrong call of picking UCS-2 once, and now they are stuck with backward compatibility. Thinking more about it, a freshly designed system today should have:
