Always use utf-8
By ucblockhead (Thu Apr 10, 2014 at 11:43:38 PM EST)
Not sure why I'm posting something technical here.  Probably because I don't have a real blog.

I was reading a Stack Overflow discussion on Unicode this morning when someone put forth the conventional wisdom that utf-8 was good for Western languages while utf-16 was better for the rest of the world.  While reading the standard argument about utf-8 allowing a single byte for Western languages but requiring three for Asian ones, I started wondering if this was a bit too simplistic, so I decided to actually check.

The argument struck me as overly simplistic because it assumes that a "character" is the same in each language.  But this is just not the case.  I've been knee-deep in localization for the last year, and have learned this the hard way, looking at Russian strings that flow like waterfalls down the screen while their Chinese equivalents barely take a line.

Since the argument is about localizing an application, I figured the best way to approach the subject would be to do some elementary school statistics on a localized application.  Fortunately, I have one.  This is a real application, used by millions of people internationally.  It's in production.  Localization has been done by localization "experts" (well, not always) and tested by localization testers.  I have reason to believe that the translations are good, so I figure this is as good a sample as any.

It's a little small.  This application has 542 strings, translated into 19 languages.  (14 if you allow for two variants of English, French, Spanish, Chinese and Portuguese.)  I have access to another application that has 1800 strings.  I get similar results, but the data isn't quite as clean, so I am using this.

So I took these strings, converted to utf-8, utf-16 and utf-32.  Here's what we have, sorted left to right, smallest files to biggest.

The first thing that should stand out is that the Asian locales have very small strings.  The two Chinese variants require a third the number of characters as the average Western language to express a particular thought.  That right there is most of the story.  Yes, a Chinese character in utf-8 takes three times the bytes as a Roman character, but a Chinese word takes just about the same number of bytes as its Western counterpart.  So unless the Chinese version of your application needs more memory, the encoding itself is a wash.
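
For anyone who wants to repeat the measurement on their own string tables, here is a minimal sketch in Python.  The sample strings are made up for illustration and are not the application's real data, and the function name `encoded_sizes` is hypothetical.  The `-le` codecs are used so that no byte-order mark gets counted for every string.

```python
def encoded_sizes(strings):
    """Return the total bytes needed to store the strings in each encoding."""
    # The -le variants avoid counting a byte-order mark per string.
    encodings = {"utf-8": "utf-8", "utf-16": "utf-16-le", "utf-32": "utf-32-le"}
    return {
        name: sum(len(s.encode(codec)) for s in strings)
        for name, codec in encodings.items()
    }


# Hypothetical sample data, not the application's real string table.
samples = {
    "en": ["Settings", "Sign in"],
    "zh": ["设置", "登录"],
}

for locale, strings in samples.items():
    print(locale, encoded_sizes(strings))
```

With these toy samples the Chinese strings need 12 bytes in utf-8 against 15 for the English ones, which is the same effect as above in miniature.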

But there's some other interesting stuff:

When you hear the arguments against utf-8 for Asian languages, you get the impression that the space required for these characters will greatly increase.  But with actual strings, that is not the case.  Instead, the increase is marginal.  Not double, but more like 10-15%.  Why?

Well, because Asian countries use Latin characters.  Here's an actual string from the Chinese localization files: "%1年 12月,  ".  Roman characters are used for numbers, for some trademarked names and for links.  The Chinese file has the word "Facebook" in it.

This isn't Western arrogance.  The Japanese file is much the same, and since I work for a Japanese company, the Japanese text isn't translated; it's written from scratch.

So if you go to utf-16, the Chinese characters take less space, but all the Roman characters double in size, eating up a lot of the gains.
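
To make that concrete, here is a small sketch that counts the bytes for the Chinese string quoted earlier.  The trailing whitespace is trimmed here, so treat the exact totals as illustrative rather than as figures from the real file.

```python
# The string quoted above, with trailing whitespace trimmed for clarity.
sample = "%1年 12月,"

ascii_chars = sum(1 for ch in sample if ord(ch) < 0x80)
other_chars = len(sample) - ascii_chars

utf8_size = len(sample.encode("utf-8"))
utf16_size = len(sample.encode("utf-16-le"))  # -le: no byte-order mark

print(f"{ascii_chars} ASCII + {other_chars} non-ASCII characters")
print(f"utf-8: {utf8_size} bytes, utf-16: {utf16_size} bytes")
```

Six of the eight characters are ASCII, so this particular string is 12 bytes in utf-8 and 16 in utf-16.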

The most significant benefit from utf-16 is not for Chinese, but for Japanese:

This is because Japanese uses the phonetic Hiragana syllabary along with Chinese characters, and unfortunately these end up being three bytes each in utf-8. Even so, going to utf-16 from utf-8 only reduces memory usage by a quarter.

It turns out that the language that really blows file size out of the water in utf-8 is not an Asian language, but Russian:

This is because Cyrillic requires two bytes per character in utf-8, and because Russian is written with an alphabet and is long-winded, it uses more raw characters than most other languages.  Then multiply by two.  And since Cyrillic is still two bytes per character in utf-16, and Roman characters again show up here and there, utf-16 just makes things worse.
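
The per-character costs behind all of this can be checked with a short sketch.  The words are arbitrary greetings chosen for illustration, not strings from the application, and `bytes_per_char` is a hypothetical helper name.

```python
def bytes_per_char(text):
    """Average bytes per character in utf-8 and utf-16 (no byte-order mark)."""
    n = len(text)
    return len(text.encode("utf-8")) / n, len(text.encode("utf-16-le")) / n


# Illustrative words only; these are not taken from the application.
words = {
    "Latin": "hello",
    "Cyrillic": "привет",
    "Hiragana": "こんにちは",
    "Chinese": "你好",
}

for script, word in words.items():
    u8, u16 = bytes_per_char(word)
    print(f"{script:9} utf-8: {u8:.0f} B/char  utf-16: {u16:.0f} B/char")
```

Latin is the only script here that gets cheaper in utf-8, Cyrillic costs the same either way, and Hiragana and Chinese characters go from three bytes to two in utf-16.  What matters in practice is how many of each kind a real string contains.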

So TL;DR - use utf-8 for everything.

Always use utf-8 | 16 comments (16 topical, 0 hidden) | Trackback
Pretty cool. by ana (4.00 / 1) #1 Fri Apr 11, 2014 at 09:12:19 AM EST
I have no application for it, but I enjoyed reading your analysis, and the linguistics behind the results. 

I now know what the noise that is usually spelled "lolwhut" sounds like. --Kellnerin

That was interesting! by Dr Thrustgood (4.00 / 2) #2 Fri Apr 11, 2014 at 09:46:23 AM EST
Thank you for writing this :-)

I've heard this. by dark nowhere (4.00 / 1) #3 Fri Apr 11, 2014 at 04:21:41 PM EST
I've heard a lot of things. The only thing I can tell for sure is more people end up using UTF-16 because they were told/forced to than people actually making the decision for themselves.

For me the problem isn't what's best (what's best is we pick a format that yadda yadda spoilers: UTF-8), but support for everything else is needed because it continues to exist. Here's some Wikipedia:

UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2.

UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; and the Qt cross-platform graphical widget toolkit.

The Joliet file system, used in CD-ROM media, encodes file names using UCS-2BE (up to sixty-four Unicode characters per file name).

The Python language environment officially only uses UCS-2 internally since version 2.0 … Since Python 3.3, strings are stored in one of ASCII, UCS-2, or UTF-32, depending on which code points are in the string.

Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.

So I choose UTF-8 when I have a choice. A lot of the time you don't get a choice. It doesn't always mean you have to do non-UTF-8 operations, but it will still be there a lot of the time. I feel like this might be a self-perpetuating evil more than anything.

P.S. Rad graphs.

See you, space cowboy.

yeah by ucblockhead (2.00 / 0) #4 Fri Apr 11, 2014 at 07:42:38 PM EST
Truth is, it's not all that important to a non-embedded dev.  Our files are all in the 60k range where (IIRC) we have 1GB to play with.  We're using JavaScript, which seems to like UTF-8, which means they got at least one thing right.
[ucblockhead is] useless and subhuman
[ Parent ]
I hadn't considered embedded. by dark nowhere (2.00 / 0) #6 Fri Apr 11, 2014 at 08:57:19 PM EST
In the past I just did ASCII on small memory systems. It's not like I'd even have all the glyphs bitmapped anyway.

JS is a surprising choice—are you writing JS directly or transpiling? I guess node is all the rage now, but yeah I'm not too surprised its UTF-8 support is good. IIRC JSON specifies it. I do like JS fwiw... there is a great language in there if you can ignore/forbid the rest of the language.

See you, space cowboy.

[ Parent ]
JavaScript is taking over by ucblockhead (2.00 / 0) #8 Sat Apr 12, 2014 at 11:49:38 AM EST
Anything that has a browser ported to it has a JavaScript platform.  And honestly, it is much easier to find good JavaScript coders than good C++ coders.

I personally hate JavaScript.  Too many things that should be language features require libraries and the basic syntax is atrocious.
[ucblockhead is] useless and subhuman

[ Parent ]
Language features? by dark nowhere (2.00 / 0) #9 Sat Apr 12, 2014 at 12:55:27 PM EST
Where I come from they're best put in libraries if a) possible and b) seamlessly so. JS has a few warts that aren't amenable to that, but beyond scoping and parallelism I can't think of any. If those are your feature gripes, I agree. On the other hand, good HOFs and prototypes are magical, f/e katy.js, but C++ doesn't exactly prime you to see the value in it, or even write the kind of code where it would help.

Even so, the syntax is awful. I guess there's not much you can do about that if you're hiring JS devs and not specifically functional devs or whatever. If it was me, I'd happily write Roy (despite bullet #2) or something else instead. But I know you don't always get a choice.

See you, space cowboy.

[ Parent ]
When being able to import a module by Scrymarch (2.00 / 0) #11 Tue Apr 15, 2014 at 09:19:51 AM EST
Is itself a hacky "module" function, for which there is no clear standard, something about the balance is wrong.

Iambic Web Certified

[ Parent ]
That's a tricky subject. by dark nowhere (2.00 / 0) #12 Tue Apr 15, 2014 at 04:16:15 PM EST
It should be standardized and in the standard library, but not as a language built-in in my opinion.

But what do you mean by module? Do you want a loader to go with that? Packaging, versioning? To start with you basically have an object, and beyond that people can't all agree on what they want or what it should look like where they agree on what they want.

My take is once you've got two modules loaded they're inescapably going to look like objects, and they will play nice with one another even if they were packaged, versioned, loaded and verified by completely different approaches. The hard part is agreeing on how to version and where to load from and how to verify and so on. And everyone gets this wrong, at least at first.

Sorry about the earful, I seem to have opinions on this.

See you, space cowboy.

[ Parent ]
Yes and no by Scrymarch (2.00 / 0) #13 Fri Apr 18, 2014 at 10:16:36 AM EST
A module does indeed make sense to expose as a first class object, eg Python does this, but that doesn't change the need for a standard, consistent mechanism for referring to one. And package and version management, and loader variation, are very useful, but there's no need to muddle that in with the syntax capturing that dependency at the method call level.

Whether you end up with syntax like "import module.module" or import(module) doesn't matter so long as it's a simple syntax to a standard mechanism.

Iambic Web Certified

[ Parent ]
I think I agree, by dark nowhere (2.00 / 0) #14 Fri Apr 18, 2014 at 02:39:32 PM EST
the first thing I said was the need for standardization. Perhaps there's a nuance I missed.

See you, space cowboy.

[ Parent ]
Also by Captain Tenille (4.00 / 1) #5 Fri Apr 11, 2014 at 08:14:05 PM EST
Many files with non-Latin Unicode characters will still end up being smaller in utf8 compared to utf16 if they aren't raw text. HTML files, LaTeX documents, etc. will have lots of ASCII characters in them for tags and the like, so even though the characters of the text are bigger in utf8 the files ultimately end up smaller because of all the characters that end up taking one byte.
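
A quick sketch of the effect, using a made-up HTML fragment around some Japanese text (the markup and class name are invented for the example):

```python
text = "こんにちは世界"  # illustrative Japanese text
html = f'<p class="greeting"><a href="/hello">{text}</a></p>'  # made-up markup

for label, s in (("raw text", text), ("with markup", html)):
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))  # -le: no byte-order mark
    print(f"{label:12} utf-8: {u8:3}  utf-16: {u16:3}")
```

The raw text is smaller in utf-16, but once the ASCII tags are wrapped around it the utf-8 version comes out smaller.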


/* You are not expected to understand this. */

+1 VSTFP by R343L (4.00 / 1) #7 Sat Apr 12, 2014 at 12:57:46 AM EST
Do we do that still?

Anyway, nice empirical story. I tend to use utf-8 whenever I can simply because it's simpler. It just doesn't matter what kind of language I'm storing. It will probably just work and when it doesn't it will generally be obvious. Plus language support is pretty ubiquitous at this point.

I actually helped debug a weird encoding problem last summer between a legacy application and the replacement application stack where the individual bytes of the UTF-8 encoded values were being treated as Unicode code points. It was really obvious to spot bad data after a while because there are patterns to what two-byte characters look like when their utf-8 is printed as hex.
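
A minimal sketch of that kind of failure, assuming the bug was simply that each byte got turned into its own code point (the real applications may have done something more involved):

```python
original = "é"                         # U+00E9, a single code point
utf8_bytes = original.encode("utf-8")  # two bytes: 0xC3 0xA9

# The bug: each byte is taken as a code point of its own.
garbled = "".join(chr(b) for b in utf8_bytes)
print(utf8_bytes.hex(), repr(garbled))  # c3a9 'Ã©'

# If nothing else has touched the data, the damage can be reversed.
repaired = bytes(ord(ch) for ch in garbled).decode("utf-8")
print(repaired == original)
```

The pattern comes from how utf-8 lays out two-byte sequences: a lead byte in the range 0xC2 to 0xDF followed by a continuation byte in 0x80 to 0xBF, which is why the garbage so often starts with Â or Ã.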

"There will be time, there will be time / To prepare a face to meet the faces that you meet." -- Eliot

VS2FP OMFG UTF-8 FTW! by TheophileEscargot (2.00 / 0) #10 Sun Apr 13, 2014 at 04:41:48 AM EST

It is unlikely that the good of a snail should reside in its shell: so is it likely that the good of a man should?
but who cares how big your language files are? by the mariner (2.00 / 0) #15 Fri Apr 18, 2014 at 05:01:05 PM EST
are you developing for bill and ted's atari home computers or something?

Transfer speeds by ucblockhead (4.00 / 1) #16 Sat Apr 19, 2014 at 07:40:40 PM EST
The application is delivered over the Internet.  You want to minimize that.

But no, it's not a real issue.  I was just saying that the conventional wisdom about UTF-8 and Asian characters is wrong.
[ucblockhead is] useless and subhuman

[ Parent ]