A Day in the Life
Dear Japan,

Your clocks are running seven hours too fast. PLZFIXKTHXBYE!

I like my Japanese cow-orkers. I really do. Of course, I've never had any "face-time" with them which might explain this lack of animosity. But when I need to work with them I either have to be up at 3am (and sober enough to function) or I might as well send snail mail. One round-trip communication takes three days.

Poll: LOL creashunistas
x-posted to da brog.

I needed a database in Shift-JIS, the most common Japanese encoding. It's crap compared to Unicode... hell, it's crap compared to anycode, it being a freaky Microsoft hack to enforce their idea of codepages and still work with previous Japanese standards like JIS X 0201 and 0208. Wacky stuff if you're one of the couple dozen codepage supergeeks. I know I'm lame. Haz a cat. In fact, take two; they're small.

So what the hell do I want with a Shift-JIS DB when its suckage quotient is so high? It seems we have a bug, one that I not only pointed out about eight fucking years ago but which also should've been dealt with by the time $OurBigApp supported Unicode.

ATTENTION AMERICAN DATABASE-PROGRAMMING INFIDELS: There is a huge fucking difference between "character" and "byte". Not for you normally, but for most of the rest of the fucking world. One byte per character works fine for English. ASCII is also sufficient for Latin, Swahili and Hawaiian. It is rumoured that there are other languages, many of which have more characters than can be addressed with a single byte.

It turns out a field length of 5,000 characters isn't actually 5,000 characters but 5,000 bytes. For the Japanese this means that they can only squeeze in around 2,200 characters, not quite enough for what this field is designed to contain. But only in UTF-8. In Shift-JIS and UTF-16 with their fucking surrogate pairs the number becomes even more grim -- around 1600 characters.
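
The byte-versus-character gap is easy to check. Here's a rough sketch in Python, assuming a made-up sample string of pure kanji and kana (not the real field contents) and a hypothetical helper, chars_that_fit, that counts how many characters squeeze into a byte limit. Real text that mixes in ASCII fits more, so treat whatever it prints as illustrative only.

```python
# Hypothetical sample: 8 characters of kanji and kana, repeated to make a long text.
SAMPLE = "日本語のテキスト" * 1000


def chars_that_fit(text: str, encoding: str, byte_limit: int) -> int:
    """Count how many leading characters of text fit within byte_limit bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))  # bytes this one character needs
        if used + size > byte_limit:
            break
        used += size
        count += 1
    return count


if __name__ == "__main__":
    # utf-16-le is used so Python does not prepend a byte-order mark.
    for encoding in ("utf-8", "shift_jis", "utf-16-le"):
        fit = chars_that_fit(SAMPLE, encoding, 5000)
        print(f"{encoding}: {fit} characters fit in 5,000 bytes")
```

The point of the helper is that the limit is enforced in bytes while the user thinks in characters, so the only honest way to answer "how much fits" is to encode and count.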

So why didn't I just install a fucking Shift-JIS database on my own if I'm such a Mr Smarty-Pants? Setting up the DB is easy but our installer, which adds and shapes the schema, sucks. It's overly complicated (more than 90 screens of text and clicky goodness). That alone isn't a problem. I don't speak or read Chinese and I can still not only install but administer Windows in Chinese, both Traditional and Simplified. Microsoft sucks but at least their suckage is uniform across languages. Same dialogs, same layout, same buttons, same icons. Not so $OurBigApp. The Japanese installer is nothing like the English, which is nothing like the German, so I can't even run a side-by-side installation and select the correct radio buttons or fill in the proper fields.

I can read some Japanese but with so little chance to use the language I've lost much of it over the past 12 years. A few smrt peepul might think, "Duh! Just select the dialog text, copy and then paste it into Teh Ghugel Translator!" Yeah, I thought of that. Our programmers had different ideas:


A Day in the Life | 9 comments (9 topical, 0 hidden) | Trackback
u r wrong. by hulver (4.00 / 5) #1 Fri Jun 27, 2008 at 04:05:51 AM EST
Creating a label in a dialog box that can be copied and pasted takes more effort than one that can't, and is actually quite unusual. It also uses more resources, as each control that can be copied and pasted needs to have a window handle, whereas controls that can't (like standard label controls) don't need one.

So it's not that your programmers have turned something off.
Cheese is not a hat. - clock

That's YOUR reality, man! by ReallyEvilCanine (4.00 / 5) #2 Fri Jun 27, 2008 at 04:19:37 AM EST
Don't go fucking up a good rant with your girlie-man "facts".

the internet: amplifier of stupidity -- discordia

I thought by yicky yacky (2.00 / 0) #3 Fri Jun 27, 2008 at 04:28:52 AM EST

pretty much all the Japanese character sets were inside the BMP, so no UTF-16 surrogate pairs would be used. This would give you numBytes / 2 characters for every numBytes-quantified field (2,500 in the case of a 5000-byte field, not 1,600 as claimed).

In addition, many (most? all?) databases these days can use Unicode natively, in which case characters == numCharacters. Isn't creating a Shift-JIS DB entirely the wrong solution here? Can't you just reconstruct the DB with UCS-2 as the default character encoding for text fields and then convert the Shift-JIS characters on the way in?

Vacuity abhors a vacuum.
There are 40K Unihan on Plane 2 by ReallyEvilCanine (4.00 / 1) #4 Fri Jun 27, 2008 at 04:58:16 AM EST
Many are obscure but many are names. SQL Server 2008 still doesn't support UTF-8.
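
For anyone who wants to see the Plane 2 problem in bytes, a small Python sketch. The two sample characters are picked purely for illustration: 語 (U+8A9E) sits inside the BMP, while 𠮷 (U+20BB7) is a CJK Extension B ideograph on Plane 2. The helper names utf16_bytes and fits_shift_jis are hypothetical.

```python
BMP_CHAR = "\u8a9e"         # 語, inside the Basic Multilingual Plane
PLANE2_CHAR = "\U00020bb7"  # 𠮷, CJK Extension B, on Plane 2


def utf16_bytes(ch: str) -> int:
    """Bytes needed for one character in UTF-16 (little-endian, no BOM)."""
    return len(ch.encode("utf-16-le"))


def fits_shift_jis(ch: str) -> bool:
    """True if Python's JIS X 0208-based shift_jis codec can encode ch."""
    try:
        ch.encode("shift_jis")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in (BMP_CHAR, PLANE2_CHAR):
        print(f"U+{ord(ch):04X}: {utf16_bytes(ch)} bytes in UTF-16, "
              f"{len(ch.encode('utf-8'))} bytes in UTF-8, "
              f"Shift-JIS encodable: {fits_shift_jis(ch)}")
```

A BMP character costs one 16-bit unit in UTF-16, while anything above U+FFFF costs two (a surrogate pair), so a byte-limited field only holds numBytes / 2 characters as long as nothing from the supplementary planes shows up.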

the internet: amplifier of stupidity -- discordia

SQL server 2008 by TPD (2.00 / 0) #5 Fri Jun 27, 2008 at 05:50:23 AM EST

This is not the first time I have sworn about the lack of UTF-8 in SQL Server (and it most likely won't be the last).

why sit, when you can sit and swivel with The Ab-Swiveller™

Point-n-click has its cost by ReallyEvilCanine (4.00 / 1) #6 Fri Jun 27, 2008 at 06:06:27 AM EST
SQL Server isn't that bad, surprisingly. But to continue to refuse to support UTF-8 simply because MS gambled on a version of Unicode and got it wrong is just plain stupid. It would make things so much easier if they'd move Win7 to UTF-8 internally as well. And use the stupidly optional BOMs everywhere. Yes, I really could go on for hours about this sort of shit.

the internet: amplifier of stupidity -- discordia

None by yicky yacky (2.00 / 0) #7 Fri Jun 27, 2008 at 06:51:36 AM EST

of those Unihan are expressible in any of the Shift-JIS variants, either. SQL Server handles UCS-2 just fine (see the ntext* field definitions), which is a superset containing all the Shift-JISii.

Vacuity abhors a vacuum.
Shift-JIS by ucblockhead (4.00 / 2) #8 Fri Jun 27, 2008 at 07:00:49 AM EST
I could tell you a story about a certain group that discovered that a certain file format didn't have Shift-JIS but did have UTF-8 and so came up with the brilliant solution of simply putting the Shift-JIS data in the UTF-8 field and relying on the programmer to know that it wasn't really UTF-8.
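
A quick Python sketch of why that trick bites: Shift-JIS bytes are usually not valid UTF-8, so any reader that trusts the field label either errors out or gets garbage. The sample string and the helper name read_as_utf8 are assumptions for illustration, not anything from the actual file format.

```python
SAMPLE = "日本語"  # hypothetical sample text

# Bytes as the writer stored them: Shift-JIS, but labelled as UTF-8.
raw = SAMPLE.encode("shift_jis")


def read_as_utf8(data: bytes) -> str:
    """Decode strictly as UTF-8; fall back to replacement characters on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(read_as_utf8(raw)))  # what a reader trusting the label sees
    print(raw.decode("shift_jis"))  # what a reader who knows the secret sees
```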
[ucblockhead is] useless and subhuman
I want to laugh at this by ReallyEvilCanine (4.00 / 1) #9 Fri Jun 27, 2008 at 08:49:48 AM EST
But I don't think I'd ever be able to stop.

the internet: amplifier of stupidity -- discordia
