A Day in the Life
By ReallyEvilCanine (Thu Jul 20, 2006 at 03:38:03 AM EST) A Day in the Life, WTF, Unicode, Swearing in Arabic, pie (all tags)
åÐÇ ãäíß

There are worse things than the 17s I deal with on a daily basis. Like clever admins. Creative admins. Clever and creative admins who, like me, do whatever the hell they have to in order to get a system up and running. Mark from $CompCorp is one such admin; his cleverness resulted in what appeared to be a complete database corruption.

Luckily it was caught in user acceptance testing on a copy of the working database.

x-posted to the blog, sans poll.

$CompCorp were seeing corruptions all over the place. Where Arabic letters should have been were, instead, lots of screwy strings like the title of this entry which meant Unicode was involved. I asked for screenshots, dumps and exports. They'd escalated this internally to their VP before bothering to submit the ticket so that as soon as I took it, Mr VeePee escalated it here with us. Despite sending answers within hours of receiving their updates I still got jumped on by their people and ours insisting I wasn't working fast enough and that my solutions sucked.

I got very lucky with additional info they sent me: the display in $OurBigApp had multiple lines in Arabic and English and only some of these were corrupt. I finally had my good and bad pieces. Better still, I knew the primary keys for these rows. "Please send me the results of 'SELECT PriKey, Desc FROM WHERE PriKey IN ($nn-foo, nn-bar);'." It took them a day to get back to me but that didn't stop their managers howling Monday afternoon and screaming for my head.

When the results came back Tuesday I was at first confused. Where text appeared corrupt in the application, it looked perfect in the SQL client, but where it was fine in the application, it came out as garbage in the SQL client. The latter is normal since the client isn't Unicode-compliant and, sure enough, changing their codepage got the data to display correctly on SQL dumps.

I talked about the problem with the other two guys in I18N who might know but all we could think of was "fonts", though this couldn't be it since I had both good and bad displaying in all circumstances reproducibly. The my-head-shaped dent in my desk grew ever so slightly.

Then it finally clicked. I called and asked if some of the data had been imported. They bitched about how long it was taking before finally saying that much data had been. And no, they hadn't noticed that only imported data appeared corrupt. I got the specs on their database and didn't know whether to laugh or cry.

Their admin had been clever. Very clever. We only supported codepage 1252 or ISO8859P1 but these are only for Western European characters, not Arabic which uses either 1256 or ISO8859P6. Nevertheless the admin managed to get the $CompCorp system running with Arabic text thanks to Windows' "helpfulness". That's fine as long as you're isolated and insulated. Moving to a Unicode database was a good move but broke the insulation.

Hi, When you typed the letter "thal" (character 0xD0) what was saved was actually an Icelandic "eth" (Ð). You then moved this data into a Unicode database. Translation was done during the move from Western European -- not Arabic -- so that the raw data for the word "green" (spelled: seen, beh, zain) was seen by the system as 0xD3, 0xC8, 0xD2. In the 8859P1 code page these three characters are "ÓÈÒ". This is how the corruption took place.

The reason you saw the data "correctly" was that Windows converted the characters to Arabic based on the codepage you were using on the clients, ignoring the database and the indication that these were Western European characters.
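The mix-up in that explanation can be reproduced in a few lines. This is only a sketch: it assumes Python's built-in cp1256 and latin-1 codecs as stand-ins for the client and database codepages, and the variable names are illustrative, not anything from $OurBigApp.

```python
# Bytes the Arabic client produced for seen, beh, zain under codepage 1256.
raw = bytes([0xD3, 0xC8, 0xD2])

# What the user meant: the same bytes read as Windows-1256 (Arabic).
intended = raw.decode("cp1256")

# What the database was told it held: the same bytes read as ISO 8859-1.
stored = raw.decode("latin-1")

# A move to Unicode converts the *declared* characters, so the Latin
# letters become permanent code points (round trip through UTF-8 here).
migrated = stored.encode("utf-8").decode("utf-8")

# The single letter from the first example: thal versus eth.
thal = bytes([0xD0]).decode("cp1256")
eth = bytes([0xD0]).decode("latin-1")

print(intended)  # سبز
print(migrated)  # ÓÈÒ
print(thal, eth)  # ذ Ð
```

Before the move, nothing ever interpreted the bytes except the Arabic clients, which is why the old system looked healthy.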

I sent a copy of the solution around to our I18N people and responses were along the lines of "Holy shit." That their database isn't completely corrupted due to Windows' internal use of Unicode amazes us all.

Two days later and still no update. No thank-you. No confirmation. Nothing. I expect they'll also slam me on the survey for having taken too long to solve a problem which by their own admission, my cow-orkers never would've seen.

The title of this entry is corrupted just like $CompCorp's data. It should read هذا منيك (Hetha Mnäyik): "This is bullshit!"
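For the curious, the title can be un-corrupted by reversing the wrong guess. A minimal sketch, assuming the damage is exactly "codepage 1256 bytes read as codepage 1252"; `repair` is a hypothetical helper, not part of any tool mentioned here.

```python
def repair(text: str) -> str:
    """Re-read text that was decoded as cp1252 but was really cp1256."""
    try:
        # Recover the original bytes, then decode them as Arabic.
        return text.encode("cp1252").decode("cp1256")
    except UnicodeError:
        # Real Arabic text has no cp1252 encoding, so rows that were
        # entered correctly fall through untouched.
        return text


title = "åÐÇ ãäíß"
print(repair(title))  # هذا منيك
```

Plain ASCII survives the round trip unchanged, but genuine Western European text with accented letters would be mangled, so a fix like this should only ever be pointed at rows known to have come from the import.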

A Day in the Life | 4 comments (4 topical, 0 hidden) | Trackback
Oh, all right then, have a comment by Rogerborg (4.00 / 1) #1 Thu Jul 20, 2006 at 08:10:38 AM EST
I admire people who deal with localisation[1], in much the same way that I admire sewer cleaners; a tacit respect for their keeping the rest of us from drowning in unwanted shit, vended from a distance.

[1] I18n is bass ackwards.  If you're thinking in terms of going from local to "international", then you have already lost.

Metus amatores matrum compescit, non clementia.

This is me biting by ReallyEvilCanine (2.00 / 0) #2 Thu Jul 20, 2006 at 10:25:21 AM EST
Localisation is taking an app designed to work somewhere else and making it work here where people have a different first day of the week, a different numbering system, different number dividers, different date orders, etc. How is it losing when you make an app work in a way the locals who don't speak Heathenforeignerish can understand?

the internet: amplifier of stupidity -- discordia

This is me on my hobby horse by Rogerborg (2.00 / 0) #3 Thu Jul 20, 2006 at 11:02:13 AM EST
I shall elucidate:
  • Localisation is taking an app designed to work anywhere and making it work somewhere.  Localisation is good.
  • Internationalisation is taking an app designed to work here and trying to make it work there.  Internationalisation is too late.
It's a semantic distinction, but as language is the crux of the matter, I feel that it's an important one.  I guess you can call it what you want (if you don't believe that language influences attitudes), but if you find yourself working on a system with a concept of a "default" language or (worse, IMO), "English" or "en", then there are going to be some painful learning curves ahead.

Metus amatores matrum compescit, non clementia.
Must. Not. Agree. With. Borg. by ReallyEvilCanine (4.00 / 1) #4 Thu Jul 20, 2006 at 12:43:15 PM EST
Whoops. I was supposed to defend I18N, not L10N. So we're cool on L10N then.

The primary reasons for I18N are shitty codepages and untraveled coders. We turned our app -- which needed a different build for each language only five years ago -- into an app which works in any language, even Hebrew and Arabic despite the difficulties and OS defects (Windoze and *nix) in supporting bi-di.

When the apps were first created, they were for an American market. They turned out to be good enough to go global. Then an I18N team was formed and we found all the problems and showed the programmers everything they had to look out for. Who knew that the Spaniards would get so touchy about a calendar showing Sunday as the first day of the week? That ain't anything that most programmers concern themselves with. Except that the Microsoft calendar APIs themselves had no means to modify the display order.

Unicode is still a fractioned group with some still (wrongly) extolling the virtues of shitty UCS-2. Only recently was the right decision made to use a fourth character rather than surrogate pairs for additional display. How do you provide software that works everywhere when the infrastructure itself hasn't been built?

There are so many variants of Arabic that we've almost run out of letters to hang on the digraphs. There are so many geeks with their little hang-ups that we now have to deal with Cuneiform! Yes, Cuneiform was added in the Unicode 5 spec. While I rather doubt that any of our customers are going to use a writing method designed for sticks and clay tablets, we do have them increasing their reach to even more remote areas and we have to support all sorts of weird squiggly lines. We do that now. It's a lot less painful to support these days, all because of I18N.

In an ideal world the unwashed hippie coders would know things about the world beyond their cubicles and outside a 3-block radius of their flats but we don't pay 'em enough to ever actually get out there and see it.

the internet: amplifier of stupidity -- discordia
