So-called “localization professionals” like to point out how complex and difficult internationalization and localization are. I'll let you in on a secret: they aren't. It's actually quite easy once you've got your head around a couple of basic concepts. There are only two mistakes that can make internationalization and localization a real burden: thinking too much about it and thinking too little about it.

Paradoxically, the comma-delimited file format csv is a typical example of both of these errors, at least in the variety output by Microsoft Office and a lot of third-party applications.

MS Office is over-localized

It can be argued that the Office interface, especially Excel's, is way over-localized. To give you an example, formulas in Excel use locale-specific keywords: SUM becomes SUMME in a German Excel and SOMME in a French one. I guess it's supposed to make our lives easier, but it doesn't. Say you work abroad, but you brought your own notebook with you: you'll always have to remember which machine you're working on and which locale it is set to. Even worse, when you look up information on the Internet, you cannot use, say, an English tutorial or forum post; you'll have to find the equivalent in your language.

The lesson is: you should not localize everything. There is a reason that scientific symbols and programming languages are international, and it was actually a huge step forward to stop using local units of measurement and to switch to international ones. Sometimes it seems to me that with all the localization and internationalization frenzy going on in a lot of corporations, we are going back to the time when you couldn't be sure that a pound of grain was still the same amount in the next village.

But luckily it's just the interface…

One thing they got right in MS Office, however, is abstraction from the interface. There is the concept of the decimal point – and if you open the thing in America, it's a point, if you open it in France it's a comma. There's the concept of a date, and you can display it in different formats for different locales. That's good design…
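To make the idea concrete, here is a minimal Python sketch of that separation: the value is stored once, and only the display step knows about locales. The CONVENTIONS table and both function names are made up for illustration and cover just two locales; nothing here reflects how Office actually implements it.

```python
from datetime import date

# Hypothetical, hand-written display conventions for two locales.
CONVENTIONS = {
    "en_US": {"decimal": ".", "date": "%m/%d/%Y"},
    "fr_FR": {"decimal": ",", "date": "%d/%m/%Y"},
}

def display_number(value: float, loc: str) -> str:
    """Format a stored number for display; the stored value never changes."""
    text = f"{value:.2f}"  # neutral form, always uses "."
    return text.replace(".", CONVENTIONS[loc]["decimal"])

def display_date(value: date, loc: str) -> str:
    """Format a stored date using the locale's day/month order."""
    return value.strftime(CONVENTIONS[loc]["date"])

price = 3.5
day = date(2007, 3, 7)
for loc in ("en_US", "fr_FR"):
    print(loc, display_number(price, loc), display_date(day, loc))
```

The same stored values come out as 3.50 and 03/07/2007 for en_US, and as 3,50 and 07/03/2007 for fr_FR.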

…except if you export to csv…

Great. So when we save a spreadsheet as csv, the numbers are surely saved in some internationally compatible format? Wrong. All those locale-specific decimal separators and so on are hard-coded into the file. Yes, hard-coded. At some point, they must have discovered that, oh, that's a problem, because a lot of languages use a comma as the decimal separator. Of course, we can enclose a string in double-quotes to prevent the commas it may contain from being treated as delimiters, but do we want to enclose each and every number in double-quotes? No…

…so they made the delimiter locale-dependent

At some point in every totally flawed endeavour there is a moment of hesitation at which people have to decide either to recognize that their entire strategy is completely misguided or to follow it blindly through to its most absurd and horrible consequences. With csv, they chose the second path. When I create a csv with my German Office, I get semicolons as delimiters. It's the same with a French Office.
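A small Python sketch shows what this means in practice. It mimics the German-style export described above (decimal comma, semicolon delimiter) with the standard csv module, then reads the result the way a program expecting plain comma-separated values would. The sample rows and function names are invented for the example.

```python
import csv
import io

rows = [["Widget", 1.5], ["Gadget", 1234.25]]  # made-up sample data

def export_german_style(rows):
    """Mimic the export described above: decimal comma, semicolon delimiter."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    for name, price in rows:
        writer.writerow([name, str(price).replace(".", ",")])
    return buf.getvalue()

def export_neutral(rows):
    """One possible locale-neutral alternative: comma delimiter, decimal point."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

german = export_german_style(rows)
neutral = export_neutral(rows)

# A reader that assumes commas splits the German file in the wrong places.
misread = list(csv.reader(io.StringIO(german)))

print(german)
print(neutral)
print(misread)
```

The misread rows come out as ['Widget;1', '5'] and ['Gadget;1234', '25']: the name and the integer part of the price end up glued together in one field, and the decimals land in another.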

Wait, what was the purpose of csv again?

The traditional purpose of text-based formats is to

It can be argued that with the growing importance of the network, the second purpose is far more important than the first one. However, by its locale-specific nature, csv is totally unsuited to it.

A locale-specific file format?

Therefore, csv is actually a locale-specific file format. How crazy is that? Imagine if the characters that enclose a tag in XML were locale-specific!
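The practical consequence is that whoever reads the file has to be told, by some other channel, which conventions were used to write it. A minimal Python sketch, assuming a two-column file of names and prices (read_prices and its parameters are invented for this example, and thousands separators are ignored):

```python
import csv
import io

def read_prices(text, delimiter, decimal):
    """Parse name/price rows; the caller must already know both conventions."""
    result = []
    for name, price in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Normalize the decimal separator before converting to a number.
        result.append((name, float(price.replace(decimal, "."))))
    return result

# The same data, as written under two different sets of conventions.
from_german_office = "Widget;1,5\n"
from_us_office = "Widget,1.5\n"

print(read_prices(from_german_office, delimiter=";", decimal=","))
print(read_prices(from_us_office, delimiter=",", decimal="."))
```

Both calls yield the same data, but only because the right delimiter and decimal separator were supplied by hand. Nothing in the file itself says which ones apply.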

Then, it has its own concept of escaping (which is far more complex than traditional escaping, where you put a backslash in front of characters that have a special meaning). If a cell contains the (locale-specific!) delimiter, it is enclosed in double-quotes. If a cell contains a double-quote, it is enclosed in double-quotes and the literal double-quote is doubled: a cell containing "Hi", he said. becomes """Hi"", he said." and a cell containing "" becomes """""".
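Spelled out in Python, the rule looks roughly like this. quote_field is a hand-written illustration of the two cases described above, and the second function asks Python's standard csv module to quote the same cell for comparison. Line breaks inside cells, which also force quoting, are left out of the sketch.

```python
import csv
import io

def quote_field(field: str, delimiter: str = ",") -> str:
    """Quote a cell only if it contains the delimiter or a double-quote."""
    if delimiter not in field and '"' not in field:
        return field
    # Double every literal quote, then wrap the whole cell in quotes.
    return '"' + field.replace('"', '""') + '"'

def quote_with_csv_module(field: str, delimiter: str = ",") -> str:
    """Let the standard csv writer quote the same single cell."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerow([field])
    return buf.getvalue().rstrip("\n")

for cell in ['"Hi", he said.', '""', "a;b"]:
    print(quote_field(cell), quote_with_csv_module(cell))
```

Note that the cell a;b is left alone when the delimiter is a comma but must be quoted when the delimiter is a semicolon, so even the escaping depends on the locale.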

What were they thinking?!

Summing it up

This goes to show how they first thought too much about localization: in an attempt to be overly locale-conscious, they even made the csv format locale-dependent. At the same time, they thought too little about it, in that they failed to see that a file format should be universal, even more so if it has been devised to facilitate exchanging information.

Categories: Localization and Internationalization

Comments

nice rant!
(and useful to have a site which mentions the use of a ; csv separator in locales with a , as a decimal)

By james roberts | Wednesday, 07 March, 2007

Christian Flury

World 0.1