
Guys – and by guys, I mean the developers of about 90% of the emailing routines behind the avalanche of friendly emails from one of my mobile phone carriers, some well-known online retailer (I don't remember whether it was eBay or Amazon) and a bunch of marketing communications from different shops and corporations, some of which don't even have the excuse that they primarily target a pure-ASCII American public – guys, I've got a nice Conway quote for you:

[...] if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and open your regexes to the full Unicode smörgåsbord [...].

Well, actually that's the part which, as it appears, most people seem to have got by now (provided that they check user input at all, that is). However, it also seems “rather déclassé for an überhacking rōnin”, to quote Conway once again, to first ambitiously open up their regexes to the world and then leave those poor Unicode characters to their fate when emailing them in plain text over a 7bit ASCII mail server. Ever heard of base64? Or Quoted Printable encoding for that matter?

Sending plain-text Unicode email

What startles me about the amount of email with corrupted non-ASCII characters I've been getting lately is that it would be so easy to send Unicode email in an encoding that almost all email clients can read.

The issue is that email is still transferred as ASCII, often as 7bit ASCII. So when accented or foreign characters (or even an attached file) have to be transferred, they have to be encoded in some way. In other words, if you use a character encoding other than ASCII for your content – and you should – you have to specify not only that character encoding (the charset) but also a content transfer encoding whose output can be represented in pure ASCII. The most obvious strategy is to ask: what is the biggest power of 2 that can be represented by printable 7bit ASCII characters? It's not 128 but 64, because not all ASCII characters are printable. Therefore, at first sight, it would seem best to take the binary values of the data you want to transfer and express them in base 64. This is indeed how attachments are transferred, and it's the most efficient way of transferring email in languages that hardly use ASCII characters at all.
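
To make the base64 idea a bit more tangible, here is a minimal sketch in Python (picked purely for illustration; it is not what this website runs). It assumes the content is encoded as UTF-8 and uses the standard base64 module; the sample word is simply borrowed from the Conway quote.

```python
import base64

# "Götterdämmerung", written with escapes so the source file itself stays ASCII
text = "G\u00f6tterd\u00e4mmerung"

# Step 1: apply the character encoding (assumed here: UTF-8)
raw = text.encode("utf-8")

# Step 2: apply the transfer encoding; every 3 bytes become 4 ASCII characters
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))   # 17 24
print(encoded.decode("ascii"))

# The receiving side reverses both steps
assert base64.b64decode(encoded).decode("utf-8") == text
```

The 17 bytes grow to 24 characters, which is the roughly one-third overhead you pay for squeezing arbitrary bytes into 64 printable characters.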

This approach, however, creates unnecessary overhead when most of your content consists of ASCII characters anyway: base64 inflates everything by about a third and makes even the plain ASCII parts unreadable in the source. In this case, one usually uses Quoted-printable. In this encoding, printable ASCII characters are sent as they are, and only the bytes of non-ASCII characters (plus the equals sign itself) are escaped as an equals sign followed by two hexadecimal digits. It is the best solution for most Western European languages (in fact, it's the encoding that I use if you send me an email from this website, except for the headers, where I use base64). Furthermore, it comes with the added benefit that even the source remains human-readable in most languages that use an extended version of the Latin alphabet.
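
For comparison, a small Python sketch of Quoted-printable, again assuming UTF-8 content and using the standard quopri module (an illustration only, not the code behind this website):

```python
import quopri

# "naïve façade", written with escapes so the source file itself stays ASCII
text = "na\u00efve fa\u00e7ade"

# Encode the UTF-8 bytes as Quoted-printable
encoded = quopri.encodestring(text.encode("utf-8"))
print(encoded)   # b'na=C3=AFve fa=C3=A7ade'

# Decoding gives back the original bytes
assert quopri.decodestring(encoded).decode("utf-8") == text
```

Only the two accented letters are escaped (as the hexadecimal values of their UTF-8 bytes), so the result is still readable at a glance.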

Things are a bit (but just a bit) more complex when it comes to the headers – to learn more, I would refer you to the excellent article on Wikipedia.
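
As a rough illustration of the header case: non-ASCII header values are wrapped in so-called encoded words (RFC 2047) of the form =?charset?encoding?text?=, where the encoding is B for base64 or Q for a Quoted-printable variant. The sketch below uses Python's standard email.header module and assumes UTF-8; the library decides by itself which of the two encodings to use, so the exact output may differ from what other tools produce.

```python
from email.header import Header, decode_header

# "Smörgåsbord" as a hypothetical subject line
subject = "Sm\u00f6rg\u00e5sbord"

# Wrap the value in an RFC 2047 encoded word
encoded = Header(subject, "utf-8").encode()
print(encoded)

# Decoding yields the raw bytes plus the charset named in the encoded word
raw, charset = decode_header(encoded)[0]
assert raw.decode(charset) == subject
```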

It's even easier

What's nice is that you don't even have to think about these technicalities, since most programming languages have libraries for these tasks. For example, in Perl, provided you specify the correct headers, it's just a matter of loading the MIME::Base64 and MIME::QuotedPrint modules with use and calling the exported subroutines to convert your content before sending it. (It might be even easier with an emailing module – on this website, I am just using sendmail.)
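
To show how little work this is with a library, here is a sketch using Python's standard email package rather than the Perl modules mentioned above (Python is simply my choice for the illustration). The addresses and texts are made up, and nothing is actually sent; the snippet only builds the message and checks that what would go over the wire is pure ASCII.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# "Schöne Grüße aus Zürich!", written with escapes so the source stays ASCII
body = "Sch\u00f6ne Gr\u00fc\u00dfe aus Z\u00fcrich!\n"

msg = EmailMessage()
msg["From"] = "sender@example.com"       # placeholder address
msg["To"] = "recipient@example.com"      # placeholder address
msg["Subject"] = "Gr\u00fc\u00dfe"       # non-ASCII header value

# Charset and transfer encoding are set in one call; the library writes
# the matching Content-Type and Content-Transfer-Encoding headers.
msg.set_content(body, charset="utf-8", cte="quoted-printable")

# This is what would be handed to the mail server: ASCII only
wire = msg.as_bytes()
print(wire.decode("ascii"))

# A receiving client can restore the original text
parsed = message_from_bytes(wire, policy=policy.default)
assert parsed.get_content().strip() == body.strip()
```

The headers and the body are encoded separately, but the library takes care of both, so there is no need to call the conversion routines by hand.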

Therefore, it seems funny to me that I am still getting so many corrupted email messages from companies that, honestly, should know better. You might say it's not a big deal because you're still able to read them, but still...

One odd observation

By the way, has anyone else noticed that Outlook puts the sender's name in double quotes when it is transferred in base64? (It's only Outlook, no other client that I am familiar with.) Why is that?

Now that I think about it...

Now that I think about it, some of these corrupted email messages were even HTML – it seems that some mailing programs have encoding issues in general.

Categories: Localization and Internationalization; Web Development and Programming

Christian Flury

World 0.1
