For people who are part of a creation process, as is the case for us linguists, it often proves helpful to have at least a basic idea of the other aspects involved in that process, even if, strictly speaking, they are outside our own responsibility. Since computers have become such an integral part of our lives and working environments, computer-related aspects of multilingual content should be of interest to anyone involved in a translation process. This knowledge may not be directly required for our linguistic work, but it can greatly facilitate collaboration and communication with engineers, graphic designers, etc. and make it easier to identify and analyse obstacles that we may encounter from time to time. Yes, you guessed right: this article is dedicated to the horribly dull and nasty topic of text encodings and, in particular, Unicode.
Since this article has grown somewhat long, I have broken it down into four main chapters, followed by a short reading list:
- How do computers deal with text?
- Regional character sets and the Unicode standard
- Practical guidelines for trouble-shooting
- Translating the term “encoding”
- Further reading
How do computers deal with text?
What's a character?
Sounds like an easy question to ask, doesn't it? Is it not just a visual sign used in combination with other visual signs to represent natural language in written form? Or the smallest unit of a writing system?
To better understand how text is processed by computers, I would encourage linguists to think differently and make a clear distinction between a character (an abstract atomic unit in language notation) and a glyph (the visual representation thereof). At university, we have all learnt about the semiotic triangle and how to differentiate between concept (the idea of an abstract or actual object) and term (the word or phrase used to refer to that idea). By analogy, you could think of a character as some sort of concept and of a glyph as a visual “term” used to express that concept. Systems that link a given set of glyphs to the characters they express are known as fonts.
While these considerations might sound a bit convoluted and academic, their practical implications can be of great help in trying to troubleshoot encoding issues as we will see later on.
Character sets and text encodings
As we all know, at the lowest level, computers only store ones and zeros. This means that in order to store written language, the series of characters of which a text consists need to be expressed in ones and zeros. This is where character sets and their encodings come into play.
The first step in expressing characters as bits and bytes is to assign each character a unique non-negative integer (number), referred to as that given character's code point. These assignments are defined in a character set or code page. For example, in most character sets the Latin capital letter C corresponds to the numeric value 67.
The second step, put simply, consists in representing this code point (i.e. this number) as ones and zeros using a logic that makes it unambiguous where one character ends and the next one begins and that respects the limitations and conventions of the system's storage capabilities. This process is referred to as encoding.
Confused?
Before we go on, let us sum up what we have discussed so far in a simple diagram.
In other words, we have the abstract concept of a character. This character is presented to the human eye using a corresponding glyph (i.e. visual representation) provided by the used font. In order to be processed and stored by computers, that same character is assigned a number (code point) according to a character set or code page. Eventually, this number has to be represented in bits and bytes in an unambiguous way (encoding).
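For readers who like to try things out, the chain from character to code point to bytes can be sketched in a few lines of Python. This is only an illustration, not part of any particular tool; the variable names are my own, and ASCII is chosen as the encoding simply because it is the simplest case.

```python
import unicodedata

character = "C"                      # the abstract character
code_point = ord(character)          # the number assigned to it by the character set
encoded = character.encode("ascii")  # that number represented as bytes

print(unicodedata.name(character))   # the character's official name
print(code_point)                    # the code point as a decimal number
print(format(encoded[0], "08b"))     # the single byte, written as ones and zeros
```

Note that the glyph does not appear anywhere in this chain: how the letter looks on screen is decided by the font, not by the code point or the bytes.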
Regional character sets and the Unicode standard
ASCII
The first widely adopted character set and encoding was ASCII, sometimes referred to as US-ASCII. It was a 7-bit character set and therefore offered room for 128 (2 to the power of 7) characters, basically those that you would find on the keyboard of an English typewriter. Since computers like to organize information in units of 8 bits (bytes or, strictly speaking, octets), each character usually corresponded to one octet whose first (most significant) bit was set to zero.
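The arithmetic above is easy to check. The following minimal Python sketch (the names are mine, chosen for illustration) confirms that 7 bits give 128 values, that every octet of an ASCII text stays below that limit, and that a character outside the repertoire cannot be encoded at all.

```python
ASCII_SIZE = 2 ** 7  # number of values that fit into 7 bits

data = "Hello, world!".encode("ascii")

# In ASCII text, the most significant bit of every octet is zero,
# so every byte value is smaller than 128.
high_bits_clear = all(byte < ASCII_SIZE for byte in data)

# A letter with a diacritical mark is not part of the ASCII repertoire.
try:
    "café".encode("ascii")
    accent_fits = True
except UnicodeEncodeError:
    accent_fits = False

print(ASCII_SIZE, high_bits_clear, accent_fits)
```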
Regionally specific encodings
ASCII soon proved to be insufficient for international purposes because of the lack of support for diacritical marks or special characters, let alone Asian or otherwise exotic scripts (writing systems). This resulted in the advent of a wealth of encoding standards which extended ASCII, offering larger character repertoires in order to accommodate languages other than English, as for example the ISO standard 8859-1 (often referred to as “Western European” encoding).
While for most languages that make predominant use of the Latin script along with just a few special characters it was sufficient to resort to using the first (unused) bit of each octet so that one character still corresponded to one byte, a lot of non-European languages required a larger character repertoire and needed two bytes per character; hence the expression “double-byte languages” that is still heard from time to time in the localization industry to refer to (mostly Asian) languages with huge character sets.
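To make this more tangible, here is a small Python sketch. The choice of ISO 8859-1, ISO 8859-5 and Shift_JIS is mine and serves purely as an example; other regional encodings behave analogously. It shows that the same byte value stands for different characters depending on the code page, and that a Japanese character needs two bytes in a typical “double-byte” encoding.

```python
# One byte, two regional interpretations.
byte = bytes([0xE9])
western = byte.decode("iso-8859-1")   # Western European reading
cyrillic = byte.decode("iso-8859-5")  # Cyrillic reading of the very same byte

# A kanji needs two bytes in Shift_JIS, a Japanese regional encoding.
kanji_bytes = "日".encode("shift_jis")

print(western, cyrillic, len(kanji_bytes))
```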
Unicode
You may have guessed the problems that arose with the adoption of these regional encoding standards: They complicated the task of dealing with multilingual content and they made the process of localizing software and documents into languages that used different character sets complex, cumbersome and bug-prone.
The basic idea behind the Unicode standard (which emerged in the nineties) was to introduce encodings capable of accommodating a character repertoire (i.e. collection of characters) that would cover every conceivable language and script (writing system) in the world. (I shall spare you the complicated challenges that accompanied its evolution – you may want to look up UCS-2 and UCS-4, combining characters or Han unification if you are interested in the gory details.) Unicode defines a character set (UCS – Universal Character Set) designed for that purpose. The code points were assigned with backwards compatibility in mind – its first 256 code points are identical to ISO 8859-1 (Western European) whose first 128 code points, in turn, are identical to ASCII. For other scripts, the ordering of regional encodings was preserved to make re-encoding of existing content easier.
It is important to note that Unicode's Universal Character Set is exactly that: a character set, not an encoding. This means it assigns numbers to characters, but does not impose one particular encoding of that character set (i.e. a representation of the code points in bytes and bits). Previously, most regional character sets had also defined one particular encoding, which meant that the distinction between character set or code page and encoding was not as important as it became with Unicode.
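The backwards compatibility mentioned above can be verified with a short Python sketch. It relies on Python's built-in "iso-8859-1" and "ascii" codecs; the variable names are my own.

```python
# Decode every possible byte value as ISO 8859-1 and compare the resulting
# Unicode code points with the original byte values.
all_bytes = bytes(range(256))
latin1_matches = [ord(ch) for ch in all_bytes.decode("iso-8859-1")] == list(range(256))

# The lower half of ISO 8859-1 is, in turn, identical to ASCII.
lower_half = bytes(range(128))
ascii_matches = lower_half.decode("ascii") == lower_half.decode("iso-8859-1")

print(latin1_matches, ascii_matches)
```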
U+WTF?
When reading about Unicode or dealing with engineers, you may often come across notations such as U+004B to refer to a given character. The “U+” indicates that we are referring to a Unicode code point while what follows is the number assigned to the character in question in hexadecimal notation. For example, U+004B refers to Unicode character number 75 (4 x 16 + 11) which is the Latin capital letter K.
In computer science, hexadecimal notation is often preferred to decimal notation because, unlike 16 which is the base of the hexadecimal system, the base of the decimal system – 10 – is not a power of two. Moreover a two-digit hexadecimal number can have 256 different values – exactly as many as a byte (two to the power of eight since a byte has eight bits).
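If you ever need to convert between the U+ notation and the actual character, a sketch along the following lines will do. The two helper functions are hypothetical names I made up for this illustration; they are not part of any library.

```python
def parse_u_notation(label: str) -> str:
    """Turn a label such as 'U+004B' into the character it refers to."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))  # read the digits as a base-16 number


def to_u_notation(character: str) -> str:
    """Turn a single character into its U+ label (at least four hex digits)."""
    return f"U+{ord(character):04X}"


print(int("004B", 16))             # the decimal value of hexadecimal 4B
print(parse_u_notation("U+004B"))  # the character behind the label
print(to_u_notation("K"))          # and back again
print(int("FF", 16) + 1)           # how many values fit into two hex digits
```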
Unicode encodings: UTF-32, UTF-16 and UTF-8
The main decision to take with regard to the encoding of the Unicode character set concerns the trade-off between space requirements and complexity.
UTF-32 is a fixed-width encoding. This means every single character needs exactly 32 ones and zeros (4 bytes). This is simple, but creates a huge overhead when you are mostly dealing with Western characters. UTF-16 and UTF-8 are variable-width encodings: they require less storage for very common characters than for rare characters, but they are more complex. Furthermore, for text that only contains ASCII characters, UTF-8 is byte-identical to ASCII which increases backwards compatibility.
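The trade-off becomes obvious when you count bytes. The following Python sketch does so for four sample characters of my own choosing; the "-be" variants of the codecs are used so that no byte-order mark is added to the count.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(character: str) -> dict:
    """Return the number of bytes the character needs in each encoding."""
    return {name: len(character.encode(name)) for name in ENCODINGS}


# A plain Latin letter, a letter with a diacritic, the euro sign and an emoji.
for character in ("K", "ü", "€", "\U0001F600"):
    print(f"U+{ord(character):04X}", byte_lengths(character))

# Text that only contains ASCII characters is byte-identical in UTF-8.
ascii_identical = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(ascii_identical)
```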
Endianness – does your computer “think” in English or in German?
In the context of Unicode encodings, you will sometimes come across the terms “big-endian” or “little-endian”. They refer to a low-level technicality that we should not really care about as non-techies. Just to explain the concept for the sake of completeness: You probably know that, instead of “forty-two”, in German we would say “two-and-forty” (“zweiundvierzig”). When ordering the bytes in a multi-byte character, some computer systems think in English, others in German. This is basically what endianness means. Luckily, so far no computer system has been invented that would think in French, thus encoding its bytes along the lines of “quatre-vingt-quatorze” (“four times twenty and fourteen” = 94).
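For the curious, this is what the two byte orders look like for the letter K (U+004B) in UTF-16. It is a minimal Python sketch with variable names of my own.

```python
big_endian = "K".encode("utf-16-be")     # most significant byte first
little_endian = "K".encode("utf-16-le")  # least significant byte first

print(big_endian.hex(), little_endian.hex())

# Both byte sequences stand for the same number, provided the reader
# knows which order was used.
same_number = (int.from_bytes(big_endian, "big")
               == int.from_bytes(little_endian, "little"))
print(same_number)
```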
BOM or Unicode signature
To disambiguate its endianness or byte order, a file may contain a BOM (byte-order mark) as its first character. This “invisible” character has the code point U+FEFF and, in principle, serves the sole purpose of defining the byte order or endianness of that file. However, it is also used as a marker indicating that the given file is encoded in UTF-8, UTF-16 or UTF-32, hence the alternative term Unicode signature.
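Checking a file for a Unicode signature is straightforward. The sketch below uses the BOM constants from Python's standard codecs module; the function name sniff_bom is hypothetical and only serves this example. The UTF-32 marks are tested first because the little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 mark.

```python
import codecs

# Longer marks first, so that UTF-32 is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("Hi".encode("utf-8-sig")))  # UTF-8 with signature
print(sniff_bom("Hi".encode("utf-8")))      # UTF-8 without signature
```

A missing BOM does not tell you anything about the encoding; it only means that the file does not announce it in this way.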
Misuse of the term “Unicode” as a synonym for UTF-16
Since this entire topic is not quite confusing enough as it is, at some point, someone (it is rumoured that it was Microsoft) started using “Unicode” as a synonym for UTF-16. As we know, this is nonsense: Unicode is a standard that defines a character set which may be encoded using several encoding systems one of which is UTF-16. Unfortunately the illogical use of “Unicode” as a synonym for “UTF-16” has become widespread in text processing and other computer applications. This adds to the general confusion about encoding issues.
Practical guidelines for trouble-shooting
Character corruption
Today, the different Unicode encodings (in particular, UTF-8 and UTF-16 with or without BOM) enjoy widespread adoption, but some applications, database systems, frameworks and scripting languages are still lacking full support for them and rather use regional encodings. This means that when data is exchanged between applications or systems, text may have to be re-encoded in the background. Sometimes this goes wrong which causes character corruption.
This is why from time to time, when you find your elaborate and well-formatted translation published, you might notice in utter horror that special characters or special typographic marks have been lost or represented badly. While this is particularly common in e-mail communication (where, in addition, the problem of transfer-encoding comes into play), it may happen in a lot of other cases as well, and you don't want to see the result of your hard work mutilated.
Trouble-shooting
I have often seen people manually re-enter or correct text that had been damaged. This is mostly unnecessary and it pays off to approach the problem analytically. Therefore, here are some questions you may want to ask (yourself or the client's engineer or graphic designer) when you come across garbled characters.
- Remember what we said about the difference between characters and glyphs? In a lot of cases, the characters may be intact, but the font that is used to represent them may not offer glyphs for some of them. Therefore, the first thing to do is to try some other fonts in order to find out if this might be the case. This is the easiest problem to fix. In such cases, depending on the application, the characters that cannot be represented may be visualized as question marks, little squares or not at all.
- Does the text look mostly alright, except that most of the special characters have been replaced with different ones? This presumably means that one particular one-byte encoding has been interpreted as a different one-byte encoding or as UTF-8. Depending on the complexity of the publishing process, you may not be able to fix this yourself, but you can give the engineer a valuable hint.
- Same as above, but some special characters have been replaced with a sequence of two different special characters rather than just one different special character. This might mean that UTF-8 has been interpreted as a regional one-byte encoding or it may indicate a double-encoding issue. For instance, your text may have been re-encoded from Eastern European to UTF-8 at some point in the process. Later on, it may have been interpreted as Eastern European and re-encoded again. Again, this indicates a bug in the process about which you cannot do anything yourself, but when you spot this phenomenon, you may give the technical experts valuable hints. Note that double-encodings cannot always be reversed – let's hope there is a backup of an uncorrupted version.
- Does each file have one weird character or a sequence of weird characters at the start? The reason is most probably that the file was encoded using a BOM (byte-order mark), but some application used in the process did not understand byte-order marks.
- Have all special characters disappeared? There may be one piece of software in the process that can only deal with ASCII, ignoring all non-ASCII characters. This is, for instance, the case of some notification emails sent by Yahoo Groups. Also, it often happens when semi-professional PHP scripts are involved.
- Does the entire text look weird and partly corrupted and there is a lot of additional white-space between individual characters? At some point, UTF-16 might have been misinterpreted as UTF-8.
- Unable to use or open a UTF-8 or UTF-16 file in a particular context? Note that some applications (especially on the Windows platform) explicitly require a BOM while others (especially under Linux) explicitly forbid it and cannot deal with it.
- Do you get funny character sequences along the lines of a pattern as in “Gr=C3=BC=C3=9Fe=” (which should read “Grüße”)? At some point, the content was encoded in Base64 or Quoted-Printable without being decoded correctly afterwards. See my article Ever heard of base64? for details.
- In emails and on web pages, there's sometimes a very simple explanation for corrupted characters: the server or the web page or email themselves may be specifying a wrong encoding – for example, when a web page is encoded in UTF-8, but it claims that it is encoded in ISO 8859-2.
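Several of the symptoms in the list above can be reproduced on purpose, which is a good way of learning to recognize them. The following Python sketch is an illustration under my own assumptions: Windows-1252 ("cp1252") stands in for “some regional one-byte encoding”, and the sample strings are arbitrary.

```python
import quopri

# A UTF-8 BOM read by an application that does not understand it:
# three stray characters appear at the very start of the text.
stray = "Hello".encode("utf-8-sig").decode("cp1252")

# An ASCII-only step that silently ignores everything else:
# all special characters disappear.
ascii_only = "Grüße".encode("ascii", errors="ignore").decode("ascii")

# UTF-16 read as UTF-8: every other byte is a NUL character, which many
# applications display as additional white-space between the letters.
spaced = "Hi".encode("utf-16-le").decode("utf-8")

# Quoted-Printable that was never decoded, and what decoding it yields.
decoded_qp = quopri.decodestring(b"Gr=C3=BC=C3=9Fe").decode("utf-8")

print(repr(stray), repr(ascii_only), repr(spaced), repr(decoded_qp))
```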
For translators it is of particular interest to note that the above-mentioned phenomena of UTF-8 being interpreted as a particular regional one-byte encoding and vice versa are a common cause of character corruption when exchanging translation memories between Trados 6 and Trados 7.
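The confusion between UTF-8 and one-byte encodings, in both directions, can be sketched as follows. Again, this is a hedged illustration rather than a description of what any particular tool does: Windows-1252 and ISO 8859-2 are my own choice of regional encodings, and repair_once is a hypothetical helper that only works if the wrong encoding is guessed correctly and no bytes were lost along the way.

```python
original = "Grüße"

# UTF-8 read as a one-byte encoding: each special character turns into two.
utf8_as_cp1252 = original.encode("utf-8").decode("cp1252")

# The same mistake made twice in a row (double-encoding).
double_encoded = utf8_as_cp1252.encode("utf-8").decode("cp1252")

# A one-byte encoding read as UTF-8: the special characters become invalid
# byte sequences, shown here as the replacement character U+FFFD.
cp1252_as_utf8 = original.encode("cp1252").decode("utf-8", errors="replace")

# One one-byte encoding read as another: same length, different characters.
wrong_page = "Łódź".encode("iso-8859-2").decode("iso-8859-1")


def repair_once(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo one round of 'UTF-8 bytes read as wrong_encoding'."""
    return text.encode(wrong_encoding).decode("utf-8")


print(utf8_as_cp1252, double_encoded, cp1252_as_utf8, wrong_page)
print(repair_once(utf8_as_cp1252))
print(repair_once(repair_once(double_encoded)))
```

In this tidy example even the double-encoding can be undone. In real life, bytes that the wrong encoding does not define, or replacement characters inserted along the way, often make that impossible.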
Still, you may find yourself thinking: “Great, but what has it got to do with me? After all, I am a linguist, not an engineer.” Whilst this is true, the engineers, editors or graphic designers involved in the process may not feel too comfortable with text encoding issues either. More importantly, if they do not speak the language in question, it may be harder for them to spot what has really gone wrong and they may appreciate some hints.
Translating the term “encoding”
I have noticed that the term encoding is sometimes mistranslated or misunderstood by linguists who confuse the two fundamentally different concepts of encoding (the representation of information in a form that can be stored or transmitted by computers) and encryption (the process of making this information inaccessible to third parties that do not possess a secret key required to decrypt it).
For example, the German Gmail interface (as of today, 18th of March, 2007) claims that you can choose some mysterious “default encryption” (“Standardtextverschlüsselung”) as your selected mode of “encryption” (“Verschlüsselung”) when sending email. Obviously, with all due respect for Google's spirit of innovation, email does not get “encrypted”, with or without Gmail. (What they meant to write was “Zeichenkodierung”, i.e. “text encoding”.) A typical case of “translating” without using your head, the Internet or the slightest bit of common sense, which results in Google possibly being seen as lying to its German-speaking users – the few who still trust these translations, that is.
You know, Google translators, there's this revolutionary new search engine that actually allows you to research terms and information on the Internet without even leaving the office, hey, you can even look up definitions of terms using the define: syntax (just try it with some random term such as, say, “Verschlüsselung”, you'll be astonished) – what was it called again? Gog... Goo... something like that. Anyway. Go figure it out, it's really great – that German guy told me it even offered a new secure technology for encrypted email transmission...
With this in mind, I hope this article will help save your work from the abyss of character corruption, and, more importantly, I hope your content is actually worth it.
Further reading
Update: Here are a few resources for people who want to learn more about the subject:
- Joel on Software, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Wikipedia article on Unicode
- Official homepage of the Unicode Consortium
- Table of the 65534 first Unicode characters along with their code points


