In computing, the area of character encodings is almost a complete mess.

  • Different systems and different regions use different default character sets.
  • Different fonts provide glyphs for only a subset of the possible characters.
  • Unicode comes in a few different encodings.
  • Systems don’t tell you where errors are.
  • Common libraries often provide no visual feedback or exception if the incoming bytes are in fact impossible (and therefore wrong) for the expected character encoding.
  • And, to top it off, professional software engineers and IT people usually have had absolutely no training at their schools or universities on character codes: they may be exposed to all sorts of information about floating point and two's-complement numbers, but nothing on characters.

There are two underlying architectural issues that are almost completely unresolved:

  1. Systems and technologies are developed either for integrity (i.e. they refuse to accept, process or produce data that is wrong) or for resilience (i.e. they will try to do their best even in the face of errors): XML is an example of the former, HTML of the latter.
  2. Almost all systems or technologies provide no mechanism for allowing encoded character data to state what encoding it is.  Apple’s old file system provided a resource fork for files that could contain this, but that was a long time ago. People still use APIs that have no capability to pass on this information; there is an envelope but no postal service.  Even if the developers knew what they were doing, they have no API support.

I said almost all, because there is one notable exception: XML.  It, still uniquely, says that you must specify the encoding your file uses (unless it is UTF-16 or UTF-8). And if you make a mistake, chances are the encoding detection will allow the problem to be picked up.  It has worked well.  (The only other practical choice for a technology is to insist on ASCII or UTF-8 only.)

So what are my top 3 hints for developers when faced with encoding problems?

  1. Prevention is better than cure.
    • Use UTF-8 for external character data.  Audit your Java etc. code to make sure that every time you read or write characters or bytes, you specify the encoding explicitly. Never leave it to the default platform encoding (unless you know that your system has to communicate with other applications that use the default).
    • In your unit tests, include a simple test on any byte array that supposedly contains UTF-8 characters, to make sure there are no isolated non-ASCII bytes (> 0x7F, or < 0 if your language treats bytes as signed) with ASCII bytes both before and after them: UTF-8 does not allow this, and it is a simple check to code.
    • In your programs, only use ASCII characters in strings or constants.  (You can use whatever characters you like in comments, and in identifiers too, because a problem there will show up as a compilation error, but errors in strings and constants may not be caught.)  Each computer language has some convention for numeric character escapes; use hex if possible.
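The isolated-byte sanity check above can be sketched in Java roughly as follows (a quick heuristic, not a full UTF-8 validator; note the explicit StandardCharsets.UTF_8 rather than the platform default, and the hex escapes keeping the source ASCII-only):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // In valid UTF-8, a non-ASCII byte can never appear on its own between
    // two ASCII bytes, because every multi-byte sequence consists of at
    // least two consecutive bytes with the high bit set.
    static boolean hasIsolatedNonAsciiByte(byte[] bytes) {
        for (int i = 1; i < bytes.length - 1; i++) {
            boolean prevAscii = (bytes[i - 1] & 0x80) == 0;
            boolean currAscii = (bytes[i] & 0x80) == 0;
            boolean nextAscii = (bytes[i + 1] & 0x80) == 0;
            if (prevAscii && !currAscii && nextAscii) {
                return true; // impossible in UTF-8
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // \u00EF is i-with-diaeresis ("naive" with the accent).
        byte[] good = "na\u00EFve".getBytes(StandardCharsets.UTF_8);      // 0xC3 0xAF pair
        byte[] bad  = "na\u00EFve".getBytes(StandardCharsets.ISO_8859_1); // lone 0xEF byte
        System.out.println(hasIsolatedNonAsciiByte(good)); // false: could be UTF-8
        System.out.println(hasIsolatedNonAsciiByte(bad));  // true: cannot be UTF-8
    }
}
```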
  2. Have the right tools available.
    • You simply cannot tell anything reliable about what bytes are present in a file using a text-based tool.  You need a hex editor: I use HxD on Windows and GHex on Linux.  If you do not use a hex editor, you will be given misinformation, with characters shown in irrelevant character sets: having to keep potentially three different character encodings in mind (the expected, the actual, and the one the application shows) is a nightmare.
      Have an awareness of the basic characteristics of the different encodings. UTF-16 will frequently begin with the bytes 0xFF 0xFE, and for Latin-script text will often have a 0x00 byte in every second position.  UTF-8 will never have an isolated non-ASCII byte; it takes two bytes (each > 0x7F) for most accented European characters, and three bytes for East Asian characters. Then you can use the hex editor efficiently.
    • There are websites to help you convert between encodings. For example, http://www.ltg.ed.ac.uk/~richard/utf-8.cgi   lets you convert code points to UTF-8 and tells you what character is being used.
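These fingerprints are easy to confirm for yourself; here is a small Java sketch (the byte order mark shown reflects Java's big-endian default for the generic UTF-16 charset, whereas Windows tools usually write the little-endian 0xFF 0xFE):

```java
import java.nio.charset.StandardCharsets;

public class EncodingFacts {
    public static void main(String[] args) {
        // Java's "UTF-16" charset prepends a big-endian byte order mark.
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
        System.out.printf("%02X %02X %02X %02X%n",
                utf16[0] & 0xFF, utf16[1] & 0xFF, utf16[2] & 0xFF, utf16[3] & 0xFF);
        // prints: FE FF 00 41 -- note the 0x00 paired with each Latin character

        // UTF-8 byte counts: 1 for ASCII, 2 for accented European letters,
        // 3 for East Asian characters (hex escapes keep this file ASCII-only).
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1
        System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length); // 2 (e-acute)
        System.out.println("\u4E2D".getBytes(StandardCharsets.UTF_8).length); // 3 (a CJK character)
    }
}
```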
  3. Beware of secret “text” modes and gotchas.
    • Older software and protocols, such as FTP, had a distinction between text and binary mode.  When a file was transferred in text mode, the systems were free to massage the data to suit the receiving platform, changing line-end conventions or even the encoding.  This is why you send XML with content type application/xml, not text/xml, let alone text/plain.  But many software systems still have text modes that are not advertised, so prudently try to handle text data as binary data (especially XML), unless there is some good reason otherwise.  Even simple things like avoiding the extension .txt for your XML data can help.
    • Make sure you are using a font that has the glyphs for the characters you need when you eventually look at the non-ASCII text.   (But beware that if you are not using a hex editor, you are at the whim of the application’s developers as to whether what you see makes sense.)
    • Test at intermediate points in a processing chain.  It is perfectly feasible to create a system that works when processing non-ASCII characters from your own locale, only to find it falls over on foreign data.  Your end-to-end test could happen to work by happenstance, while each part of the chain is bogus.  I have seen projects fail because of this.
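As a concrete instance of the “treat text as binary” advice above, here is a minimal Java sketch: the XML payload goes in and out as opaque bytes, with no Reader/Writer in the path, so nothing gets a chance to massage line ends or re-encode it. (The file name and content are just placeholders.)

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BinarySafeCopy {
    public static void main(String[] args) throws Exception {
        // Encode once, explicitly, at the edge; from here on it is bytes only.
        byte[] xml = "<doc>\u00E9</doc>".getBytes(StandardCharsets.UTF_8);

        Path tmp = Files.createTempFile("payload", ".xml");
        Files.write(tmp, xml);                         // raw bytes in
        byte[] roundTripped = Files.readAllBytes(tmp); // raw bytes out

        System.out.println(Arrays.equals(xml, roundTripped)); // true: untouched
        Files.delete(tmp);
    }
}
```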

To give you an example from today: I have been developing a web service which accepts a kind of text file and produces a large XML file. It seemed to work fine, but suddenly we got reports that testing on a Windows client against a Linux server produced mangled characters. It worked fine on my development client, which is Linux.

How did I work out the solution?

  • I tried to replicate the problem on my Linux systems, but was unable to.
  • I identified the characteristics of the error. I took exactly the same input file that was used: I got them to ZIP up the input file and send it, to prevent it being treated as text by any intermediate mail etc. system. (#3)  I used the hex editor to look at the bytes of my output file and the bytes of the problematic output file.  (#2)  I used the website above to work out what the bytes in fact signified.
  • What I found was that the naughty file was encoded in UTF-16. What could be doing that?
  • I reviewed my code to make sure I was not using default output encodings, and that the HTTP headers were set correctly. (#1)  No problem.
  • So it seemed the server was OK, which left the client.  Could curl be operating differently on Windows than on Linux?  First I checked whether there were any command-line options for making sure that output is written as UTF-8. Nothing relevant. I got them to try running under CMD rather than PowerShell, in case PowerShell’s heuristics got in the way. No change.  So finally, I got them to try using the -o output option of curl rather than the redirect.  Success.

So this was a #3 kind of gotcha, but it took taking #1 and #2 seriously to track it down.

To be more exact, here was the curl command that messed up the encoding on Windows:

curl -X POST --header "Content-Type:application/octet-stream" -T test.enr "http://localhost:7001/myApp/myService" > test.xml

And here is what worked properly on Windows:

curl -X POST --header "Content-Type:application/octet-stream" -T test.enr "http://localhost:7001/myApp/myService" -o test.xml

(Why does this gotcha occur? Who knows. Perhaps it is something to do with this?)