Unicode and Characters

One of the things I learned at the W3C, particularly when working on the various XML Canonicalization specifications, was that dealing with characters isn’t as easy as it might seem. Joel’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) reminded me of my useful conversations with Martin Dürst, and of my readings of Charset considered Harmful and Unicode Transformation Formats via Unicode TR#17. Joel does a good job of explaining the issues, but when reading specifications, one is also likely to come across various confusing terms. Here’s my crib sheet:

Character Repertoire (CR) = a set of abstract characters

Coded Character Set (CCS) = a mapping of code values (also called code points or code positions; together they make up the code space) to a Character Repertoire

Character Encoding Scheme (CES) = a scheme for representing a character repertoire in a code space. Frequently |CR| > |code space|, so one has to use various extensions and escapes to represent the extra characters. UTF-8 is a CES.

Charset = CCS + CES
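
To make the terms above a bit more concrete, here is a rough Python sketch. It is only an illustration under some assumptions: Python's str type stands in for a sequence of characters from the Unicode repertoire, ord() stands in for the CCS lookup, and str.encode() stands in for applying a CES (or, for ISO-8859-1, a whole charset). The sample string and the variable names are mine, not taken from any specification.

```python
# Illustrative sketch: Python built-ins stand in for the abstract terms.
text = "Dürst"

# CCS view: each abstract character maps to an integer code value.
code_points = [ord(ch) for ch in text]

# CES view: the same code values serialized to bytes in different ways.
utf8_bytes = text.encode("utf-8")        # "ü" takes two bytes here
utf16_bytes = text.encode("utf-16-be")   # every character here takes two bytes

# Charset view: ISO-8859-1 bundles its own small CCS with a one-byte encoding.
latin1_bytes = text.encode("iso-8859-1")

print([f"U+{cp:04X}" for cp in code_points])
print(utf8_bytes.hex(), len(utf8_bytes))
print(utf16_bytes.hex(), len(utf16_bytes))
print(latin1_bytes.hex(), len(latin1_bytes))
```

The same five characters come out as different byte sequences of different lengths, which is why a byte count and a character count cannot be used interchangeably.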
