
We need to add some glossary entries for the character set and encoding terminology that we use quite a bit.

I would note that our usage of the term 'codepoint' differs somewhat from the Unicode Glossary: http://unicode.org/glossary. First, we use codepoint as one word, not "code point" (there was some inconsistency on this that I have now fixed). Second, what we call a codepoint is closer to what the Unicode Glossary calls a 'code unit'. I suspect we should just provide our own definitions rather than switching terms, but I'm open to converting all uses of codepoint to "code unit" if we prefer that.

Encoding - See *Character Set Encoding*.

Codepoint - When a *character set encoding* uses *variable width* representations for characters, the units making up these variable width representations are called codepoints. For example, the UTF-8 encoding uses between 1 and 4 codepoints to represent a character, and for UTF-8 the codepoints are single bytes. The UTF-16 encoding is either fixed or variable width. When dfdl:utf16Width='variable', this encoding uses either one or two codepoints per character, and each codepoint is a 16-bit value. When a character set encoding is fixed width, there is no distinction between a codepoint and a character code.

Code page - An alternate identifier for a *Character Set Encoding*.

Character Code - The numeric value assigned to a character in a character set that is independent of any specific encoding of that character set. For any fixed-size encoding (all characters have the same size representation)

Character Set - An abstract set of characters independent of any specific encoding scheme. Examples are the Unicode character set and the USASCII character set.

Character Set Encoding - A specific representation of a character set as bytes or bits of data. A character set encoding is usually identified by a standard character set name, a recognized alias name, or a *code page* identifier. These identifiers are standardized by the IANA. Examples are UTF-8, USASCII, GB2312, ebcdic-cp-it, ISO-8859-5, UTF-16BE, and Shift_JIS. The DFDL standard allows implementation-specific character set encodings to be supported, and standardizes one DFDL-specific name, which is USASCII-7bit-packed.

Character Width - The number of codepoints or bytes used to represent a character in a specific character set encoding is called the character width. Encodings are either fixed width (all characters are encoded using the same width) or variable width (different characters are encoded using different widths). For example, the UTF-32 character set encoding has a 4-byte character width, whereas USASCII has a 1-byte character width.

Fixed-Width Character Encoding - A character set encoding where all characters are encoded using a single codepoint for their representation. Note that a codepoint may take up one or more bytes.

Surrogate Pair - A Unicode character whose character code value is greater than 0xFFFF can be encoded in UTF-16BE or UTF-16LE, which are variable-width encodings when the DFDL property utf16Width='variable'. In this case the representation uses two adjacent *codepoints*, each of which is called a surrogate, and the pair of which is called a surrogate pair.

Variable-Width Character Encoding - A character set encoding where characters are encoded using one or more codepoints for their representation, depending on which specific character is being encoded. An example is UTF-8, which uses from 1 to 4 bytes to encode a character.

...mike

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
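
As a rough illustration of the Codepoint, Character Width, and Surrogate Pair entries, here is a minimal Python sketch. It uses "codepoint" in the sense defined in this message (what the Unicode Glossary calls a code unit). The function names codepoints_per_char and surrogate_pair are hypothetical names chosen for this example, not anything from the DFDL specification, and the sketch assumes Python's standard "utf-8", "utf-16-be", and "utf-32-be" codecs behave like the corresponding encodings discussed above.

```python
# Size in bytes of one codepoint (code unit) for each encoding.
# These three codec names are standard Python codecs; the table itself
# is an illustrative assumption, not part of DFDL.
UNIT_SIZE = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}


def codepoints_per_char(ch, encoding):
    """Return how many codepoints (code units) encode the single character ch."""
    encoded = ch.encode(encoding)
    return len(encoded) // UNIT_SIZE[encoding]


def surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogates for a character above 0xFFFF."""
    code = ord(ch)  # the character code, independent of any encoding
    if code <= 0xFFFF:
        raise ValueError("character fits in a single 16-bit codepoint")
    offset = code - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
    for ch in samples:
        widths = {enc: codepoints_per_char(ch, enc) for enc in UNIT_SIZE}
        print(f"U+{ord(ch):04X}", widths)

    high, low = surrogate_pair("\U0001F600")
    print(f"surrogates: 0x{high:04X} 0x{low:04X}")
```

In UTF-8 and variable-width UTF-16 the count varies by character, while in UTF-32 it is always one, which is the fixed-width case where a codepoint and a character code coincide.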