On Mon, Jan 28, 2013 at 10:57 AM, Mike Beckerle <mbeckerle.dfdl@gmail.com> wrote:

We need to add some entries associated with character set and encoding terminology that we use quite a bit.

I would note that our usage of the term 'codepoint' differs somewhat from the Unicode Glossary: http://unicode.org/glossary. First, we write codepoint as one word, not "code point" (there was some inconsistency on this that I have now fixed). Second, what we call a codepoint is closer to what the Unicode Glossary calls a 'code unit'. I suspect we should just provide our own definitions rather than switching terms, but I'm open to converting all uses of codepoint to "code unit" if we want to.

Encoding - See Character Set Encoding

Codepoint - When a character set encoding uses variable-width representations for characters, the units making up those representations are called codepoints. For example, the UTF-8 encoding uses between 1 and 4 codepoints to represent a character, and for UTF-8 the codepoints are single bytes. The UTF-16 encoding is either fixed or variable width. When dfdl:utf16Width='variable', this encoding uses either one or two codepoints per character, and each codepoint is a 16-bit value. When a character set encoding is fixed width, there is no distinction between a codepoint and a character code.
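To make the counts above concrete, here is a small Python sketch. It is only an illustration, not part of the proposed glossary text; the function name codepoints_per_character is hypothetical, and "codepoint" is used in our sense (the Unicode Glossary's 'code unit').

```python
def codepoints_per_character(ch: str, encoding: str, unit_bytes: int) -> int:
    """Count the encoding units ("codepoints" in our sense) used for one character.

    unit_bytes is the size of one unit: 1 for UTF-8, 2 for UTF-16.
    """
    encoded = ch.encode(encoding)
    return len(encoded) // unit_bytes


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        utf8_units = codepoints_per_character(ch, "utf-8", 1)
        # utf-16-be is used so that no byte order mark is added to the output.
        utf16_units = codepoints_per_character(ch, "utf-16-be", 2)
        print(f"U+{ord(ch):04X}: UTF-8 units={utf8_units}, UTF-16 units={utf16_units}")
```

The four sample characters need 1, 2, 3, and 4 units in UTF-8, and 1, 1, 1, and 2 units in UTF-16.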

Code page - An alternate identifier for a Character Set Encoding.

Character Code - The numeric value assigned to a character in a character set, independent of any specific encoding of that character set. For any fixed-width encoding (all characters have the same size representation), there is no distinction between a codepoint and a character code.
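As a rough illustration of a character code being independent of encoding, the Python sketch below takes one character and shows its single character code next to its differing byte representations. The helper name describe is hypothetical and the choice of three Unicode encodings is just an assumption for the example.

```python
def describe(ch: str):
    """Return the character code and the hex bytes of several encodings of ch."""
    character_code = ord(ch)  # the same value no matter which encoding is used
    encodings = ["utf-8", "utf-16-be", "utf-32-be"]
    representations = {enc: ch.encode(enc).hex() for enc in encodings}
    return character_code, representations


if __name__ == "__main__":
    code, reps = describe("\u20ac")  # EURO SIGN
    print(f"character code: 0x{code:04X}")
    for enc, hex_bytes in reps.items():
        print(f"  {enc}: {hex_bytes}")
```

Note that in the fixed-width utf-32-be case the bytes are simply the character code itself, padded to 4 bytes.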

Character Set - An abstract set of characters, independent of any specific encoding scheme. Examples are the Unicode character set and the USASCII character set.

Character Set Encoding - A specific representation of a character set as bytes or bits of data. A character set encoding is usually identified by a standard character set name or a recognized alias name, or by a code page identifier. These identifiers are standardized by the IANA. Examples are UTF-8, USASCII, GB2312, ebcdic-cp-it, ISO-8859-5, UTF-16BE, Shift_JIS. The DFDL standard allows implementation-specific character set encodings to be supported, and standardizes one DFDL-specific name, USASCII-7bit-packed.
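For a feel of how names and aliases resolve to one encoding, here is a sketch using Python's codecs registry. This is an assumption-laden stand-in: Python's registry overlaps with the IANA names but is not the IANA registry, and not every name listed above is necessarily known to it. The helper same_encoding is hypothetical.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """True if both names resolve to the same codec in Python's registry.

    codecs.lookup raises LookupError for a name Python does not recognize.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


if __name__ == "__main__":
    print(same_encoding("US-ASCII", "ascii"))     # two names, one encoding
    print(same_encoding("UTF-8", "utf8"))
    print(same_encoding("UTF-16BE", "UTF-16LE"))  # different encodings
```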

Character Width - The number of codepoints or bytes used to represent a character in a specific character set encoding is called the character width. Encodings are either fixed-width (all characters encoded using the same width) or variable-width (different characters are encoded using different widths). For example, the UTF-32 character set encoding has a 4-byte character width, whereas USASCII has a 1-byte character width.
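The fixed versus variable distinction can be checked empirically with a few lines of Python. This is only a sketch: the function name character_widths is hypothetical, and observing a single width on a sample of text does not prove an encoding is fixed width in general.

```python
def character_widths(text: str, encoding: str) -> list:
    """Return the sorted distinct byte widths used by the characters of text."""
    return sorted({len(ch.encode(encoding)) for ch in text})


if __name__ == "__main__":
    sample = "A\u00e9\u20ac"
    print(character_widths(sample, "utf-32-be"))  # one width: 4 bytes each
    print(character_widths("ABC", "ascii"))       # one width: 1 byte each
    print(character_widths(sample, "utf-8"))      # several widths
```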

Fixed-Width Character Encoding - A character set encoding where all characters are encoded using a single codepoint for their representation. Note that a codepoint may take up one or more bytes.

Surrogate Pair - A Unicode character whose character code value is greater than 0xFFFF can be encoded in UTF-16BE or UTF-16LE, which are variable-width encodings when the DFDL property utf16Width='variable'. In this case the representation uses two adjacent codepoints, each of which is called a surrogate, and the pair is called a surrogate pair.
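For reference, the standard UTF-16 arithmetic that produces the two surrogates can be sketched in Python as follows. The function name to_surrogate_pair is hypothetical; the constants are those of UTF-16 itself, not anything DFDL-specific.

```python
def to_surrogate_pair(character_code: int):
    """Split a character code above 0xFFFF into its two UTF-16 surrogates."""
    if not 0x10000 <= character_code <= 0x10FFFF:
        raise ValueError("only character codes 0x10000..0x10FFFF use surrogates")
    offset = character_code - 0x10000   # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the first surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the second
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"0x{high:04X} 0x{low:04X}")
    # Cross-check against Python's own UTF-16BE encoder.
    expected = "\U0001F600".encode("utf-16-be")
    print(high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected)
```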

Variable-Width Character Encoding - A character set encoding where characters are encoded using one or more codepoints for their representation, depending on which specific character is being encoded. An example is UTF-8, which uses from 1 to 4 bytes to encode a character.

...mike

--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com