[DFDL-WG] Glossary items needed (for a v12 errata?)

28 Jan 2013

We need to add some entries associated with character set and encoding
terminology that we use quite a bit.

I would note that our usage of the term 'codepoint' differs somewhat from
the Unicode Glossary: http://unicode.org/glossary. First, we write codepoint
as one word, not "code point" (there was some inconsistency on this that I
have now fixed). Second, what we call a codepoint is closer to what the
Unicode Glossary calls a 'code unit'. I suspect we should just provide our
own definitions rather than switching terms, but I'm open to converting all
uses of codepoint to "code unit" if we want to.

Encoding - See *Character Set Encoding*

Codepoint - When a *character set encoding* uses *variable width*
representations for characters, the units making up these variable width
representations are called codepoints. For example, the UTF-8 encoding uses
between 1 and 4 codepoints to represent a character, and for UTF-8 the
codepoints are single bytes. The UTF-16 encoding is either fixed or
variable width. When dfdl:utf16Width='variable', this encoding uses either
one or two codepoints per character, and each codepoint is a 16-bit value.
When a character set encoding is fixed width, there is no distinction
between a codepoint and a character code.
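
To make the codepoint definition concrete, here is a small Python sketch that
splits an encoded string into its units. The helper name codepoints() and the
unit_bytes parameter are mine, for illustration only; they are not DFDL names.
Python's "utf-16-be" codec always behaves like utf16Width='variable'.

```python
def codepoints(text, encoding, unit_bytes):
    """Encode text, then split the bytes into units of unit_bytes each (hex)."""
    data = text.encode(encoding)
    return [data[i:i + unit_bytes].hex() for i in range(0, len(data), unit_bytes)]

# UTF-8: each codepoint (in the sense used above) is a single byte.
print(codepoints("A", "utf-8", 1))               # one unit
print(codepoints("\u20ac", "utf-8", 1))          # EURO SIGN, three units

# UTF-16BE: each codepoint is a 16-bit value.
print(codepoints("\u20ac", "utf-16-be", 2))      # one unit
print(codepoints("\U0001F600", "utf-16-be", 2))  # two units
```

In Unicode Glossary terms these units are 'code units', which is the
terminology difference noted at the top of this message.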

Code page - An alternate identifier for a Character Set Encoding.

Character Code - The numeric value assigned to a character in a character
set that is independent of any specific encoding of that character set. For
any fixed-size encoding (all characters have the same size representation)

Character Set - An abstract set of characters independent of any specific
encoding scheme. Examples are the Unicode character set and the USASCII
character set.

Character Set Encoding - A specific representation of a character set as
bytes or bits of data. A character set encoding is usually identified by a
standard character set name or a recognized alias name, or by a *code page*
identifier. These identifiers are standardized by the IANA. Examples are
UTF-8, USASCII, GB2312, ebcdic-cp-it, ISO-8859-5, UTF-16BE, Shift_JIS. The
DFDL standard allows for implementation-specific character set encodings to
be supported, and standardizes one DFDL-specific name, which is
USASCII-7bit-packed.
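
A quick Python sketch of the character set versus character set encoding
distinction: one character code, three different byte representations. The
codec names below are Python's own spellings, which I am assuming correspond
to the IANA names listed above; the variable names are just for illustration.

```python
ch = "\u042f"  # CYRILLIC CAPITAL LETTER YA

# The character code is independent of any encoding.
character_code = ord(ch)

# The same character under three character set encodings (hex bytes).
representations = {
    enc: ch.encode(enc).hex()
    for enc in ("utf-8", "iso-8859-5", "utf-16-be")
}

print(hex(character_code))
for enc, hex_bytes in representations.items():
    print(enc, hex_bytes)
```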

Character Width - The number of codepoints or bytes used to represent a
character in a specific character set encoding is called the character
width. Encodings are either fixed-width (all characters encoded using the
same width) or variable-width (different characters are encoded using
different widths). For example, the UTF-32 character set encoding has a
4-byte character width, whereas USASCII has a 1-byte character width.
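
As a rough illustration, character width in bytes can be measured in Python
like this. char_width(), is_fixed_width() and SAMPLE are hypothetical names,
and checking a handful of sample characters only gives evidence about an
encoding; it does not prove the encoding is fixed-width.

```python
def char_width(ch, encoding):
    """Width in bytes of a single character in the given encoding."""
    return len(ch.encode(encoding))

def is_fixed_width(encoding, sample):
    """True if every character in sample encodes to the same number of bytes."""
    return len({char_width(ch, encoding) for ch in sample}) == 1

# 'A', e-acute, euro sign, and a character above 0xFFFF.
SAMPLE = "A\u00e9\u20ac\U0001F600"

for enc in ("utf-32-be", "utf-16-be", "utf-8"):
    widths = [char_width(ch, enc) for ch in SAMPLE]
    print(enc, widths, "fixed" if is_fixed_width(enc, SAMPLE) else "variable")
```

I used "utf-32-be" and "utf-16-be" rather than "utf-32" and "utf-16" because
the latter prepend a byte order mark when encoding, which would inflate the
measured widths.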

Fixed-Width Character Encoding - A character set encoding where all
characters are encoded using a single codepoint for their representation.
Note that a codepoint may take up one or more bytes.

Surrogate Pair - A Unicode character whose character code value is greater
than 0xFFFF can be encoded in UTF-16BE or UTF-16LE, which are
variable-width encodings when the DFDL property utf16Width='variable'.
In this case the representation uses two adjacent *codepoints*, each of
which is called a surrogate, and the pair of which is called a surrogate
pair.
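
For reference, the standard UTF-16 arithmetic for deriving the two surrogates
from a character code can be sketched in Python as follows. The function name
surrogate_pair() is mine; the constants are the usual UTF-16 surrogate range
bases.

```python
def surrogate_pair(character_code):
    """Return (high, low) 16-bit surrogates for a character code above 0xFFFF."""
    if not 0x10000 <= character_code <= 0x10FFFF:
        raise ValueError("character code does not need a surrogate pair")
    offset = character_code - 0x10000     # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16BE encoder.
expected = "\U0001F600".encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```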

Variable-Width Character Encoding - A character set encoding where
characters are encoded using one or more codepoints for their
representation depending on which specific character is being encoded. An
example is UTF-8, which uses from 1 to 4 bytes to encode a character.

...mike

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com