I'm a little confused. I think the language is there in the spec. Spec v32 says, in the property table row for "encoding":

encoding | Enum. Values are IANA charsets or CCSIDs [1]. This property can be computed by way of an expression which returns the appropriate string. Note that there is, deliberately, no concept of 'native' encoding [2]. Conforming DFDL v1.0 processors must accept at least 'UTF-8', 'UTF-16', 'UTF-16BE', 'UTF-16LE', 'ASCII', and 'ISO-8859-1' as encoding names. Encoding names are case-insensitive, so 'utf-8' and 'UTF-8' are equivalent. The 'UTF-16' encoding requires that dfdl:byteOrder is defined. | Annotation: dfdl:format
In the references it lists:
IANA character set encoding names: (http://www.iana.org/assignments/character-sets)
I agree that neither the minimum list above in the box, nor the
reference to the IANA list are sufficient.
I did not find any "x-" character sets described there.
Also, searching the IANA list I find no mention of BOM. So what list of
encodings are you referring to?
I did find mention of UTF-16/UCS-2 requiring a BOM in ICU. This may be a usage pattern that ICU supports, and if I were coding up something hoping it would be useful, this might be what I would have done too; however, I have not seen data with this behavior, so I have to question whether it is actually in use anywhere. Is it?
While we're trimming options on encodings: with some web searching I wasn't able to find the standard for CCSID anywhere other than at an IBM web site. So while there is a CCSID for the iso-8859-1 encoding, that doesn't mean CCSIDs are an ISO standard, only that some CCSIDs correspond to ISO character sets.
Based on this I suggest dropping CCSID support, since it is a vendor standard only (if I'm correct). If it is, however, a de facto standard even outside the IBM context, then I'll retract this suggestion.
W.r.t. BOMs: I spent quite a lot of time on them, mostly due to the hassle that Unicode specifically says they are not characters; hence I was shooting for a semantics where a 10-character string could occupy either 10 or 11 codepoints depending on whether a BOM is present or absent, thereby turning many fixed-length things into variable-length ones. Length determination gets pretty complex if you do this: you have to look at quite a few properties just to decide whether something is fixed or variable length.
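(To make the length problem concrete, here is a quick illustration in Java, assuming the usual Sun JDK behavior where the plain "UTF-16" charset emits a BOM on encode and the BE/LE variants emit none:

    import java.nio.charset.Charset;

    public class BomLength {
        public static void main(String[] args) {
            String s = "ten chars!"; // exactly 10 characters
            // The plain "UTF-16" encoder prepends a BOM; UTF-16BE does not.
            System.out.println(s.getBytes(Charset.forName("UTF-16")).length);   // 22 bytes: BOM + 10 code units
            System.out.println(s.getBytes(Charset.forName("UTF-16BE")).length); // 20 bytes: no BOM
        }
    }

So the same 10-character content is 20 or 22 bytes depending on the BOM, which is exactly what makes "fixed length" slippery.)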
The last proposal before we dropped BOMs altogether was to have a special character set, UTF-16-VL (for "variable length"), meaning there may or may not be a BOM. We concluded that this doesn't belong in DFDL. I do think the right way to solve this BOM problem is with identification of encodings that allow/require/prohibit use of BOMs; since a BOM is not a character, it must be part of the character set encoding. E.g., UTF-16-BOM-required, UTF-16-BOM-prohibited, UTF-16-BOM-allowed, etc. Somebody other than DFDL should pick the names. The same issue comes up with UTF-16 with and without the surrogate-pair crud, i.e., do you want the number of codepoints, or do you want a surrogate pair counted as one character? We used to have lengthUnits="fullUnicodeCharacters" to specify this behavior, but that has been dropped as too complex as well. Again, UTF-16-VL was the last suggested way to fix this: VL for "variable length", meaning interpret the BOMs, the surrogate pairs, etc.
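(To illustrate the counting question that lengthUnits="fullUnicodeCharacters" was meant to resolve, this is the underlying Unicode issue, shown with plain Java APIs:

    public class SurrogateCount {
        public static void main(String[] args) {
            // MUSICAL SYMBOL G CLEF, U+1D11E, lies outside the BMP.
            String clef = new String(Character.toChars(0x1D11E));
            System.out.println(clef.length());                         // 2 UTF-16 code units (a surrogate pair)
            System.out.println(clef.codePointCount(0, clef.length())); // 1 Unicode character
        }
    }

A "length 10" UTF-16 field can therefore hold anywhere from 5 to 10 characters, depending on which unit you count in.)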
One other issue of this kind is the weird variant of UTF-8 where surrogate pairs are encoded as 3 bytes each rather than using the standard 4-byte UTF-8 way of encoding a 20-bit character code. Again, this should be a new character set encoding name, e.g., utf-8-encoded-surrogate-pairs. There's also Java's funny UTF-8 variant, where zero is encoded as 2 bytes. These are all cases where there is a funny encoding but no standard IANA name for it.
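(Java's variant is easy to observe; this sketch relies on the documented behavior of DataOutputStream.writeUTF, which uses the "modified UTF-8" form:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8 {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF("\u0000");       // writeUTF emits a 2-byte length, then modified UTF-8
            byte[] b = bytes.toByteArray();
            // Standard UTF-8 encodes U+0000 as the single byte 0x00;
            // modified UTF-8 encodes it as the pair 0xC0 0x80.
            System.out.printf("%02X %02X%n", b[2] & 0xFF, b[3] & 0xFF); // prints: C0 80
        }
    }
)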
If someone would like to co-author a suggestion for some new IANA charset encoding names, to propose to whoever owns that registry, I would happily contribute.
At this point, I'm pretty convinced that we should just say that for DFDL v1 a BOM is a codepoint, and we treat it like any other codepoint. I also haven't seen any real use of BOMs: in memory, people use native forms and don't have these, and externally UTF-8 seems preferred. I'd like to hear of real BOM usage examples.
I also don't think we "have to" support them, in that a BOM can be treated like an optional element that might or might not exist before a string. Using a combination of valueCalc properties, defaults, and a calculated value for the byteOrder property, one can, I believe, achieve every combination of optional or required BOM, and generate or omit them on output in whatever situations call for it. It will be clumsy, but I prefer this to putting a bunch of speculative features into the standard where we don't really have a strong usage model in mind.
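(For what it's worth, the parse-side logic is not hard to sketch. Here is roughly what the optional-element treatment amounts to, written as a hypothetical Java helper; the names are mine, not anything in the spec:

    import java.io.IOException;
    import java.io.PushbackInputStream;
    import java.nio.ByteOrder;

    public class OptionalBom {
        // Peek at the first two bytes and consume them only if they form a UTF-16 BOM.
        // Returns the byte order the BOM implies, or the supplied default when absent.
        // The stream must be constructed with a pushback buffer of at least 2 bytes.
        static ByteOrder readOptionalBom(PushbackInputStream in, ByteOrder dflt)
                throws IOException {
            int b0 = in.read();
            int b1 = in.read();
            if (b0 == 0xFE && b1 == 0xFF) return ByteOrder.BIG_ENDIAN;
            if (b0 == 0xFF && b1 == 0xFE) return ByteOrder.LITTLE_ENDIAN;
            if (b1 != -1) in.unread(b1);  // not a BOM: push both bytes back
            if (b0 != -1) in.unread(b0);
            return dflt;
        }
    }

The unparse side is symmetric: emit or omit the two BOM bytes according to the same properties.)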
…mike
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of Ian W Parkinson
Sent: Tuesday, June 24, 2008 9:26 AM
To: dfdl-wg@ogf.org
Subject: Re: [DFDL-WG] Required encodings and testing
Hi all,
I'd suggest that we only need to worry about those character sets described at http://www.iana.org/assignments/character-sets. Are the ones beginning "x-" specific to ICU? I think this would simplify the matter of BOMs somewhat, as we wouldn't need to deal explicitly with character sets that must have a BOM (presumably the -BOM variants), and so would make the 'spec-twister' a non-issue.
Unicode BOMs would remain a complex issue, though. If the schema specifies encoding="UTF-16BE" or "UTF-16LE" then our behaviour is clear enough going by the spec at http://www.ietf.org/rfc/rfc2781.txt: we never generate a BOM, and any BOM encountered is treated as a character. If the schema specifies just "UTF-16" (in which the BOM is strictly optional) then we'd honour any BOM at the top of the text field, defaulting to the specified dfdl:byteOrder value. On unparse we can choose whether or not to include a BOM; I'd suggest we always include a BOM and use dfdl:byteOrder (*). If a particular schema needs to control this more explicitly then it can use an expression to compute UTF-16BE or UTF-16LE as appropriate.
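(For what it's worth, I believe the JDK's own "UTF-16" charset already behaves roughly this way; a small check, assuming Sun's implementation, since other JREs may differ:

    import java.nio.charset.Charset;

    public class Utf16Default {
        public static void main(String[] args) {
            Charset utf16 = Charset.forName("UTF-16");
            // With a BOM, the BOM selects the byte order and is consumed.
            byte[] le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // LE BOM + 'A'
            System.out.println(new String(le, utf16));          // prints: A
            // Without a BOM, RFC 2781 says default to big-endian.
            byte[] noBom = {0x00, 0x41};
            System.out.println(new String(noBom, utf16));       // prints: A
            // On encode, the JDK writes a big-endian BOM first.
            byte[] enc = "A".getBytes(utf16);
            System.out.printf("%02X %02X%n", enc[0] & 0xFF, enc[1] & 0xFF); // FE FF
        }
    }
)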
That would leave the following edge case: a schema which wants to generate BOMless data, so specifies (e.g.) UTF-16LE, but wants to tolerate and honour any BOM present on parse. Do we need to deal with this unusual situation? It perhaps could be handled through an optional hidden field, but would we want to make it easier to achieve?
(*) The alternative would be to leave the byte order up to the implementation, potentially allowing data to be output with the endianness in which it was received. This may be beneficial in some situations but would leave the schema author without a way to specify the byteOrder while still requiring a BOM to be generated.
Cheers,
Ian
Ian Parkinson
WebSphere ESB Development
From: "RPost" <rp0428@pacbell.net>
To: <dfdl-wg@ogf.org>
Date: 24/06/2008 01:58
Subject: [DFDL-WG] Required encodings and testing
Thanks for the response re encodings and issues. Very helpful. I put my responses in the attachment but here is the first part, about encoding.
Your response: "We haven't picked a basic set that all conforming implementations must support other than that UTF-8 and USASCII must be supported. We might require more than this though."
That's a relief!
The current spec mentions UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE.
Since Java 1.6 supports 160 encodings using 686 aliases, I've no doubt you see the reason for my question about which encodings need initial support.
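(Those counts are easy to reproduce on a given JRE, e.g.:

    import java.nio.charset.Charset;

    public class ListCharsets {
        public static void main(String[] args) {
            int names = 0, aliases = 0;
            for (Charset cs : Charset.availableCharsets().values()) {
                names++;
                aliases += cs.aliases().size();
            }
            System.out.println(names + " charsets, " + aliases + " aliases");
        }
    }

The exact numbers vary by JRE vendor and version.)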
ICU supports even more encodings, and requiring some of these could implicitly require implementors to support ICU. Not an issue if that is truly needed, but that requirement alone could dissuade some from participating in the project.
The encodings I have examined/tested so far are: US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047, IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM.
I have not run across any issues with any of the above encodings.
ICU includes 175 UCM files, of which 135 are for SBCS encodings. I have not tested or examined all of these but would not expect them to be an issue either.
Also not examined are the 27 UCM files for MBCS encodings. A brief review shows that many of these should not be an issue.
BIG5 or GB18030 could definitely be an issue, and there are several others like these that might require a custom effort to support. OK if you really need it, but better delayed initially if you don't.
Glad to know we don't need to visit these for the short term. I'm sure implementers would much rather concentrate on the DFDL aspect of things rather than become encoding experts. I know I would.
[1] CCSID stands for Coded Character Set ID, a numeric identifier for a codepage (e.g., 037 for the EBCDIC codepage mentioned above). TBD: cite relevant standard for CCSIDs here.
[2] The concept of native character encoding is avoided in DFDL since a DFDL schema containing such a property binding does not contain a complete description of the data, but rather an incomplete one which is parameterized by characteristics of the operating environment where the DFDL processor executes. In DFDL this same behavior is achieved through use of true parameterization, for example by use of Selectors to choose among annotations specifying different character set encoding property bindings.