Hi all,
I'd suggest that we need only worry
about the character sets described at http://www.iana.org/assignments/character-sets.
Are the ones beginning "x-" specific to ICU? I think this would
simplify the matter of BOMs somewhat, as we wouldn't need to deal explicitly
with character sets that must have a BOM (presumably the -BOM variants)
and so make the 'spec-twister' a non-issue.
Unicode BOMs would remain a complex
issue, though. If the schema specifies encoding="UTF-16BE" or
"UTF16-LE" then our behaviour is clear enough going by the spec
at http://www.ietf.org/rfc/rfc2781.txt
- we never generate a BOM, and any BOM encountered is treated as a character.
If the schema specifies just "UTF-16" (in which the BOM is strictly
optional) then we'd honour any BOM at the top of the text field, defaulting
to the specified dfdl:byteOrder value. On unparse we can choose whether
or not to include a BOM - I'd suggest we always include a BOM and use dfdl:byteOrder
(*). If a particular schema needs to control this more explicitly then
they can use an expression to compute UTF-16BE or UTF-16LE as appropriate.
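The rules above (for "UTF-16": honour a leading BOM on parse, fall back to dfdl:byteOrder when none is present) can be sketched roughly as follows. This is only an illustrative sketch, not implementation text - the method name and the idea of passing the dfdl:byteOrder value in as a java.nio.ByteOrder are my own assumptions:

```java
import java.nio.ByteOrder;

public class BomSketch {
    // Hypothetical sketch: resolve the byte order of a text field whose
    // schema says encoding="UTF-16" (BOM optional per RFC 2781).
    // dfdlByteOrder stands in for the schema's dfdl:byteOrder property.
    static ByteOrder resolveUtf16Order(byte[] data, ByteOrder dfdlByteOrder) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF, b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return ByteOrder.BIG_ENDIAN;    // BOM FE FF
            if (b0 == 0xFF && b1 == 0xFE) return ByteOrder.LITTLE_ENDIAN; // BOM FF FE
        }
        // No BOM: default to the specified dfdl:byteOrder value.
        return dfdlByteOrder;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // "A", LE, with BOM
        System.out.println(resolveUtf16Order(withBom, ByteOrder.BIG_ENDIAN));
        byte[] noBom = {0x00, 0x41}; // "A", BE, no BOM
        System.out.println(resolveUtf16Order(noBom, ByteOrder.BIG_ENDIAN));
    }
}
```

For "UTF-16BE"/"UTF-16LE" no such resolution step applies: the order is fixed by the encoding name and a leading FE FF / FF FE sequence is simply decoded as a character (ZERO WIDTH NO-BREAK SPACE).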
That would leave the following edge-case:
a schema which wants to generate BOMless data so specifies (e.g.) UTF-16LE,
but wants to tolerate and honour any BOM present on parse. Do we need to
deal with this unusual situation? It perhaps could be handled through an
optional hidden field, but would we want to make it easier to achieve?
(*) the alternative would be to leave
the byte order up to the implementation, potentially allowing data to be
output with the endianness in which it was received. This may be beneficial
in some situations but would leave the schema author without a way to specify
the byteOrder while still requiring a BOM to be generated.
Cheers,
Ian
Ian Parkinson
WebSphere ESB Development
Mail Point 211, Hursley Park, Hursley, Winchester, SO21 2JN, UK
From: "RPost" <rp0428@pacbell.net>
To: <dfdl-wg@ogf.org>
Date: 24/06/2008 01:58
Subject: [DFDL-WG] Required encodings and testing
Thanks for the response re encodings and
issues. Very helpful.
I put my responses in the attachment but
here is the first part about encoding.
Your response: We haven't picked a
basic set that all conforming implementations must support other than that
UTF-8 and USASCII must be supported. We might require more than this though.
That’s a relief!
The current spec mentions UTF-8, ebcdic-cp-us
(IBM037), and UTF-16BE.
Since Java 1.6 supports 160 encodings using
686 aliases, I've no doubt you see the reason for my question about which
encodings need initial support.
ICU supports even more encodings and requiring
some of these could implicitly require implementors to support ICU. Not
an issue if that is truly needed but that requirement alone could dissuade
some from participating in the project.
The encodings I have examined/tested so far
are: US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE,
UTF-32LE, IBM1047, IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM.
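A quick way to see which of the names above a given JRE resolves is java.nio.charset.Charset.isSupported. Only the class name and list are taken from this message; note that the Java spec guarantees only a small core set (US-ASCII, ISO-8859-1, UTF-8, UTF-16 and its BE/LE variants), so availability of the IBM and x-/X- BOM variants depends on the particular JRE's extended charset provider:

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // The encodings listed in this message; support for the non-core
        // ones varies between JREs.
        String[] names = {
            "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
            "UTF-32", "UTF-32BE", "UTF-32LE", "IBM1047", "IBM500", "IBM037",
            "x-UTF-16LE-BOM", "X-UTF-32BE-BOM", "X-UTF-32LE-BOM"
        };
        for (String name : names) {
            System.out.println(name + ": "
                + (Charset.isSupported(name) ? "supported" : "NOT supported"));
        }
    }
}
```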
I have not run across any issues with any
of the above encodings.
ICU includes 175 UCM files of which 135 are
for SBCS encodings. I have not tested or examined all of these but would
not expect them to be an issue either.
Also not examined are the 27 UCM files for
MBCS encodings. A brief review shows that many of these should not be an
issue.
BIG5 or GB18030 could definitely be an issue,
and there are several others like these that might require a custom effort
to support. OK if you really need it, but better delayed initially if you
don't.
Glad to know we don't need to visit these
for the short term. I'm sure implementers would much rather concentrate
on the DFDL aspect of things rather than become encoding experts. I know
I would.
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU