Several people in the DFDL WG are hoping
to use the ICU source code as part of a DFDL implementation.
Some modifications will be necessary
(number patterns are enhanced somewhat in DFDL), but in general the hope is to
not reinvent all the character set encoding/decoding technology.
While it is true that DFDL does not want
to implicitly require ICU through the spec alone, the fact that ICU is there,
is open source, has an appropriate license allowing general use, and has
comprehensive encoding support removes much of the pressure to minimize
encoding/decoding support in DFDL or any other modern spec. It’s not hard
anymore to provide quite a broad suite of encodings. The hardest part is a test
suite that illustrates correct use of each.
…mike
Tel: 781-810-2100
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of RPost
Sent: Monday, June 23, 2008 9:03 PM
To: dfdl-wg@ogf.org
Subject: [DFDL-WG] Required encodings and testing
Thanks for the response re encodings and issues. Very
helpful.
I put my responses in the attachment but here is the first
part about encoding.
Your response: We haven't
picked a basic set that all conforming implementations must support other than
that UTF-8 and USASCII must be supported. We might require more than this
though.
That’s a relief!
The current spec mentions UTF-8, ebcdic-cp-us (IBM037), and
UTF-16BE.
Since Java 1.6 supports 160 encodings using 686 aliases, I've
no doubt you see the reason for my question about which encodings need initial
support.
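
For anyone who wants to check figures like these on their own runtime, here is a minimal sketch using the standard java.nio.charset API. The class and method names (CharsetCensus, countCharsets, countAliases) are made up for illustration, and the totals will vary with the JVM version and installed charset providers, so they should not be expected to match the numbers quoted above.

```java
import java.nio.charset.Charset;

// Hypothetical helper: counts the charsets and aliases the running JVM reports.
class CharsetCensus {
    /** Number of charsets this JVM reports as available. */
    static int countCharsets() {
        return Charset.availableCharsets().size();
    }

    /** Total number of aliases across all available charsets. */
    static int countAliases() {
        int total = 0;
        for (Charset cs : Charset.availableCharsets().values()) {
            total += cs.aliases().size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countCharsets() + " charsets, "
                + countAliases() + " aliases");
    }
}
```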
ICU supports even more encodings, and requiring some of these
could implicitly require implementers to use ICU. Not an issue if that is
truly needed, but that requirement alone could dissuade some from participating
in the project.
The encodings I have examined/tested so far are: US-ASCII,
ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE,
IBM1047, IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM.
I have not run across any issues with any of the above
encodings.
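
One simple kind of check for a list like this is an encode/decode round trip. The sketch below is only an illustration and not necessarily the testing described above: the class name RoundTripCheck, the check method, and the ASCII-only sample string are assumptions. Charsets the running JVM does not provide (the IBM ones may be absent on some runtimes) are reported as unsupported rather than treated as failures.

```java
import java.nio.charset.Charset;

// Hypothetical round-trip check: encode a sample string, decode it again, compare.
class RoundTripCheck {
    // Encoding names as listed in the message above.
    static final String[] NAMES = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
        "UTF-32", "UTF-32BE", "UTF-32LE", "IBM1047", "IBM500", "IBM037",
        "x-UTF-16LE-BOM", "X-UTF-32BE-BOM", "X-UTF-32LE-BOM"
    };

    /** Returns "ok", "MISMATCH", or "unsupported" for the named charset. */
    static String check(String name, String sample) {
        if (!Charset.isSupported(name)) {
            return "unsupported";
        }
        Charset cs = Charset.forName(name);
        byte[] encoded = sample.getBytes(cs);      // text -> bytes
        String decoded = new String(encoded, cs);  // bytes -> text
        return sample.equals(decoded) ? "ok" : "MISMATCH";
    }

    public static void main(String[] args) {
        // ASCII-only sample so every listed charset can represent it.
        String sample = "DFDL 123";
        for (String name : NAMES) {
            System.out.println(name + ": " + check(name, sample));
        }
    }
}
```

A round trip only shows that encoding and decoding are consistent with each other. Confirming the actual byte values would need reference data for each encoding.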
ICU includes 175 UCM files of which 135 are for SBCS
encodings. I have not tested or examined all of these but would not expect them
to be an issue either.
Also not examined are the 27 UCM files for MBCS encodings. A
brief review shows that many of these should not be an issue.
BIG5 or GB18030 could definitely be an issue, and there are
several others like these that might require a custom effort to support. OK if
you really need them, but better deferred initially if you don't.
Glad to know we don't need to visit these in the short
term. I'm sure implementers would much rather concentrate on the DFDL aspect of
things than become encoding experts. I know I would.