Ian raises a couple of interesting issues.
1. Exactly how should encodings be specified?
I like the idea of using
http://www.iana.org/assignments/character-sets if possible.
Even there the ICU *.ucm files don't always use the 'Name'
specified in the registry, and the registry does not include
'x-' prefixed names (experimental, meaning not a standard).
Java mostly uses the standard 'Name' value but doesn't
always match the ICU name.
Even within the registry, the 'Name' itself may not be the
preferred MIME name. For example:
    Name:    Extended_UNIX_Code_Packed_Format_for_Japanese
    MIBenum: 18
    Alias:   csEUCPkdFmtJapanese
    Alias:   EUC-JP (preferred MIME Name)
Java supports the name and both aliases. ICU has several entries in
its 'convrtrs.txt' file, including X-EUC-JP.
The standard says 'The MIBenum value is a unique value for
use in MIBs to identify coded
character sets.'
Perhaps using the standard 'MIBenum' value for uniqueness, plus the
'Name' and any or all of the aliases, would work. I'm guessing that an
appendix will ultimately provide the list?
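For what it's worth, Java's java.nio.charset API already does this kind
of alias resolution, so it's easy to check which registry names and
aliases a given runtime actually recognizes. This is only an
illustrative sketch (the class name is mine, and alias coverage is
implementation-dependent, so test rather than assume):

    import java.nio.charset.Charset;

    public class CharsetNameCheck {
        public static void main(String[] args) {
            // The registry Name and both aliases should resolve to the
            // same canonical charset on a typical JDK, but whether every
            // IANA alias is registered varies by implementation.
            String[] names = {
                "Extended_UNIX_Code_Packed_Format_for_Japanese",
                "csEUCPkdFmtJapanese",
                "EUC-JP"
            };
            for (String name : names) {
                if (Charset.isSupported(name)) {
                    Charset cs = Charset.forName(name);
                    System.out.println(name + " -> " + cs.name()
                            + " aliases: " + cs.aliases());
                } else {
                    System.out.println(name + " -> not supported here");
                }
            }
        }
    }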
2. For DFDL, when there is a conflict, which is more
important: adherence to a standard, or allowing schema
writers to express requirements unambiguously?
This line in the RFC 2781 link Ian provided caught my eye: '...addresses
the issues of serializing UTF-16 as an octet stream for transmission
over the Internet'.
Does this apply, since DFDL isn't really targeting
'transmission over the Internet'? You can't transmit a binary file
(whether it includes UTF-16 or not) over the Internet without
converting it first, often to Base64.
So I'm not sure the standard's rules for when to include/exclude
BOMs apply. I would suggest that it is critical to allow a schema
writer to specify exactly what to expect on parse and what is allowed
on unparse.
As long as the writer can do that using a supported encoding and
possibly DFDL properties such as byteOrder, we're covered. You may very
well need to provide a way to explicitly specify whether a BOM
'is/is not/might be' present on input and whether a BOM 'must be/can be'
written on output.
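To make the 'is/is not/might be' idea concrete, here is a rough sketch
of what a UTF-16 parser could do, assuming a hypothetical
byte-order-mark property with values REQUIRED/OPTIONAL/PROHIBITED. The
property, enum, and method names are my invention for illustration,
not anything from the draft spec:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class BomPolicy {
        public enum Policy { REQUIRED, OPTIONAL, PROHIBITED }

        // Decide the byte order of a UTF-16 field and consume the BOM if
        // one is present and allowed. 'defaultOrder' stands in for the
        // DFDL byteOrder property; Policy stands in for a hypothetical
        // byte-order-mark property.
        public static ByteOrder consumeBom(ByteBuffer buf, Policy policy,
                                           ByteOrder defaultOrder) {
            if (buf.remaining() >= 2) {
                int b0 = buf.get(buf.position()) & 0xFF;
                int b1 = buf.get(buf.position() + 1) & 0xFF;
                boolean bigEndian    = (b0 == 0xFE && b1 == 0xFF);
                boolean littleEndian = (b0 == 0xFF && b1 == 0xFE);
                if (bigEndian || littleEndian) {
                    if (policy == Policy.PROHIBITED) {
                        throw new IllegalArgumentException(
                                "BOM present but prohibited");
                    }
                    buf.position(buf.position() + 2);   // skip past the BOM
                    return bigEndian ? ByteOrder.BIG_ENDIAN
                                     : ByteOrder.LITTLE_ENDIAN;
                }
            }
            if (policy == Policy.REQUIRED) {
                throw new IllegalArgumentException(
                        "BOM required but not present");
            }
            return defaultOrder;   // no BOM: use the declared byte order
        }
    }

The unparse side would be the mirror image: write the BOM when the
property says it must (or may) be written, and otherwise rely on
byteOrder alone.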
I didn't mention it in my original post, but the BOM issue is
one of the related issues that Addison Phillips ran into writing
classes to serialize text into fixed-width fields.
Namely: for a fixed-width field (width in bytes), how do you
determine how many text characters of a specified encoding will fit
into the field? He had to take into account BOMs as well as ensure
that complete shift-in/shift-out sequences could be written without
overflow. Then he had a similar issue figuring out the padding.
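The general approach I've seen for that problem (and this is only a
sketch under my own assumptions, not Addison's code) is to let the
encoder do the counting: encode a growing prefix of the text into a
buffer of the field's size, flushing each time so that any BOM or
trailing shift sequence is included, and stop as soon as it overflows:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;

    public class FixedWidthFit {
        // Returns how many chars (UTF-16 code units) of 'text' can be
        // encoded into a field of 'fieldBytes' bytes in the given
        // charset, counting any BOM the encoder writes and any shift
        // sequences needed to close the field off cleanly.
        public static int charsThatFit(String text, Charset charset,
                                       int fieldBytes) {
            int fit = 0;
            int i = 0;
            while (i < text.length()) {
                // Step a whole code point at a time so surrogate pairs
                // stay intact.
                int next = i + Character.charCount(text.codePointAt(i));
                CharsetEncoder enc = charset.newEncoder();
                ByteBuffer out = ByteBuffer.allocate(fieldBytes);
                CoderResult r =
                        enc.encode(CharBuffer.wrap(text, 0, next), out, true);
                if (!r.isUnderflow() || enc.flush(out).isOverflow()) {
                    break;   // this prefix no longer fits once flushed
                }
                fit = next;
                i = next;
            }
            return fit;
        }
    }

It's quadratic in the field width, but for typical fixed-width records
that's negligible, and it avoids second-guessing how many bytes a
particular encoder spends on BOMs or state changes.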
In the consulting I do, my assumption is that data, legacy and
otherwise, doesn't necessarily obey the business rules it is supposed
to. I usually have no problem proving it, even if the user insists
otherwise. That is the #1 problem I run into as an ETL consultant.
Fields are NULL that shouldn't be. A name field in one
system is VARCHAR2(30) and in another the same field is VARCHAR2(40).
What do you do with data moving from a '40' to a '30'?
So for DFDL to support the parsing/reading of old legacy
data, possibly because the original tools don't exist anymore, a
schema writer has to be able to explicitly control how the data is
interpreted, whether it meets the standards or not.
3. To BOM or not to BOM - that is the question.
Use Ian's proposal for doing the standard thing in the standard way,
but ensure that there is at least some way for a schema writer to
explicitly control what parse/unparse will do. I wouldn't be inclined
to add anything to the spec at this point without a specific use case
that requires it.
Speaking of which - did anyone ever locate the links on your
site where I can find some of
your use case descriptions or discussions?