Ian raises a couple of interesting issues.

 

1. Exactly how should encodings be specified?

 

I like the idea of using http://www.iana.org/assignments/character-sets if possible.

 

Even here, the ICU *.ucm files don't always use the 'Name' specified in the standard, and the standard does not include the 'x-' prefix (experimental, meaning not a standard name).

 

Java mostly uses the standard 'Name' value but doesn't always match the ICU name.

 

Even using the standard, the 'Name' itself may not be the preferred MIME name. For example:

            Name: Extended_UNIX_Code_Packed_Format_for_Japanese

            MIBenum: 18

            Alias: csEUCPkdFmtJapanese

            Alias: EUC-JP (preferred MIME Name)

           

Java supports the name and both aliases. ICU has several entries in its 'convrtrs.txt' file, including X-EUC-JP.
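
For what it's worth, here is a quick way to see which names a given Java runtime accepts for this charset (just a sketch; the exact alias set varies by JDK vendor and version):

    import java.nio.charset.Charset;

    public class CharsetAliases {
        public static void main(String[] args) {
            // Look up by the preferred MIME name; the IANA 'Name' and the
            // csEUCPkdFmtJapanese alias usually resolve to the same Charset.
            Charset cs = Charset.forName("EUC-JP");
            System.out.println("Canonical name: " + cs.name());
            System.out.println("Aliases:        " + cs.aliases());
        }
    }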

 

The standard says 'The MIBenum value is a unique value for use in MIBs to identify coded character sets.'

 

Perhaps using the standard 'MIBenum' value for uniqueness, together with the 'Name' and any or all of the aliases, would work. I'm guessing that an appendix will ultimately provide the list?
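
To make that concrete, an appendix entry could carry the MIBenum, the Name and the aliases together. A purely hypothetical sketch (the class and field names are mine, not proposed spec text):

    import java.util.List;
    import java.util.Map;

    // Hypothetical shape of one entry: MIBenum gives uniqueness, while the
    // 'Name' plus its aliases are the strings a schema writer may use.
    record CharsetEntry(int mibEnum, String name, List<String> aliases) {}

    class CharsetRegistry {
        static final Map<Integer, CharsetEntry> BY_MIBENUM = Map.of(
            18, new CharsetEntry(18,
                    "Extended_UNIX_Code_Packed_Format_for_Japanese",
                    List.of("csEUCPkdFmtJapanese", "EUC-JP")));
    }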

 

2. For DFDL, when there is a conflict, which is more important: adherence to a standard, or allowing schema writers to express requirements unambiguously?

 

This line in the rfc2781 link Ian provides caught my eye: '...addresses the issues of serializing UTF-16 as an octet stream for transmission over the Internet'.

 

Does this apply, since DFDL isn't really targeting 'transmission over the Internet'? You can't transmit a binary file (whether it includes UTF-16 or not) over the Internet without converting it first, often to Base64.

 

So I'm not sure the standard for when to include/exclude BOMs applies. I would suggest that it is critical to allow a schema writer to specify exactly what to expect on parse and what is allowed on unparse.

 

As long as the writer can do that using a supported encoding and possibly DFDL properties such as byteOrder, we're covered. You may very well need to provide a way to explicitly specify whether a BOM 'is/is not/might be' present on input and whether a BOM 'must be/can be' written on output.
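
Just to make the 'might be' case concrete, the parse-side decision could look roughly like this (a sketch only; the class name and the fallback to the schema's byteOrder are my assumptions, not spec text):

    import java.nio.ByteOrder;

    public final class BomSniffer {
        // Decide the byte order of a UTF-16 field when a BOM might be present,
        // falling back to a schema-supplied default (e.g. DFDL byteOrder) if absent.
        public static ByteOrder detect(byte[] data, ByteOrder schemaDefault) {
            if (data.length >= 2) {
                if ((data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
                    return ByteOrder.BIG_ENDIAN;    // BOM bytes FE FF
                }
                if ((data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
                    return ByteOrder.LITTLE_ENDIAN; // BOM bytes FF FE
                }
            }
            return schemaDefault; // no BOM: honor whatever the schema says
        }
    }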

 

I didn't mention it in my original post, but the BOM issue is one of the related issues that Addison Phillips ran into writing classes to serialize text into fixed-width fields. Namely: for a fixed-width field (width in bytes), how do you determine how many text characters of a specified encoding will fit into the field? He had to take BOMs into account as well as ensure that complete 'shift in - shift out' sequences could be written without overflow. Then he had a similar issue figuring out the padding.
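
As a rough illustration of that counting problem (a sketch under my assumptions, not Addison's actual code), java.nio's CharsetEncoder does much of the work because it stops at a clean character boundary when the output buffer fills:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;

    public final class FixedWidthFit {
        // How many characters of 'text' fit into a field of 'fieldWidth' bytes
        // in the given encoding?
        public static int charsThatFit(String text, Charset charset, int fieldWidth) {
            CharsetEncoder encoder = charset.newEncoder();
            ByteBuffer out = ByteBuffer.allocate(fieldWidth);
            CharBuffer in = CharBuffer.wrap(text);
            // encode() stops cleanly on overflow, so no multi-byte character is split.
            CoderResult result = encoder.encode(in, out, true);
            if (!result.isOverflow()) {
                // For stateful encodings the closing shift sequence comes from flush();
                // if that itself overflows, the count is still too high, which is
                // exactly the wrinkle described above (not handled in this sketch).
                encoder.flush(out);
            }
            return in.position(); // char values consumed before the field filled up
        }
    }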

 

In the consulting I do, my assumption is that data, legacy and otherwise, doesn't necessarily obey the business rules it is supposed to. I usually have no problem proving it, even if the user insists otherwise. That is the #1 problem I run into as an ETL consultant.

 

Fields are NULL that shouldn't be. A name field in one system is VARCHAR2(30) and on another the same field is VARCHAR2(40). What do you do with data moving from a '40' to a '30'?

 

So, for DFDL to support the parsing/reading of old legacy data, possibly because the original tools don't exist anymore, a schema writer has to be able to explicitly control how the data is interpreted, whether it meets the standards or not.

 

3. To BOM or not to BOM - that is the question.

 

Use Ian's proposal for doing the standard thing in the standard way, but ensure that there is at least some way for a schema writer to control explicitly what parse/unparse will do. I wouldn't be inclined to add anything to the spec at this point without a specific use case that requires it.

 

Speaking of which - did anyone ever locate the links on your site where I can find some of your use case descriptions or discussions?