DFDL schemas can either:
1) specify fixed encoding(s)/byte order(s) for the data being described, or
2) specify that the encoding/byte order is provided by the 'context' that
invokes the DFDL processor (using the dfdl:defineVariable 'external'
facility). **
For case 1), DFDL is faced with a problem: what happens when the 'context'
provides an encoding/byte order for the data, but the DFDL xsd specifies a
different encoding/byte order?
I think DFDL must make a statement about this situation, as there are several
common scenarios where this could occur (HTTP, MIME, MQ).
It is worth looking at the precedent set by XML in this regard. The
analogous problem for XML is where the XML document itself specifies a
different encoding (using the ?xml declaration) from the one provided by
the context. The recommendations for XML are stated in the appendix
below - there is no universal rule.
It is more complicated with DFDL though.
A DFDL xsd can set up the encoding(s)/byte order(s) to use in several
different places. Which of those would the context override? All of them?
Just the one associated with the top-level structure?
My conclusion is therefore that for case 1) the DFDL xsd always wins, and
the context is ignored. If the user wants to use the encoding/byte order
from the context, then they must be explicit about this and use case 2)
above.
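As an illustration only, the proposed rule can be sketched in Python. This is not DFDL syntax: SchemaProperty and resolve are hypothetical names, and the context is modelled as a plain dictionary of externally supplied values.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SchemaProperty:
    """Hypothetical stand-in for one encoding or byte order setting in a DFDL xsd."""
    fixed_value: Optional[str] = None        # case 1: value fixed in the xsd
    external_variable: Optional[str] = None  # case 2: name of an external variable


def resolve(prop: SchemaProperty, context: Mapping[str, str]) -> str:
    """Return the value to use for one property under the proposed rule."""
    if prop.fixed_value is not None:
        # Case 1: the xsd always wins; whatever the context says is ignored.
        return prop.fixed_value
    if prop.external_variable is not None and prop.external_variable in context:
        # Case 2: the xsd explicitly asks for the value from the context.
        return context[prop.external_variable]
    raise ValueError("no value available from the xsd or the context")
```

For example, resolve(SchemaProperty(fixed_value="UTF-8"), {"encoding": "ISO-8859-1"}) returns "UTF-8", because the fixed value in the xsd takes precedence over the context.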
There are two things that we could allow
to be a bit more flexible:
a) Pre-define $encoding and $byteOrder variables in the DFDL namespace.
These would implicitly have 'external' = 'true' and perhaps a
'defaultValue' as well. This simplifies the coding of a DFDL xsd for
case 2).
b) State that it is an implementation decision whether to provide an
option to use a context encoding/byte order for case 1) instead of the
ones in the DFDL xsd. In such a case, the context MUST override all
encodings/byte orders in the system of xsds used by the DFDL processor.
(In practice this is invariably a single encoding/byte order.)
** (Might be more than encoding &
byte order - for example MQ also allows float format to be provided by
context)
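Options a) and b) above could behave roughly as in the following Python sketch. All names here are hypothetical, the default values are placeholders (the proposal does not say what the defaults would be), and the schema is reduced to a dictionary mapping each location in the xsds to its fixed encoding.

```python
from typing import Dict, Mapping

# Placeholder defaults for the pre-defined variables of option a).
# These values are assumptions made for this sketch only.
PREDEFINED_DEFAULTS = {"encoding": "UTF-8", "byteOrder": "bigEndian"}


def external_variable(name: str, context: Mapping[str, str]) -> str:
    """Option a): use the context value if supplied, else the default."""
    return context.get(name, PREDEFINED_DEFAULTS[name])


def effective_encodings(
    schema_encodings: Mapping[str, str],
    context: Mapping[str, str],
    allow_context_override: bool = False,
) -> Dict[str, str]:
    """Option b): with the implementation option switched on, a context
    encoding replaces every encoding in the system of xsds, not just the
    top-level one. Otherwise the xsd values are used unchanged."""
    context_encoding = context.get("encoding")
    if allow_context_override and context_encoding is not None:
        return {location: context_encoding for location in schema_encodings}
    return dict(schema_encodings)
```

The all-or-nothing replacement in effective_encodings reflects the MUST in option b): a partial override would bring back the question of which of the several encodings in the xsd the context applies to.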
Appendix: XML
The equivalent situation for XML is where the XML document specifies its
own encoding via the ?xml declaration, and the context also provides the
encoding. There is no single rule. In summary:
- Basically, if there is a higher-level protocol, then that defines the
  rules.
- E.g. for MIME content-type text/xml, the context encoding is used. If
  this is omitted, the XML is assumed to be US-ASCII. The ?xml declaration
  encoding is not used.
- E.g. for MIME content-type application/xml, the context encoding is used.
  If this is omitted, the ?xml declaration encoding is used.
- For files (where there is no context encoding), use of the ?xml
  declaration encoding is recommended.
Note that in Message Broker, we always
use the context encoding, as it should always be present. We never use
the ?xml declaration.
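The summary rules above can be written down as a small Python function. This is a sketch of the summary only, not of any real parser: the function name and parameters are hypothetical, media_type=None stands for the file case, and only the two media types named above are handled.

```python
from typing import Optional


def xml_encoding(
    media_type: Optional[str],
    context_charset: Optional[str],
    declared_encoding: Optional[str],
) -> Optional[str]:
    """Pick the encoding for an XML document from the context charset and
    the ?xml declaration. Returns None when neither source applies, leaving
    the caller to fall back on XML's own detection rules."""
    if media_type is None:
        # File: there is no context encoding, so the declaration is used.
        return declared_encoding
    if media_type == "text/xml":
        # The ?xml declaration is not used; a missing charset means US-ASCII.
        return context_charset or "us-ascii"
    if media_type == "application/xml":
        # The context wins if present, otherwise the declaration is used.
        return context_charset or declared_encoding
    raise ValueError("media type not covered by this sketch: " + media_type)
```

Note the asymmetry between the two MIME types: for text/xml a declaration of "UTF-8" with no charset parameter still yields "us-ascii", whereas for application/xml it yields "UTF-8".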
W3C XML 1.0 spec section F.2 Priorities in the Presence of External Encoding Information
The second possible case occurs when the XML entity is accompanied by
encoding information, as in some file systems and some network protocols.
When multiple sources of information are available, their relative
priority and the preferred method of handling conflict should be specified
as part of the higher-level protocol used to deliver XML. In particular,
please refer to [IETF RFC 3023] or its successor, which defines the
text/xml and application/xml MIME types and provides some useful guidance.
In the interests of interoperability, however, the following rule is
recommended.
- If an XML entity is in a file, the Byte-Order Mark and
encoding declaration are used (if present) to determine the character encoding.
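For the file case, a much simplified Python sketch of "Byte-Order Mark and encoding declaration" detection might look as follows. The function name is hypothetical and the simplifications are deliberate: a Byte-Order Mark, if present, decides the encoding on its own, the declaration is only searched for in ASCII-compatible data, and UTF-8 is assumed when neither is found.

```python
import codecs
import re

# UTF-32 marks are checked first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches encoding="..." inside a leading ?xml declaration (ASCII-compatible
# input only).
_DECLARATION = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_file_encoding(data: bytes) -> str:
    """Guess the encoding of an XML file from its first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Neither a Byte-Order Mark nor an encoding declaration: assume UTF-8.
    return "utf-8"
```

A production parser follows the fuller auto-detection procedure described in appendix F of the XML spec; this sketch only shows the order in which the two sources of information are consulted.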
IETF RFC 3023
3.6 Summary
The following list applies to text/xml, text/xml-external-parsed-entity,
and XML-based media types under the top-level type "text" that define
the charset parameter according to this specification:
o Charset parameter is strongly recommended.
o If the charset parameter is not specified, the default is "us-ascii".
  The default of "iso-8859-1" in HTTP is explicitly overridden.
o No error handling provisions.
o An encoding declaration, if present, is irrelevant, but when saving a
  received resource as a file, the correct encoding declaration SHOULD
  be inserted.
The next list applies to application/xml,
application/xml-external-parsed-entity, application/xml-dtd, and
XML-based media types under top-level types other than "text" that
define the charset parameter according to this specification:
o Charset parameter is strongly recommended, and if present, it takes
  precedence.
o If the charset parameter is omitted, conforming XML processors MUST
  follow the requirements in section 4.3.3 of [XML].
Regards
Steve Hanson
Programming Model Architect, WebSphere Message Brokers,
OGF DFDL WG Co-Chair,
Hursley, UK,
Internet: smh@uk.ibm.com,
Phone (+44)/(0) 1962-815848