Very interesting paragraph here from http://www.ietf.org/rfc/rfc2781.txt, emphasis mine.
It is important to understand that the character 0xFEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a byte-order mark. The contrapositive of that
statement is not always true: the character 0xFEFF in the first
position of a stream MAY be interpreted as a zero-width non-breaking
space, and is not always a byte-order mark. For example, if a process
splits a UTF-16 string into many parts, a part might begin with
0xFEFF because there was a zero-width non-breaking space at the
beginning of that substring.
In DFDL, we have no way of knowing whether a string is supposed to be the real
beginning of “a stream”, or is some chunk of the middle of something. For that
reason it is consistent for DFDL to ALWAYS interpret 0xFEFF as a ZWNBSP, and never
as a BOM.
If you want BOM behavior, it is because the beginning of a stream gets special
treatment; in that case it is reasonable to model the BOM as a separate element to
be found at the beginning of a “stream”, optionally hidden, perhaps optional, and
to compute dfdl:byteOrder in terms of its value.
I think this position is pretty well
supported by the above paragraph from rfc2781.
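To make the idea concrete, here is a rough Java sketch of the kind of logic I have in mind; the class, field, and method names here are made up for illustration and are not anything from the DFDL spec.

import java.nio.ByteOrder;

// Rough sketch only: the BOM is read as an ordinary, optional leading element,
// and the byte order is computed from its value, falling back to the
// schema-specified default (dfdl:byteOrder) when it is absent.
public final class BomElementSketch {

    public static final class BomElement {
        public final ByteOrder byteOrder;   // derived byte order
        public final int bytesConsumed;     // 2 if a BOM element was present, else 0
        BomElement(ByteOrder byteOrder, int bytesConsumed) {
            this.byteOrder = byteOrder;
            this.bytesConsumed = bytesConsumed;
        }
    }

    public static BomElement readOptionalBom(byte[] data, ByteOrder schemaDefault) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF, b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return new BomElement(ByteOrder.BIG_ENDIAN, 2);
            if (b0 == 0xFF && b1 == 0xFE) return new BomElement(ByteOrder.LITTLE_ENDIAN, 2);
        }
        // No BOM element present: any 0xFEFF later in the data is just a ZWNBSP codepoint.
        return new BomElement(schemaDefault, 0);
    }
}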
…mike
Tel: 781-810-2100
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, June 25, 2008 8:55 AM
To: mbeckerle.dfdl@gmail.com
Cc: dfdl-wg@ogf.org
Subject: Re: [DFDL-WG] Required encodings and testing
Some interesting and official stuff about BOMs here.
http://unicode.org/faq/utf_bom.html
In
IBM WMB we do see some XML UTF-16 data arriving with a BOM on the front of a
file/message, and we handle that. What we don't handle though is the occurrence
of a BOM part way through the file/message. I'm pretty sure it would be treated
as an ordinary code point.
Regards,
Steve
Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
" 25/06/2008 12:37
|
|
I’m a little confused. I think the language
is there in the spec:
Spec v32 says:
encoding | Enum.
Values are IANA charsets or CCSIDs [MJB1][1]. This property can be computed by way
of an expression which returns the appropriate string. Note that there is,
deliberately, no concept of 'native' encoding [2]. Conforming DFDL v1.0 processors
must accept at least “UTF-8”, “UTF-16”, “UTF-16BE”, “UTF-16LE”, “ASCII”, and
“ISO-8859-1” [MJB2] as encoding names. Encoding names are case-insensitive, so
“utf-8” and “UTF-8” are equivalent. The “UTF-16” encoding requires that
dfdl:byteOrder is defined.
Annotation: dfdl:format
In the references it lists:
IANA
character set encoding names: (http://www.iana.org/assignments/character-sets)
I agree that neither the minimum list above in the box, nor the reference to the
IANA list, is sufficient.
I did not find any “x-” character sets described there.
Also,
searching the IANA list I find no mention of BOM. So what list of encodings are
you referring to?
I did
find mention of UTF-16/UCS-2 requiring a BOM in the ICU. This may be a usage
pattern that ICU supports, and if I were coding up something hoping it would be
useful this might be what I would have done too; however, I have not seen data
with this behavior, so I have to question whether this is in any way in use
anywhere. Is it?
While we’re trimming options on encodings: with some web searching I wasn’t able
to find the standard for CCSID anywhere other than at an IBM web site.
So while there is a CCSID for the iso-8859-1 encoding, that doesn’t mean CCSIDs
are an ISO standard; it just means that some CCSIDs correspond to ISO character sets.
Based on this I suggest dropping CCSID support, since it is a vendor standard only
(if I’m correct). If it is, however, a de facto standard even outside the IBM
context, then I’ll retract this suggestion.
W.r.t. BOMs: I spent quite a lot of time on BOMs, mostly due to the hassle that
Unicode specifically says they are not characters; hence, I was shooting for a
semantics where a 10-character string could have either 10 or 11 codepoints in it,
due to a BOM being present or absent, thereby turning many fixed-length things into
variable-length ones. Length determination gets pretty complex if you do this: you
have to look at quite a few properties just to decide whether something is fixed or
variable length.
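A tiny illustration of the length problem, using Java’s built-in charsets (purely illustrative; nothing DFDL-specific here): the same 5-character string occupies a different number of bytes depending on whether the encoder emits a BOM.

import java.nio.charset.Charset;

// The same 5-character string has a different encoded length depending on
// whether the charset's encoder writes a BOM.
public class BomLengthDemo {
    public static void main(String[] args) {
        String s = "hello";  // 5 characters
        int withBom    = s.getBytes(Charset.forName("UTF-16")).length;    // 12 = 2-byte BOM + 10
        int withoutBom = s.getBytes(Charset.forName("UTF-16BE")).length;  // 10, no BOM
        System.out.println(withBom + " vs " + withoutBom);
    }
}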
The last proposal before we dropped BOMs altogether was to have a special character
set UTF-16-VL (for “variable length”), which means there may or may not be a BOM.
We concluded that this doesn’t belong in DFDL. I do think the right way to solve
this BOM problem is with identification of encodings that allow/require/prohibit
use of BOMs; since a BOM is not a character, it must be part of the character set
encoding. E.g., UTF-16-BOM-required, UTF-16-BOM-prohibited, UTF-16-BOM-allowed,
etc. Somebody other than DFDL should pick the names. The same issue comes up with
UTF-16 with and without the surrogate-pairs crud, i.e., do you want the number of
codepoints, or do you want a surrogate pair counted as one character? We used to
have lengthUnits=”fullUnicodeCharacters” to specify this behavior. This has also
been dropped as too complex. Again, UTF-16-VL was the last suggested way to fix
this, i.e., VL for variable length, meaning interpret the BOMs, the surrogate
pairs, etc.
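The codepoints-versus-“full characters” distinction is easy to see in Java (again, just an illustration): a character outside the BMP is one codepoint but two UTF-16 code units.

// One supplementary character: one Unicode codepoint, two UTF-16 code units (a surrogate pair).
public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // MUSICAL SYMBOL G CLEF
        System.out.println(clef.length());                          // 2 (UTF-16 code units)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 (codepoint)
    }
}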
One other issue of this kind is the weird variant of UTF-8 where surrogate pairs
are encoded as 3 bytes each rather than using the standard 4-byte UTF-8 way of
encoding a supplementary character code (this variant is generally known as
CESU-8). Again, this should be a new character set encoding name, e.g.,
utf-8-encoded-surrogate-pairs. There’s also Java’s funny UTF-8 variant
(“modified UTF-8”), where the zero codepoint is encoded as 2 bytes. These are all
cases where there is a funny encoding but no standard IANA name for it.
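For anyone who hasn’t run into it, the Java variant is easy to demonstrate: DataOutputStream.writeUTF uses the documented “modified UTF-8” form, where U+0000 becomes the two bytes C0 80 and a supplementary character is written as two 3-byte surrogate encodings (6 bytes) instead of one standard 4-byte sequence. Shown only as an illustration of the encoding, not as anything DFDL should adopt.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// writeUTF emits a 2-byte length prefix followed by modified UTF-8.
public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("\u0000" + new String(Character.toChars(0x1D11E)));
        for (byte b : buf.toByteArray()) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
        // Expected: 00 08 (length prefix) C0 80 (U+0000) ED A0 B4 ED B4 9E (U+1D11E as surrogates)
    }
}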
If
someone would like to co-author a suggestion for some new IANA charset encoding
names to propose to whomever that is, I would happily contribute.
At this
point, I’m pretty convinced that we should just say for DFDL v1 a BOM is
a codepoint and we treat it like any other codepoint.
I also
haven’t seen any real use of BOMs. In memory people use native forms and
don’t have these, and externally UTF-8 seems preferred. I’d like to
hear of real BOM usage examples.
I also don’t think we “have to” support them, in that a BOM can be treated like an
optional element that might or might not exist before a string. Using a combination
of valueCalc properties, defaults, and a calculated value for the byteOrder
property, one can, I believe, achieve every combination of optional or required
BOM, and generate or omit them on output in whatever situations are needed. It will
be clumsy, but I prefer this to putting a bunch of speculative features into the
standard where we don’t really have a strong usage model in mind.
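As a sketch of the unparse side of that model (names like emitBom are made up here, not DFDL properties), writing or omitting the BOM is just ordinary conditional output of the optional element, with the byte order driven by a property value:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.charset.Charset;

// The string itself is always encoded with an explicit-endian charset;
// the BOM, when wanted, is written as a separate leading element.
public class BomUnparseSketch {
    static byte[] unparse(String value, boolean emitBom, ByteOrder byteOrder) throws IOException {
        Charset cs = Charset.forName(
                byteOrder == ByteOrder.BIG_ENDIAN ? "UTF-16BE" : "UTF-16LE");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (emitBom) {
            // The optional "BOM element", emitted only when the format calls for it.
            out.write(byteOrder == ByteOrder.BIG_ENDIAN
                    ? new byte[] {(byte) 0xFE, (byte) 0xFF}
                    : new byte[] {(byte) 0xFF, (byte) 0xFE});
        }
        out.write(value.getBytes(cs));
        return out.toByteArray();
    }
}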
…mike
Tel: 781-810-2100
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of Ian W Parkinson
Sent: Tuesday, June 24, 2008 9:26 AM
To: dfdl-wg@ogf.org
Subject: Re: [DFDL-WG] Required encodings and testing
Hi all,
I'd suggest that we only need worry about those character sets described at http://www.iana.org/assignments/character-sets.
Are the ones beginning "x-" specific to ICU? I think this would simplify
the matter of BOMs somewhat, as we wouldn't need to deal explicitly with
character sets that must have a
BOM (presumably the -BOM variants) and so make the 'spec-twister' a non-issue.
Unicode BOMs would remain a complex issue, though. If the schema specifies
encoding="UTF-16BE" or "UTF-16LE" then our behaviour is clear enough going by the
spec at http://www.ietf.org/rfc/rfc2781.txt - we never generate a BOM, and any BOM
encountered is treated as a character. If the schema specifies just "UTF-16" (in
which the BOM is strictly optional) then we'd honour any BOM at the top of the text
field, defaulting to the specified dfdl:byteOrder value. On unparse we can choose
whether or not to include a BOM - I'd suggest we always include a BOM and use
dfdl:byteOrder (*). If a particular schema needs to control this more explicitly
then they can use an expression to compute UTF-16BE or UTF-16LE as appropriate.
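As it happens, Java's built-in charsets already behave this way, which makes the two interpretations easy to demonstrate (shown purely as an illustration, not as a statement of what DFDL should mandate): "UTF-16" honours and strips a leading BOM, defaulting to big-endian when none is present, while "UTF-16LE"/"UTF-16BE" leave a leading U+FEFF in the decoded text as an ordinary character.

import java.nio.charset.Charset;

public class Utf16BomDemo {
    public static void main(String[] args) {
        byte[] leBytes = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};  // BOM + 'A', little-endian

        String viaUtf16 = new String(leBytes, Charset.forName("UTF-16"));
        System.out.println(viaUtf16.length() + " -> " + viaUtf16);   // 1 -> A (BOM consumed)

        String viaUtf16Le = new String(leBytes, Charset.forName("UTF-16LE"));
        System.out.println((int) viaUtf16Le.charAt(0));              // 65279, i.e. U+FEFF kept
    }
}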
That would leave the following edge-case: a schema which wants to generate
BOMless data so specifies (e.g.) UTF-16LE, but wants to tolerate and honour any
BOM present on parse. Do we need to deal with this unusual situation? It
perhaps could be handled through an optional hidden field, but would we want to
make it easier to achieve?
(*) the alternative would be to leave the byte order up to the implementation,
potentially allowing data to be output with the endianness in which it was
received. This may be beneficial in some situations but would leave the schema
author without a way to specify the byteOrder while still requiring a BOM to be
generated.
Cheers,
Ian
Ian Parkinson
WebSphere ESB Development
Mail Point 211,
From: "RPost" <rp0428@pacbell.net>
To: <dfdl-wg@ogf.org>
Date: 24/06/2008 01:58
Subject: [DFDL-WG] Required encodings and testing
Thanks for the response re encodings and issues. Very helpful.
I put my responses in the attachment but here is the first part about encoding.
Your response: We haven't picked a basic set that all conforming
implementations must support other than that UTF-8 and USASCII must be
supported. We might require more than this though.
That’s a relief!
The current spec mentions UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE.
Since Java 1.6 supports 160 encodings using 686 aliases I've no doubt you see
the reason for my question about which encodings need initial support.
ICU supports even more encodings and requiring some of these could implicitly
require implementors to support ICU. Not an issue if that is truly needed but
that requirement alone could dissuade some from participating in the project.
The encodings I have examined/tested so far are: US-ASCII, ISO-8859-1, UTF-8,
UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047, IBM500,
IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM.
I have not run across any issues with any of the above encodings.
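If anyone wants to reproduce those counts or check what their own JRE provides, something like the following works; the exact numbers and the availability of the x-UTF-…-BOM variants will vary by JDK version and vendor.

import java.nio.charset.Charset;
import java.util.SortedMap;

// Counts the installed charsets and their aliases, then checks the encodings listed above.
public class CharsetInventory {
    public static void main(String[] args) {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        int aliasCount = 0;
        for (Charset cs : all.values()) {
            aliasCount += cs.aliases().size();
        }
        System.out.println(all.size() + " charsets, " + aliasCount + " aliases");

        for (String name : new String[] {
                "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
                "UTF-32", "UTF-32BE", "UTF-32LE", "IBM1047", "IBM500", "IBM037",
                "x-UTF-16LE-BOM", "X-UTF-32BE-BOM", "X-UTF-32LE-BOM"}) {
            System.out.println(name + ": " + Charset.isSupported(name));
        }
    }
}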
ICU includes 175 UCM files of which 135 are for SBCS encodings. I have not
tested or examined all of these but would not expect them to be an issue
either.
Also not examined are the 27 UCM files for MBCS encodings. A brief review shows
that many of these should not be an issue.
BIG5 or GB18030 could definitely be an issue and there are several others like
these that might require a custom effort to support. Ok if you really need it
but better delayed initially if you don't.
Glad to know we don't need to visit these for the short term. I'm sure
implementers would much rather concentrate on the DFDL aspect of things rather
than become encoding experts. I know I would.
[1] CCSID stands for Coded Character Set ID, a numeric identifier for a code page.
TBD: cite relevant standard for CCSIDs here.
[2] The concept of native character
encoding is avoided in DFDL since a DFDL schema containing such a property
binding does not contain a complete description of data, but rather an
incomplete one which is parameterized by characteristics of the operating environment
where the DFDL processor executes. In DFDL this same behavior is achieved
through use of true parameterization, for example by use of Selectors to choose
among annotations specifying different character set encoding property
bindings.
[MJB1]Cite a standard for CCSID values in
the footnote.
[MJB2]We want this to be as small a set as possible. Can we get away with just UTF-8,
Also TBD: what aliases of the IANA names are required?
All of them? So, e.g., "Latin1" is accepted?
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg