Responses interspersed below.
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of RPost
Sent: Sunday, June 22, 2008 7:10 PM
To: dfdl-wg@ogf.org
Subject: [DFDL-WG] Initial list of Required encodings for DFDL version 1
Q - What is the current thinking for the character set encodings that MUST be implemented by a conforming DFDL processor for version 1?
We haven’t picked a basic set that all conforming implementations must support, other than that UTF-8 and US-ASCII must be supported. We might require more than this, though.
I have been performing tests with length-prefixed strings
and strings using terminators to see what issues affect the ability to detect
the boundaries between strings and binary data or terminator strings that
immediately follow the string.
For length-prefixed strings you need to be able to either decode the byte array and iterate the string character by character, or perform byte counting using only the byte stream and the bit ranges in the bytes themselves.
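For concreteness, a minimal Java sketch of the byte-counting approach for a prefix measured in bytes (the one-byte prefix and the trailing binary data here are illustrative assumptions, not a DFDL-defined layout):

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class LengthPrefixedString {
        // Read a string whose length prefix counts BYTES (assumed to be one
        // unsigned byte here); only that many bytes are decoded, so the
        // buffer position lands exactly on whatever follows the string.
        static String read(ByteBuffer in, Charset cs) {
            int byteLen = in.get() & 0xFF;  // hypothetical 1-byte prefix
            byte[] raw = new byte[byteLen];
            in.get(raw);                    // take exactly byteLen bytes
            return new String(raw, cs);     // decode only that slice
        }

        public static void main(String[] args) {
            byte[] data = {5, 'h', 'e', 'l', 'l', 'o', 0x01, 0x02};
            ByteBuffer buf = ByteBuffer.wrap(data);
            System.out.println(read(buf, StandardCharsets.UTF_8)); // hello
            // buf now sits on the binary data (0x01 0x02) after the string
        }
    }

When the prefix counts characters rather than bytes in a variable-width encoding, this shortcut is unavailable and one of the two approaches above (character iteration or byte counting) is required.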
Issue #1 – It will not be trivial to create all of the test cases needed to fully cover the corner cases for each encoding. Obviously, the fewer encodings that have to be supported initially, the better in terms of implementation effort.
I don’t understand your concern here. Yes, there are a few cases to test. E.g., length measured in bytes with a variable-width character set (like utf-8 or shift-JIS) means the number of characters is <= the number of bytes; length measured in characters with a variable-width character set means the number of bytes is >= the number of characters. In all other cases you don’t need to know anything other than the character set width. These test cases are easy to enumerate.
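A trivial Java check of the relationships enumerated above (the sample string is arbitrary):

    import java.nio.charset.StandardCharsets;

    public class WidthCases {
        public static void main(String[] args) {
            String s = "héllo"; // 5 characters; 'é' takes 2 bytes in UTF-8
            int chars = s.length();
            int bytes = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(chars + " chars, " + bytes + " bytes"); // 5 chars, 6 bytes
            // Variable-width: bytes >= chars. Fixed-width: bytes == chars * width.
        }
    }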
All the above must be supported unless we allow a conforming DFDL implementation to support only, say, single-byte US-ASCII.
Issue #2 – There is no current support for byte counting in Java or ICU. For encodings that are purely single-byte or fixed-width multi-byte, the end of the string can be found by examining the byte stream itself without performing character decoding. The available classes all perform conversions of entire buffers (or series of buffers), and they also consume large amounts of the byte stream.
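For what it's worth, java.nio's CharsetDecoder can be driven incrementally over small buffers rather than converting everything at once; a sketch, assuming BMP characters only (a supplementary character would need a 2-char output buffer):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    public class IncrementalDecode {
        public static void main(String[] args) {
            byte[] data = "abcé".getBytes(StandardCharsets.UTF_8); // 5 bytes, 4 chars
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
            ByteBuffer in = ByteBuffer.wrap(data);
            CharBuffer out = CharBuffer.allocate(1); // one character at a time
            while (in.hasRemaining()) {
                int before = in.position();
                CoderResult r = dec.decode(in, out, true);
                if (r.isError()) break;              // malformed input
                out.flip();
                System.out.println("'" + out.get() + "' consumed "
                        + (in.position() - before) + " byte(s)");
                out.clear();
            }
        }
    }

This still decodes, of course; it only limits how much of the byte stream is consumed per call.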
The above are not
limitations we can consider. Yes, data format support is inadequate in these
systems. That’s why we need a standard here, because it is too hard for
people to implement and they need the reassurance of a standard in order to
justify the investment.
For some encodings (e.g. UTF-8) an algorithmic process can
examine byte values and determine if a character consumes 1, 2 or more bytes.
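A sketch of that algorithmic process for UTF-8, where the lead byte alone determines the character width:

    public class Utf8ByteCount {
        // Width in bytes of the UTF-8 character starting with lead byte b,
        // read off the high bits of the lead byte alone (no decoding).
        static int utf8CharWidth(byte b) {
            int v = b & 0xFF;
            if (v < 0x80) return 1;  // 0xxxxxxx: ASCII
            if (v < 0xC0) return -1; // 10xxxxxx: continuation, not a lead byte
            if (v < 0xE0) return 2;  // 110xxxxx
            if (v < 0xF0) return 3;  // 1110xxxx
            if (v < 0xF8) return 4;  // 11110xxx
            return -1;               // invalid lead byte
        }

        // Count characters by hopping from lead byte to lead byte.
        static int countChars(byte[] bytes) {
            int chars = 0;
            for (int i = 0; i < bytes.length; ) {
                int w = utf8CharWidth(bytes[i]);
                if (w < 0) throw new IllegalArgumentException("malformed UTF-8 at offset " + i);
                i += w;
                chars++;
            }
            return chars;
        }
    }

(This sketch does not validate continuation bytes or reject overlong sequences; a production byte counter would.)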
Still other encodings will need custom processes written to either decode and iterate the string, or to use a specially designed table to perform byte counting.
This is true and I don’t
see this as a problem.
As with Issue #1, the fewer encodings needing special handling that must be supported initially, the fewer problems for implementers.
To me the minimum
interesting set of encodings is utf-8, usascii, ebcdic-cp-1, iso-8859-1,
utf-16BE, utf-16LE. Without these there is a huge amount you cannot do.
It’s unlikely we’ll get DFDL through standardization without also including the important international sets for both Europe (iso-8859-N for various N) and Asia (e.g., shift-JIS).
Issue #3 – Some encodings have multiple possible byte representations for the same character. If a terminator string is specified as ‘END’ in a DFDL property, it must be converted to the proper encoding when searching for it. The easiest way to do this is to encode it, convert the encoded value to a byte array, and then search the input stream byte array for a match. The binary file could include bytes that express one encoding of the character while the Java code is searching for the character using another byte representation.
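A minimal sketch of that encode-then-search approach (the search helper is illustrative; note the response below about false matches, and that in a multi-byte encoding a raw byte match can also land at a misaligned, mid-character offset):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class TerminatorSearch {
        // Naive byte-sequence search; returns the first match offset or -1.
        static int indexOf(byte[] data, byte[] pattern) {
            outer:
            for (int i = 0; i <= data.length - pattern.length; i++) {
                for (int j = 0; j < pattern.length; j++) {
                    if (data[i + j] != pattern[j]) continue outer;
                }
                return i;
            }
            return -1;
        }

        public static void main(String[] args) {
            Charset cs = StandardCharsets.UTF_16BE;
            byte[] term = "END".getBytes(cs); // terminator in the data's encoding
            byte[] data = "helloEND".getBytes(cs);
            System.out.println(indexOf(data, term)); // 10 (5 chars * 2 bytes)
        }
    }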
Careful. You can’t
just search the data for a pattern as you may get a false match on binary data.
DFDL does handle the above issues with its character entities system.
Q – Does the DFDL spec need to allow a terminator to
be specified as a hex byte array so that the exact byte sequence to search for
can be specified?
Yes. “foo%x66;bar” looks for the hex byte 66 after the “o” and before the “b”. Note that in a 2-byte character encoding one must put two bytes in here; the entity inserts only a single uninterpreted byte.
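A quick demonstration of that point: in UTF-16BE even the ASCII letter 'f' occupies two bytes, so a single raw-byte entity covers only half of a character position:

    import java.nio.charset.StandardCharsets;

    public class EntityWidth {
        public static void main(String[] args) {
            byte[] b = "f".getBytes(StandardCharsets.UTF_16BE);
            System.out.printf("%02x %02x%n", b[0], b[1]); // prints: 00 66
        }
    }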
Issue #4 – If a string can be specified as using one encoding and a terminator can use a different encoding, is it possible that the terminator byte sequence is also a valid string byte sequence even though the characters being represented are different? I haven’t been able to determine if this can happen.
This kind of thing gets discussed sometimes. It is simply an ambiguous concept unless there is some other way of knowing the length. If the terminator you mention is actually the delimiter that establishes the length of the string, then this concept is broken. If the terminator is just more data found after a, say, fixed-length string, then there is no problem here, as the DFDL system would know when to change encodings.
Q – Does the DFDL spec need to disallow different encodings for strings and terminators for version 1? Or are you confident that this corner case is unlikely to be an issue?
We have recently
discussed something called “variable markup” which can express all
these corner cases. We’ve decided that separate encoding control for
delimiters is too obscure. We allow case-sensitivity control for delimiters,
but anything beyond that uses variable markup.
I have been in contact with Addison Phillips, the current chair of the W3C Internationalization Core WG, and he ran into many of the above issues when implementing character set providers for webMethods (since acquired by Software AG). He also referred me to a contact at ICU and I hope to hear from them in the next week or two.
Meanwhile, any thoughts or suggestions you have on the above
would be appreciated.
While I am waiting for feedback from ICU and Addison, I am trying to determine an effective way to set up an automated test harness that can generate different combinations of strings, terminators, and encodings and perform volume testing. Mike suggested using the test example he provided, but it only showed one data string for input. That might be adequate for simple tests, but because test cases may need to be shared by multiple test XSD files, it may not be scalable for volume testing or testing of multiple cases.
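As a possible starting point, a hypothetical sketch of such a combination generator in Java (the particular strings, terminators, and encodings are placeholders):

    import java.nio.charset.Charset;
    import java.util.List;

    public class ComboGenerator {
        public static void main(String[] args) {
            List<String> strings = List.of("hello", "héllo", "");
            List<String> terminators = List.of("END", ";", "|");
            List<String> encodings =
                    List.of("UTF-8", "US-ASCII", "ISO-8859-1", "UTF-16BE");

            // Cross every string with every terminator and encoding under test.
            for (String enc : encodings) {
                Charset cs = Charset.forName(enc);
                for (String s : strings) {
                    for (String t : terminators) {
                        if (!cs.newEncoder().canEncode(s + t)) continue; // skip unencodable combos
                        byte[] data = (s + t).getBytes(cs);
                        System.out.printf("%-10s string=%-8s term=%-4s -> %d bytes%n",
                                enc, "\"" + s + "\"", "\"" + t + "\"", data.length);
                    }
                }
            }
        }
    }

Each emitted byte array could then be written to a data file and paired with a generated schema for the harness to run.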
Test cases need to be
shared by multiple XSD files? Can you explain this? A test case is a
combination of data and schema isn’t it?
…mike