I dunno, "errorWithFallback"
implies to me that both will happen; that we would report an error and
then try the fallback mapping. How about:
1) alwaysError
2) alwaysReplace
3) fallbackOrError
4) fallbackOrReplace
I'm not too bothered though so happy
to go with your names if we're all happy.
Cheers,
Andy
Andy Edwards - IBM Integration Bus - DFDL
Email: andy.edwards@uk.ibm.com
Snail Mail: MP211, Hursley Park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-mann in the NY Times
From: Steve Hanson/UK/IBM
To: Andrew Edwards/UK/IBM@IBMGB
Cc: Mike Beckerle <mbeckerle.dfdl@gmail.com>, DFDL-WG <dfdl-wg@ogf.org>
Date: 14/09/2015 12:03
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings
How about
1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "errorWithFallback"
4) Replace unmappable characters; fallbacks required => "replaceWithFallback"
As I understand it, fallback is only
applicable when unparsing (from Unicode to codepage), so I assume that
when parsing, "fallbackOrError" behaves like "error"
and "fallbackOrReplace" behaves like "replace", and
that we'd explicitly state in the spec that this is the case.
Correct.
Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Cc: Mike Beckerle <mbeckerle.dfdl@gmail.com>, DFDL-WG <dfdl-wg@ogf.org>
Date: 08/09/2015 16:51
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings
I'm in favour of extra enumerations
on dfdl:encodingErrorPolicy.
Could we be more verbose on the fallback
cases? So we'd have:
1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallbackOrError"
4) Replace unmappable characters; fallbacks required => "fallbackOrReplace"
As I understand it, fallback is only
applicable when unparsing (from Unicode to codepage), so I assume that
when parsing, "fallbackOrError" behaves like "error"
and "fallbackOrReplace" behaves like "replace", and
that we'd explicitly state in the spec that this is the case.
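To make the four values concrete, here is a minimal Python sketch of how they might behave. Everything in it is hypothetical: the toy mapping tables, the 0x1A substitution byte and the function names exist only for illustration and are not taken from the spec or from any implementation.

```python
# Toy single-byte encoding. The tables below are made up for illustration.
ROUNDTRIP = {"A": b"\x41", "B": b"\x42"}   # normal round-trip mappings
FALLBACK = {"\uff21": b"\x41"}             # fullwidth "A" falls back to "A"
SUBSTITUTE = b"\x1a"                       # assumed substitution byte
REVERSE = {v: k for k, v in ROUNDTRIP.items()}

POLICIES = {"error", "replace", "fallbackOrError", "fallbackOrReplace"}


def unparse_char(ch: str, policy: str) -> bytes:
    """Unicode -> codepage: the only direction where fallbacks apply."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if ch in ROUNDTRIP:
        return ROUNDTRIP[ch]
    # Fallback mappings are consulted only by the two fallbackOr... values.
    if policy.startswith("fallbackOr") and ch in FALLBACK:
        return FALLBACK[ch]
    if policy in ("replace", "fallbackOrReplace"):
        return SUBSTITUTE
    raise UnicodeEncodeError("toy", ch, 0, 1, "unmappable character")


def parse_byte(b: bytes, policy: str) -> str:
    """Codepage -> Unicode: fallbackOrX is assumed to behave exactly like X."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if b in REVERSE:
        return REVERSE[b]
    if policy in ("replace", "fallbackOrReplace"):
        return "\ufffd"
    raise UnicodeDecodeError("toy", b, 0, 1, "unmappable byte")
```

With these tables, unparsing the fullwidth "A" gives 0x41 under either fallbackOr... value, the substitution byte under "replace", and an error under "error".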
Cheers,
Andy
From: Steve Hanson/UK/IBM@IBMGB
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 27/08/2015 09:51
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings
Sent by: dfdl-wg-bounces@ogf.org
It's obviously less disruptive to the
DFDL spec to add extra enums to dfdl:encodingErrorPolicy. My concern
in doing that is the orthogonality of substitution characters (an error
has occurred) and fallbacks (defined mappings for a purpose). So let's
look at the scenarios we need to support and see if that can generate a
set of reasonably natural enums:
1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallback"
4) Replace unmappable characters; fallbacks required => "fallbackOrReplace"
I think two new enums are needed as one IBM product that uses IBM DFDL
said it wanted fallback but not substitution.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 26/08/2015 14:32
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings
Or... perhaps dfdl:encodingErrorPolicy="replaceOrFallback", that
is, perhaps we can just add another enum value to reflect this policy rather
than adding more properties.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF
Intellectual Property Policy
On Tue, Aug 25, 2015 at 10:56 AM, Mike Beckerle <mbeckerle.dfdl@gmail.com>
wrote:
Would an IBM-specific property, to be proposed for future inclusion in
DFDL, work? E.g., something like
ibmdfdl:encodingErrorFallbackPolicy="never" or "fallback"
with other enums reserved for the future.
I would like to pave a path for these sorts of proposed features. It would
be good to see if this alone is sufficient to meet your customer's needs
who are asking for this, or whether they will need even a bit more control
than this.
It looks like we just missed some unparse behavior in dfdl:encodingErrorPolicy="replace",
as clearly when a Unicode character has no mapping, and the target encoding
is SBCS and ASCII-derived, then the 0x1A character is the right thing.
However, I know that Daffodil will do whatever the standard ICU library
does with its default mapping definitions, and I don't know whether the
0x1A substitution character is properly used in those mappings.
On Tue, Aug 25, 2015 at 9:29 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
Today the DFDL 1.0 spec has the property dfdl:encodingErrorPolicy to control
what happens when an unmappable or malformed character is encountered:
'error' or 'replace'. When it is 'replace', the appropriate substitution
character is used.
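As a rough analogy only, Python's built-in codec error handlers show the same 'error' versus 'replace' split. This is a sketch, not DFDL or ICU behaviour: the two helper functions and the policy-to-handler table are hypothetical, and Python substitutes "?" when encoding rather than an encoding-specific substitution character.

```python
# Map the two DFDL policy names onto Python's codec error handlers.
# The mapping is an illustration, not something either spec defines.
HANDLERS = {"error": "strict", "replace": "replace"}


def decode(data: bytes, policy: str) -> str:
    """Bytes -> text; a malformed byte either raises or becomes U+FFFD."""
    return data.decode("ascii", errors=HANDLERS[policy])


def encode(text: str, policy: str) -> bytes:
    """Text -> bytes; an unmappable character either raises or becomes b'?'."""
    return text.encode("ascii", errors=HANDLERS[policy])
```

Under "replace", decoding the malformed byte 0xFF yields U+FFFD, while encoding the unmappable character U+20AC yields "?". Under "error", both calls raise.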
There is also the orthogonal question of fallback mappings, which are mappings
specified by an encoding that are not normal round-trip mappings. DFDL
does not currently provide for switching on fallback mappings. Here's what
ICU says about this at http://userguide.icu-project.org/conversion/data.
In the CHARMAP section of a .ucm file, each line contains a Unicode code
point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage
character byte sequence (each byte like \xhh (2 hexadecimal digits) ),
and an optional "precision" or "fallback" indicator.
The precision indicator either must be present in all mappings or in none
of them. The indicator is a pipe symbol ‘|’ followed by a 0, 1, 2, 3,
or 4 that has the following meaning:
- |0 - A "normal", roundtrip mapping from a Unicode code point and back.
- |1 - A "fallback" mapping only from Unicode to the codepage, but not back.
- |2 - A subchar1 mapping. The code point is unmappable, and if a substitution is performed, then the subchar1 should be used rather than the subchar. Otherwise, such mappings are ignored.
- |3 - A "reverse fallback" mapping only from the codepage to Unicode, but not back to the codepage.
- |4 - A "good one-way" mapping only from Unicode to the codepage, but not back.
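The line format described above can be sketched as a small Python parser. This is an illustration under assumptions: it handles only lines with a single code point and \xhh bytes written back to back, it tolerates a trailing "#" comment, and the names Mapping and parse_line are hypothetical, not part of ICU.

```python
import re
from typing import NamedTuple, Optional

# Meaning of the precision indicator, as listed in the ICU text above.
PRECISION = {
    0: "roundtrip",
    1: "fallback",
    2: "subchar1",
    3: "reverse fallback",
    4: "good one-way",
}

# <Uxxxx>, one or more \xhh bytes, an optional |n flag, an optional comment.
LINE = re.compile(
    r"^<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)"
    r"(?:\s+\|([0-4]))?\s*(?:#.*)?$"
)


class Mapping(NamedTuple):
    code_point: int
    data: bytes
    precision: Optional[int]  # None when the file carries no indicators


def parse_line(line: str) -> Mapping:
    """Parse one simple CHARMAP line into (code point, bytes, precision)."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a simple CHARMAP line: {line!r}")
    code_point = int(m.group(1), 16)
    hex_bytes = re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
    data = bytes(int(h, 16) for h in hex_bytes)
    precision = int(m.group(3)) if m.group(3) is not None else None
    return Mapping(code_point, data, precision)
```

For example, parse_line(r"<UFF21> \x41 |1") returns a Mapping whose precision is 1, which PRECISION labels as a "fallback". That sample line is made up and is not taken from a real .ucm file.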
Fallback mappings from Unicode typically do not map codes for the same character, but for
"similar" ones. This mapping is sometimes done if a character
exists in Unicode but not in the codepage. To replace it, ICU maps a codepage
code to a similar-looking code for human-readable output. This mapping
feature is not useful for text data transmission especially in markup languages
where a Unicode code point can be escaped with its code point value. The
ICU application programming interface (API) ucnv_setFallback()
controls this fallback behavior.
"Reverse fallbacks" are technically similar,
but the same Unicode character can be encoded twice in the codepage. ICU
always uses reverse fallbacks at runtime.
A subset of the fallback mappings from Unicode is always
used at runtime: Those that map private-use Unicode code points. Fallbacks
from private-use code points are often introduced as replacements for previous
roundtrip mappings for the same pair of codes. These replacements are used
when a Unicode version assigns a new character that was previously mapped
to that private-use code point. The mapping table is then changed to map
the same codepage byte sequence to the new Unicode code point (as a new
roundtrip) and the mapping from the old private-use code point to the same
codepage code is preserved as a fallback.
A "good one-way" mapping is like a fallback,
but ICU always uses "good one-way" mappings at runtime, regardless
of the fallback API flag.
The idea is that fallbacks normally lose information,
such as mapping from a compatibility variant of a letter to the ASCII version;
however, fallbacks from PUA and reverse fallbacks are assumed to be for
"the same character", just an older code for it.
So the default behaviour for ICU is to use "good one-way" mappings,
"reverse fallback" mappings, and "fallback" mappings
from private-use-area code points, but only to use normal "fallback"
mappings if the setFallback API has been used.
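The default behaviour summarised above can be written down as a small decision function. This is a sketch of the rules as described in this message, not ICU code: the function names, the direction strings and the fallback_enabled parameter (standing in for the setFallback API flag) are hypothetical.

```python
import unicodedata


def is_private_use(code_point: int) -> bool:
    """True for private-use code points (Unicode general category Co)."""
    return unicodedata.category(chr(code_point)) == "Co"


def mapping_is_used(precision: int, direction: str, code_point: int,
                    fallback_enabled: bool = False) -> bool:
    """Decide whether a mapping with the given |n flag applies at runtime.

    direction is "from_unicode" (unparse) or "to_unicode" (parse).
    """
    if direction not in ("from_unicode", "to_unicode"):
        raise ValueError(f"unknown direction: {direction}")
    if precision == 0:
        return True                          # roundtrip: both directions
    if precision == 3:
        return direction == "to_unicode"     # reverse fallback: always used
    if direction != "from_unicode":
        return False                         # |1, |2, |4 never apply to parsing
    if precision == 4:
        return True                          # good one-way: always used
    if precision == 1:
        # Normal fallbacks need the flag; private-use fallbacks do not.
        return fallback_enabled or is_private_use(code_point)
    return False                             # |2 only selects subchar1
```

The one case the flag changes is a |1 mapping from a code point outside the private-use area, which is the case customers are asking to be able to switch on.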
IBM customers have requested the ability to use normal "fallback"
mappings. At the current time, the only solution open to them is to change
the .ucm file (or create a variant) and change the "|1" mappings
to "|4" so that "fallback" mappings become "good
one-way" mappings.
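For illustration, the workaround of editing the .ucm file could be scripted along these lines. This is a sketch under assumptions: it expects the mappings to sit between "CHARMAP" and "END CHARMAP" lines, it only touches simple single-code-point lines, and promote_fallbacks is a hypothetical helper name. The result would still need to be built into a converter with the usual ICU tooling.

```python
import re

# A simple mapping line that ends in the |1 flag; group 1 is everything
# up to the flag, so a "|1" inside a trailing comment is left untouched.
FALLBACK_LINE = re.compile(
    r"^(\s*<U[0-9A-Fa-f]{1,6}>\s+(?:\\x[0-9A-Fa-f]{2})+\s+)\|1(?![0-9])"
)


def promote_fallbacks(ucm_text: str) -> str:
    """Return the .ucm text with every |1 mapping in CHARMAP changed to |4."""
    out = []
    in_charmap = False
    for line in ucm_text.splitlines():
        stripped = line.strip()
        if stripped == "CHARMAP":
            in_charmap = True
        elif stripped == "END CHARMAP":
            in_charmap = False
        elif in_charmap:
            line = FALLBACK_LINE.sub(r"\g<1>|4", line)
        out.append(line)
    return "\n".join(out) + "\n"
```

Writing the output to a variant file rather than overwriting the original keeps the standard converter available alongside the modified one.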
A proposal to support fallbacks was submitted a few years ago by Mike.
https://www.ogf.org/pipermail/dfdl-wg/2011-November/001631.html.
It proposed adding new DFDL annotations to allow replacement characters
and fallback mappings to be specified. This was rejected as ICU already
provides this via the .ucm file. But no simpler alternative materialised,
and the resulting erratum only added dfdl:encodingErrorPolicy, which does
not handle fallbacks.
Given a) the precedent of existing IBM DFDL and Daffodil behaviour which
(should) match the ICU default, b) the orthogonality of substitution characters
(an error has occurred) and fallbacks (defined mappings for a purpose),
and c) an IBM recommendation not to switch on fallbacks by default, it
feels like we need a new property, e.g. dfdl:useEncodingFallbacks 'yes'
| 'no'. Alternatives welcome. The names dfdl:encodingFallbackPolicy
or dfdl:encodingPrecisionPolicy are better, but then comes the problem
of finding meaningful enum values...
Also noted: The wording for dfdl:encodingErrorPolicy 'replace' says: "If
'replace' then any error when decoding characters results in the insertion
of the Unicode Replacement Character (U+FFFD) as the replacement for that
error." That is not strictly true, as the same ICU page says:
- Conversion from a codepage to Unicode occurs and an unassigned codepoint is found
  1. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage [in the .ucm file], output U+001A
  2. Otherwise output U+FFFD
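The two-step ICU rule quoted above is small enough to sketch directly in Python. The function name and its parameters are hypothetical; subchar1 is passed as None when the codepage does not define one.

```python
from typing import Optional


def substitute_for_unassigned(sequence: bytes, subchar1: Optional[bytes]) -> str:
    """Pick the Unicode output for an unassigned codepage byte sequence."""
    # Rule 1: a single-byte input, in a codepage that specifies subchar1.
    if len(sequence) == 1 and subchar1 is not None:
        return "\u001a"
    # Rule 2: everything else gets the Unicode Replacement Character.
    return "\ufffd"
```

So a single unassigned byte decodes to U+001A rather than U+FFFD whenever the codepage specifies subchar1, which is the case the current spec wording does not cover.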
There is then the question of how the two properties interact. Specifically,
if fallbacks are not being used, does encountering a code point with a
fallback result in dfdl:encodingErrorPolicy coming into play? I suspect
so, but it needs verifying.
Regards
Steve Hanson
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg