September 2015 - dfdl-wg

Fw: Action 280 minOccurs='0' choice branch (was: Re: OCK expression and count of 0 for a choice member....)
by Steve Hanson 18 Sep '15

Alex has confirmed that the solution below is acceptable, so we should be able to close this action on the next call.

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 18/09/2015 12:58 -----

From: Steve Hanson/UK/IBM
To: Alex Wood1/UK/IBM@IBMGB
Cc: Andrew Edwards/UK/IBM@IBMGB
Date: 14/09/2015 12:07
Subject: Fw: [DFDL-WG] Action 280 minOccurs='0' choice branch (was: Re: OCK expression and count of 0 for a choice member....)

Hi Alex

Mike is good with the proposal below. Are you also happy with it, as you raised the original issue?

Regards
Steve Hanson

----- Forwarded by Steve Hanson/UK/IBM on 25/08/2015 17:52 -----

From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: "dfdl-wg(a)ogf.org" <dfdl-wg(a)ogf.org>
Date: 25/08/2015 10:24
Subject: Re: [DFDL-WG] Action 280 minOccurs='0' choice branch (was: Re: OCK expression and count of 0 for a choice member....)

My thoughts on this...

The existing choice branch rule that says minOccurs must not be 0 should remain, for consistency with not allowing minOccurs 0 on the choice itself.

Choice branch with dfdl:occursCountKind 'expression' should be allowed. If the expression resolves to 0 then there are no occurrences and the branch is missing, so the parser looks for the next branch. This preserves the rule that a branch must exist.

Choice branch with dfdl:occursCountKind 'parsed' should be allowed. If the parser does not find any occurrences then the branch is missing, so the parser looks for the next branch. This preserves the rule that a branch must exist.

dfdl:inputValueCalc on a choice branch should be allowed. If the parser reaches such a branch, it discriminates the choice and no further branches are examined.
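
To make the proposal above concrete, here is a minimal Python sketch of how a parser might pick a branch under those rules. It is an illustration only: the Branch class, the "ivc" label and the occurrences field are invented for this example and are not taken from the DFDL specification, IBM DFDL or Daffodil. It also ignores processing errors inside a branch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Branch:
    """Hypothetical model of the root element of a choice branch."""
    name: str
    kind: str             # "expression", "parsed", or "ivc" (dfdl:inputValueCalc)
    occurrences: int = 0  # count the expression gave, or occurrences the parser found


def resolve_choice(branches: List[Branch]) -> str:
    """Return the name of the branch the choice resolves to."""
    for branch in branches:
        if branch.kind == "ivc":
            # Reaching an inputValueCalc branch discriminates the choice.
            return branch.name
        if branch.occurrences > 0:
            return branch.name
        # Zero occurrences: the branch is missing, so look at the next one.
    # The rule that a branch must exist is preserved.
    raise ValueError("processing error: no choice branch exists")


# Alex's scenario: myInt has a count of 0, so the choice resolves to myTxt.
print(resolve_choice([Branch("myInt", "expression", 0),
                      Branch("myTxt", "parsed", 1)]))
```

Under this reading an inputValueCalc branch only makes sense as the last branch, since nothing after it can ever be reached.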
Regards
Steve Hanson

From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: "dfdl-wg(a)ogf.org" <dfdl-wg(a)ogf.org>
Date: 11/08/2015 15:58
Subject: Re: [DFDL-WG] Action 280 minOccurs='0' choice branch (was: Re: OCK expression and count of 0 for a choice member....)

I may have thought of the reason. If I have a choice of A and B, then minOccurs=0 for B allows the choice to be empty: A|B? is the same as (A|B)?, which allows the choice itself to be minOccurs=0, and that is not allowed.

Regards
Steve Hanson

From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: "dfdl-wg(a)ogf.org" <dfdl-wg(a)ogf.org>
Date: 18/06/2015 10:49
Subject: Re: [DFDL-WG] Action 280 minOccurs='0' choice branch (was: Re: OCK expression and count of 0 for a choice member....)

Hi Mike

I think the restriction of having minOccurs >= 1 on an xs:choice branch arose for two reasons, though I am unable to find a definitive email trail:

a) If minOccurs = 0 you immediately have two points of uncertainty, so potentially two discriminators are needed. I'm not sure if this is really a problem though, because if minOccurs < maxOccurs there are also two points of uncertainty and it still requires some thought to get discrimination correct as it varies per occurrence.

b) Interaction with known-to-exist rules. For example, one way to achieve known-to-exist is to successfully parse an empty representation, which with minOccurs = 0 may mean that nothing is added to the infoset. I'm not sure this is actually a problem though. If the branch was successfully parsed then surely that should discriminate in favour of the branch regardless of representation.
And even if a) and b) are problematic, the fact remains that you can trivially negate the restriction by wrapping in xs:sequence. So I suspect we can drop the restriction altogether, and the 'system' just works in a consistent manner.

You raised the issue of an element with dfdl:inputValueCalc not being allowed as a choice branch. I suspect this was added because as soon as you encounter such a branch you have by definition discriminated in favour of that branch. But that's ok, you just make that branch the last in the choice. No different to having a branch that exists just to throw an error - it too must be last. If such branches are not last, it's a schema design bug.

Back to Alex's original scenario at the foot of this thread, where his xs:choice branch element had a dfdl:occursCount expression that evaluated to 0. According to https://redmine.ogf.org/issues/244 no occurrences are looked for in the data. That means the occurrences are missing, so known-not-to-exist and the parser should try the next branch.

Below I said that section 15.1.1 needed updating to correctly reflect section 9. And I also said we are perhaps missing a definition of what 'missing' means for an array element:

"(The) spec defines known-to-exist and known-not-to-exist in terms of occurrences. In (Alex's) choice branch example, it is the element as a whole we are looking at. That's fine for scalar as element == occurrence but for an array it's not the same. I think the spec is missing a definition of what 'missing' means for an array element. I would say that an array element is missing if all occurrences are missing. And an array element is not missing if any occurrence has a representation (empty, nil, normal)."
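
The suggested definition of 'missing' for an array element can be written down in a few lines of Python. This is a sketch of the wording quoted above, not spec text; the Rep enum and the function name are made up for the illustration.

```python
from enum import Enum
from typing import Iterable


class Rep(Enum):
    """Representation of a single occurrence (labels follow the quoted text)."""
    MISSING = "missing"
    EMPTY = "empty"
    NIL = "nil"
    NORMAL = "normal"


def array_element_missing(occurrences: Iterable[Rep]) -> bool:
    """An array element is missing if all of its occurrences are missing."""
    return all(rep is Rep.MISSING for rep in occurrences)


# An occursCount of 0 means no occurrences are looked for at all.
print(array_element_missing([]))                       # True
print(array_element_missing([Rep.MISSING, Rep.EMPTY])) # False
```

Note that with zero occurrences the condition holds vacuously, which matches the reading that a count of 0 makes the branch known-not-to-exist.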
Regards
Steve Hanson

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: "dfdl-wg(a)ogf.org" <dfdl-wg(a)ogf.org>
Date: 02/06/2015 18:41
Subject: [DFDL-WG] Action 280 minOccurs='0' choice branch (was: Re: OCK expression and count of 0 for a choice member....)
Sent by: dfdl-wg-bounces(a)ogf.org

I believe this action item remains open and I would like to revive the discussion. I was coding up this aspect of Daffodil and have hit this subject head on.

In section 15 the spec clearly states that the root of a choice branch cannot be optional, that is, cannot have minOccurs="0".

That language is very specific, and it leaves open the possibility of "effectively optional" things being the roots of choice branches (e.g., using OCK 'parsed' or 'expression'). It also allows one to trivially wrap a sequence (having no delimiters, alignment or skips) around an element (or element ref) carrying minOccurs="0" so as to simply dodge the restriction.

It was observed in the thread below that we cannot require choice branches to be scalar elements as there is a need for hidden groups to be branches of choices, and for empty sequences carrying only asserts, as another non-element example.

Related: the DFDL spec also specifies that an element that is the root of a choice branch cannot carry dfdl:inputValueCalc. The spec does NOT restrict use of dfdl:outputValueCalc on the root of a choice branch, but the meaning of such is unclear to me.

The existing restriction of no minOccurs="0" on the root of a choice branch seems not to accomplish anything. It is only for occursCountKind='implicit' where this can be meaningful, it seems. Requiring the root of a choice branch to not be "variable occurrence" if it is an element would accomplish something, but it is not clear this is needed to eliminate ambiguity or if the ambiguity can be eliminated without any restriction.
The stable design points I can think of are:

1) root of a choice branch must be scalar (so, only a sequence, choice, or an element where minOccurs == maxOccurs == 1.)

2) root of a choice branch cannot be optional - for a broad sense of the word optional - precludes arrays with OCK expression and parsed, and implicit if minOccurs="0". Fixed length arrays would be allowed.

3) a choice branch must have some syntax

I think we discarded (3) because choice branches that really just reflect error checking - contain only dfdl:asserts for example - are in use and serve a useful purpose.

Daffodil's test suite has much use of choice branches that look like this:

    <choice>
      .....
      <sequence>
        <element name="foo" dfdl:inputValueCalc="{....}"/>
      </sequence>
    </choice>

These have no syntax. This allows a kind of default element to be computed. In most (could be all, I've not searched exhaustively) of these cases the IVC expression is a constant. But note that the sequence wrapped around the IVC element is just dodging the restriction that a choice branch cannot be an IVC element (which is another restriction that seems unnecessary.)

...mike

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy

On Mon, Apr 27, 2015 at 9:30 AM, Steve Hanson <smh(a)uk.ibm.com> wrote:

Mike

A couple of comments:

1) You said below

    Optional here means "not required by the DFDL format", as in occursCountKind cannot be 'parsed' at all, because all occurrences are then not required, and the min/maxOccurs are only examined for validation purposes, also occursCountKind cannot be 'implicit' for the same reasons, and occursCountKind 'expression' also.

OccursCountKind 'implicit' is allowed, because minOccurs is used for parsing and minOccurs cannot be 0.
2) You said below

    Wrapping the array element in a sequence doesn't solve the problem unless the sequence has a required piece of syntax such as an initiator or terminator, or a hiddenGroupRef to a not-optional (recursively) thing.

A sequence has minOccurs '1' so it does satisfy the spec rule about the child of a choice being required. Such a sequence could have no syntax and could contain an element with minOccurs '0' or even be empty. I have seen DFDL schemas that contain a choice with the last branch being an empty sequence that contains an assert fn:false() in order to throw a processing error.

Regards
Steve Hanson

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Alex Wood1/UK/IBM@IBMGB
Cc: "dfdl-wg(a)ogf.org" <dfdl-wg(a)ogf.org>
Date: 27/04/2015 13:35
Subject: Re: [DFDL-WG] OCK expression and count of 0 for a choice member....
Sent by: dfdl-wg-bounces(a)ogf.org

I believe any use of occursCountKind 'expression' on an element that is the first element on a branch of a choice should be an SDE.

This is one of the cases where DFDL requires one to introduce an element that would not be necessary in an ordinary XML schema, but is necessary because DFDL does not have XML's easily parsed syntax to depend on.

This is my opinion. I think we need to look at whether this restriction is either

(a) necessary
(b) necessary to avoid excessive complexity in implementations
(c) unnecessary - but is the intention of what is specified already (despite shortcomings of the prose/description in the spec, which could be corrected.)
(d) an error in the specification

Mike Beckerle

On Mon, Apr 27, 2015 at 5:49 AM, Alex Wood1 <WOODA(a)uk.ibm.com> wrote:

Hi Mike,

Can you clarify whether you are saying that OCK expression should be prohibited completely on a choice member (as occurrences for OCK expression are potentially optional regardless of minOccurs value)? Or is your statement that it should cause an SDE specific to the count==0 case?

Kind Regards,
- Alex

Alex Wood - Software Engineer - WebSphere Message Broker Development
DFDL Development
MP 211, IBM UK Labs, Hursley Park, Winchester, Hants. SO21 2JN.
Tel: Internal 246272, External 01962 816272
Notes: Alex Wood1/UK/IBM@IBMGB
e-mail: wooda(a)uk.ibm.com

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Alex Wood1/UK/IBM@IBMGB
Date: 24/04/2015 15:10
Subject: Re: [DFDL-WG] OCK expression and count of 0 for a choice member....

I think this is an SDE. Choice branches cannot be optional.

Optional here does not mean minOccurs == 0, because for many occursCountKinds that's never checked unless validation is on, and validation doesn't guide parsing anyway.

Optional here means "not required by the DFDL format", as in occursCountKind cannot be 'parsed' at all, because all occurrences are then not required, and the min/maxOccurs are only examined for validation purposes, also occursCountKind cannot be 'implicit' for the same reasons, and occursCountKind 'expression' also.

Wrapping the array element in a sequence doesn't solve the problem unless the sequence has a required piece of syntax such as an initiator or terminator, or a hiddenGroupRef to a not-optional (recursively) thing.

Even initiator and terminator are tricky, because in a non-delimited format, those can be %WSP*; which can match nothing at all; hence, they do not "require" any syntax.
Mike Beckerle

On Fri, Apr 24, 2015 at 9:07 AM, Alex Wood1 <WOODA(a)uk.ibm.com> wrote:

Hi All,

Please see below for a history of the issue. This arose from fuzz testing of the IBM DFDL parser, which produced a test with a count of 0 for an OCK expression array that was a choice member, and from subsequent reference to the specification.

It was not clear what the correct outcome should be in a choice where the first member is an array using OCK expression where the count resolves to 0.

a.) resolve the choice to the zero length array
b.) move to the next choice branch
c.) throw an error

Kind Regards,
- Alex

From: Steve Hanson/UK/IBM
To: Alex Wood1/UK/IBM@IBMGB
Cc: Andrew Edwards/UK/IBM@IBMGB, Mark Frost/UK/IBM
Date: 24/04/2015 09:19
Subject: Re: OCK expression and count of 0 for a choice member....

When I wrote the paragraph below, the one thing that troubled me was that the spec defines known-to-exist and known-not-to-exist in terms of occurrences. In the choice branch example, it is the element as a whole we are looking at. That's fine for scalar as element == occurrence but for an array it's not the same. I think the spec is missing a definition of what 'missing' means for an array element. I would say that an array element is missing if all occurrences are missing. And an array element is not missing if any occurrence has a representation (empty, nil, normal). With that in place, my paragraph makes sense, I think.

I believe we have the same issue with 'parsed' and 'stopValue'.
Regards
Steve Hanson

From: Steve Hanson/UK/IBM
To: Alex Wood1/UK/IBM@IBMGB
Cc: Andrew Edwards/UK/IBM@IBMGB, Mark Frost/UK/IBM@IBMGB
Date: 23/04/2015 18:52
Subject: Re: OCK expression and count of 0 for a choice member....

Here is one interpretation...

A choice is resolved by parsing the branches until one is known-to-exist as described in section 9.3.3. Section 9.3.1.2 defines known-to-exist (in the absence of a discriminator, initiator or direct dispatch) as an occurrence having empty, nil or normal representation. Section 9.3.1.3 defines known-not-to-exist (again in the absence of a discriminator, initiator or direct dispatch, or an assert) as an occurrence being missing or causing a processing error.

If occursCount is zero no occurrences are looked for in the data (erratum 5.9) so the element has no representation and must be missing. Therefore a choice branch containing such an element is known-not-to-exist.

So in your example, the first choice branch containing myInt is known-not-to-exist and the parser tries the next branch.

This appears to contradict section 15.1.1 though. I suspect that 15.1.1 was not updated to match section 9.3 when the latter was added.

If you want to make the first choice branch known-to-exist when the count is zero then I think wrapping myInt in a sequence would work. Or wrapping myInt in a complex element.

Definitely one to take to the WG though, if only to correct section 15.1.1 to match section 9.

Regards
Steve Hanson

From: Alex Wood1/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Cc: Andrew Edwards/UK/IBM@IBMGB, Mark Frost/UK/IBM@IBMGB
Date: 23/04/2015 16:33
Subject: OCK expression and count of 0 for a choice member....

Hi Steve

Just been discussing this with Andy and Mark.
I think the spec

    <xs:element name="Choice_Expression" dfdl:ref="config" dfdl:lengthKind="implicit">
      <xs:complexType>
        <xs:sequence dfdl:ref="config">
          <xs:element ref="myCount"></xs:element>
          <xs:choice dfdl:choiceLengthKind="implicit" dfdl:ref="config">
            <xs:element ref="myInt" minOccurs="1" maxOccurs="3"></xs:element>
            <xs:element ref="myTxt"></xs:element>
          </xs:choice>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

Where myInt has occursCountKind="expression" occursCount="{../myCount}"

A given instance of this message could have myCount==0

Is this valid? Should it resolve to 0 occurrences of myInt or move on to myTxt?

Section 15 of the spec says:

    The Root of the Branch MUST NOT be optional. That is XSDL minOccurs MUST BE greater than 0.

But in this case minOccurs is >0.

Assuming this is not an error then in terms of resolving the choice section 15.1.1 says..

    15.1.1 Resolving Choices via Speculation

    Speculative resolution works as follows:

    1) Attempt to parse the first branch of the choice.
    2) If this fails with a processing error
       a) If a dfdl:discriminator evaluated to true earlier on this branch then the parser is 'bound' to this branch and parsing of the entire choice construct fails with a processing error.
       b) If the branch has a dfdl:initiator and the choice has dfdl:initiatedContent ‘yes’ then the parser is 'bound' to this branch and parsing of the entire choice construct fails with a processing error.
       c) Otherwise we repeat from step 1 for the next branch of the choice.
    3) It is a processing error if the branches of the choice are exhausted.
    4) If a branch is successfully parsed without error, then that branch's infoset becomes the infoset for the parse of the choice construct.

So seems like this is 4.) we did not fail to parse myInt...

However, talking with Mark about real scenarios that this might apply to, consider a choice of two repeating fields with counts earlier in the data, only one of which must appear.
You'd expect 0 of the first to mean >0 of the second and vice versa... So you'd probably want 0 myInt to allow the choice to resolve to myTxt.

Thoughts? If you agree we need more clarity in the spec I will forward to the WG.

Kind Regards,
- Alex

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
dfdl-wg mailing list
dfdl-wg(a)ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
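
The speculative resolution steps quoted from section 15.1.1 in Alex's original message can be sketched in Python as follows. This is a rough model under stated assumptions: each branch is a hypothetical callable that returns its infoset or raises ProcessingError, and the bound flag stands in for the discriminator and initiatedContent cases of steps 2a and 2b. None of these names come from the spec or from any implementation.

```python
from typing import Callable, Sequence


class ProcessingError(Exception):
    """Hypothetical processing error; bound=True models steps 2a and 2b."""

    def __init__(self, message: str, bound: bool = False):
        super().__init__(message)
        self.bound = bound


def resolve_by_speculation(branches: Sequence[Callable[[], list]]) -> list:
    for parse_branch in branches:      # step 1, then step 2c for later branches
        try:
            return parse_branch()      # step 4: this infoset becomes the choice's
        except ProcessingError as err:
            if err.bound:              # steps 2a and 2b: the whole choice fails
                raise
    raise ProcessingError("choice branches exhausted")  # step 3


def my_int_with_count_zero() -> list:
    # Zero occurrences requested, so nothing is parsed and nothing fails.
    return []


def my_txt() -> list:
    return ["myTxt"]


print(resolve_by_speculation([my_int_with_count_zero, my_txt]))  # prints []
```

Read literally, the steps resolve the choice to the empty myInt branch because it did not fail, which is exactly the outcome the thread questions.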

Re: [DFDL-WG] Action 283: Provision for fallback mappings
by Andrew Edwards 14 Sep '15

I dunno, "errorWithFallback" implies to me that both will happen; that we would report an error and then try the fallback mapping. How about:

1) alwaysError
2) alwaysReplace
3) fallbackOrError
4) fallbackOrReplace

I'm not too bothered though, so happy to go with your names if we're all happy.

Cheers,
Andy

Andy Edwards - IBM Integration Bus - DFDL
Email: andy.edwards(a)uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17

The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-Mann in the NY Times

From: Steve Hanson/UK/IBM
To: Andrew Edwards/UK/IBM@IBMGB
Cc: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>, DFDL-WG <dfdl-wg(a)ogf.org>
Date: 14/09/2015 12:03
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings

How about

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "errorWithFallback"
4) Replace unmappable characters; fallbacks required => "replaceWithFallback"

    As I understand it, fallback is only applicable when unparsing (from Unicode to codepage). I assume that in this case "fallbackOrError" behaves like "error" and "fallbackOrReplace" behaves like "replace" and that we'd explicitly state in the spec that this is the case.

Correct.

Regards
Steve Hanson

From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Cc: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>, DFDL-WG <dfdl-wg(a)ogf.org>
Date: 08/09/2015 16:51
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings

I'm in favour of extra enumerations on dfdl:encodingErrorPolicy. Could we be more verbose on the fallback cases?
So we'd have:

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallbackOrError"
4) Replace unmappable characters; fallbacks required => "fallbackOrReplace"

As I understand it, fallback is only applicable when unparsing (from Unicode to codepage). I assume that in this case "fallbackOrError" behaves like "error" and "fallbackOrReplace" behaves like "replace" and that we'd explicitly state in the spec that this is the case.

Cheers,
Andy

From: Steve Hanson/UK/IBM@IBMGB
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: DFDL-WG <dfdl-wg(a)ogf.org>
Date: 27/08/2015 09:51
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings
Sent by: dfdl-wg-bounces(a)ogf.org

It's obviously less disruptive to the DFDL spec to add extra enums to dfdl:encodingErrorPolicy. My concern in doing that is the orthogonality of substitution characters (an error has occurred) and fallbacks (defined mappings for a purpose). So let's look at the scenarios we need to support and see if that can generate a set of reasonably natural enums:

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallback"
4) Replace unmappable characters; fallbacks required => "fallbackOrReplace"

I think two new enums are needed as one IBM product that uses IBM DFDL said it wanted fallback but not substitution.
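
The four scenarios can be illustrated with a small Python sketch of unparsing a single character. It uses the enum names "fallbackOrError" and "fallbackOrReplace" from earlier in this thread, which are proposals rather than agreed spec values. The mapping tables, the function name and the default 0x1A substitution byte are assumptions made for the example, and no real codepage or ICU API is involved.

```python
from typing import Dict

FALLBACK_POLICIES = {"fallbackOrError", "fallbackOrReplace"}
REPLACE_POLICIES = {"replace", "fallbackOrReplace"}
ALL_POLICIES = FALLBACK_POLICIES | REPLACE_POLICIES | {"error"}


def unparse_char(ch: str,
                 roundtrip: Dict[str, bytes],
                 fallback: Dict[str, bytes],
                 policy: str,
                 substitution: bytes = b"\x1a") -> bytes:
    """Encode one character under a proposed dfdl:encodingErrorPolicy value."""
    if policy not in ALL_POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if ch in roundtrip:
        return roundtrip[ch]              # normal round-trip mapping
    if policy in FALLBACK_POLICIES and ch in fallback:
        return fallback[ch]               # fallback mapping switched on
    if policy in REPLACE_POLICIES:
        return substitution               # unmappable: substitute
    raise ValueError(f"unmappable character U+{ord(ch):04X}")


# Invented tables: 'A' round-trips, FULLWIDTH LATIN CAPITAL LETTER A falls back.
ROUNDTRIP = {"A": b"\x41"}
FALLBACK = {"\uff21": b"\x41"}

print(unparse_char("\uff21", ROUNDTRIP, FALLBACK, "fallbackOrError"))  # b'A'
print(unparse_char("\uff21", ROUNDTRIP, FALLBACK, "replace"))          # b'\x1a'
```

Scenario 3 is the case the IBM product asked for: a character with a fallback is mapped, and one without any mapping is still an error rather than a substitution.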
Regards
Steve Hanson

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: DFDL-WG <dfdl-wg(a)ogf.org>
Date: 26/08/2015 14:32
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings

Or... perhaps dfdl:encodingErrorPolicy="replaceOrFallback", that is, perhaps we can just add another enum value to reflect this policy rather than adding more properties.

Mike Beckerle

On Tue, Aug 25, 2015 at 10:56 AM, Mike Beckerle <mbeckerle.dfdl(a)gmail.com> wrote:

Would an IBM-specific property, to be proposed for future inclusion in DFDL. E.g., something like ibmdfdl:encodingErrorFallbackPolicy="never" or "fallback" with other enums reserved for the future.

I would like to pave a path for these sorts of proposed features. It would be good to see if this alone is sufficient to meet the needs of your customers who are asking for this, or whether they will need even a bit more control than this.

It looks like we just missed some unparse behavior in dfdl:encodingErrorPolicy="replace", as clearly when a Unicode character has no mapping, and the target encoding is SBCS and ASCII-derived, then the 0x1A character is the right thing. However, I know what will happen in Daffodil is what the standard ICU library does, with its default mapping definitions, and I don't know that this 0x1A substitution character is properly used in those mappings.
Mike Beckerle

On Tue, Aug 25, 2015 at 9:29 AM, Steve Hanson <smh(a)uk.ibm.com> wrote:

Today the DFDL 1.0 spec has property dfdl:encodingErrorPolicy to control what happens when an unmappable or malformed character is encountered - 'error' or 'replace'. When 'replace' the appropriate substitution character is used.

There is also the orthogonal question of fallback mappings, which are mappings specified by an encoding that are not normal round-trip mappings. DFDL does not currently provide for switching on fallback mappings.

Here's what ICU says about this at http://userguide.icu-project.org/conversion/data.

    In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage character byte sequence (each byte like \xhh (2 hexadecimal digits) ), and an optional "precision" or "fallback" indicator. The precision indicator either must be present in all mappings or in none of them. The indicator is a pipe symbol '|' followed by a 0, 1, 2, 3, or 4 that has the following meaning:

    |0 - A "normal", roundtrip mapping from a Unicode code point and back.
    |1 - A "fallback" mapping only from Unicode to the codepage, but not back.
    |2 - A subchar1 mapping. The code point is unmappable, and if a substitution is performed, then the subchar1 should be used rather than the subchar. Otherwise, such mappings are ignored.
    |3 - A "reverse fallback" mapping only from the codepage to Unicode, but not back to the codepage.
    |4 - A "good one-way" mapping only from Unicode to the codepage, but not back.

    Fallback mappings from Unicode typically do not map codes for the same character, but for "similar" ones. This mapping is sometimes done if a character exists in Unicode but not in the codepage.
    To replace it, ICU maps a codepage code to a similar-looking code for human-readable output. This mapping feature is not useful for text data transmission especially in markup languages where a Unicode code point can be escaped with its code point value. The ICU application programming interface (API) ucnv_setFallback() controls this fallback behavior.

    "Reverse fallbacks" are technically similar, but the same Unicode character can be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.

    A subset of the fallback mappings from Unicode is always used at runtime: Those that map private-use Unicode code points. Fallbacks from private-use code points are often introduced as replacements for previous roundtrip mappings for the same pair of codes. These replacements are used when a Unicode version assigns a new character that was previously mapped to that private-use code point. The mapping table is then changed to map the same codepage byte sequence to the new Unicode code point (as a new roundtrip) and the mapping from the old private-use code point to the same codepage code is preserved as a fallback.

    A "good one-way" mapping is like a fallback, but ICU always uses "good one-way" mappings at runtime, regardless of the fallback API flag. The idea is that fallbacks normally lose information, such as mapping from a compatibility variant of a letter to the ASCII version; however, fallbacks from PUA and reverse fallbacks are assumed to be for "the same character", just an older code for it.

So the default behaviour for ICU is to use "good one-way" mappings, "reverse fallback" mappings, and "fallback" mappings from private-use-area code points, but only to use normal "fallback" mappings if the setFallback API has been used.

IBM customers have requested the ability to use normal "fallback" mappings.
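
As an illustration of the .ucm description above, the following Python sketch parses one CHARMAP line and decides whether the mapping would be used when converting from Unicode to the codepage. It follows only the prose quoted in this message and is not ICU code; the function names and sample lines are invented, and a line without an indicator is reported as None rather than guessed at.

```python
import re
from typing import Optional, Tuple

# <Uxxxx> \xhh[\xhh...] [|n] as described in the quoted ICU text.
CHARMAP_LINE = re.compile(
    r"<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-4]))?")


def parse_charmap_line(line: str) -> Tuple[int, bytes, Optional[int]]:
    """Return (code point, codepage bytes, precision indicator or None)."""
    match = CHARMAP_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a CHARMAP mapping line: {line!r}")
    code_point = int(match.group(1), 16)
    hex_bytes = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
    data = bytes(int(h, 16) for h in hex_bytes)
    precision = int(match.group(3)) if match.group(3) is not None else None
    return code_point, data, precision


def is_private_use(code_point: int) -> bool:
    """Unicode private-use ranges (BMP PUA and planes 15 and 16)."""
    return (0xE000 <= code_point <= 0xF8FF
            or 0xF0000 <= code_point <= 0xFFFFD
            or 0x100000 <= code_point <= 0x10FFFD)


def used_from_unicode(code_point: int, precision: int,
                      fallbacks_enabled: bool = False) -> bool:
    """Is this mapping applied when converting Unicode to the codepage?"""
    if precision in (0, 4):      # roundtrip and "good one-way": always used
        return True
    if precision == 1:           # fallback: private-use, or explicitly enabled
        return fallbacks_enabled or is_private_use(code_point)
    return False                 # |2 subchar1 and |3 reverse fallback


cp, data, prec = parse_charmap_line(r"<UFF21> \x41 |1")
print(used_from_unicode(cp, prec))                          # False by default
print(used_from_unicode(cp, prec, fallbacks_enabled=True))  # True
```

The fallbacks_enabled flag plays the role that the text assigns to the setFallback API.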
At the current time, the only solution open to them is to change the .ucm file (or create a variant) and change the "|1" mappings to "|4" so that "fallback" mappings become "good one-way" mappings.

A proposal to support fallbacks was submitted a few years ago by Mike. https://www.ogf.org/pipermail/dfdl-wg/2011-November/001631.html. It proposed adding new DFDL annotations to allow replacement characters and fallback mappings to be specified. This was rejected as ICU already provides this via the .ucm file. But no simpler alternative materialised, and the resulting erratum only added dfdl:encodingErrorPolicy, which does not handle fallbacks.

Given

a) the precedent of existing IBM DFDL and Daffodil behaviour which (should) match the ICU default,
b) the orthogonality of substitution characters (an error has occurred) and fallbacks (defined mappings for a purpose), and
c) an IBM recommendation not to switch on fallbacks by default,

it feels like we need a new property, e.g. dfdl:useEncodingFallbacks 'yes' | 'no'.

Alternatives welcome. The names dfdl:encodingFallbackPolicy or dfdl:encodingPrecisionPolicy are better, but then comes the problem of finding meaningful enum values...

Also noted: The wording for dfdl:encodingErrorPolicy 'replace' says:

    If 'replace' then any error when decoding characters results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.

That is not strictly true, as the same ICU page says:

    Conversion from a codepage to Unicode occurs and an unassigned codepoint is found
    1. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage [in the .ucm file], output U+001A
    2. Otherwise output U+FFFD

There is then the question of how the two properties interact. Specifically, if fallbacks are not being used, does encountering a code point with a fallback result in dfdl:encodingErrorPolicy coming into play? I suspect so, but it needs verifying.
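
The two-step rule quoted from the ICU page for an unassigned code point during decoding is small enough to write out. This Python sketch simply restates that rule; the function name and the way subchar1 is passed in are assumptions for the example.

```python
from typing import Optional


def decode_replacement(input_sequence: bytes,
                       subchar1: Optional[bytes] = None) -> str:
    """Replacement character for an unassigned codepage byte sequence."""
    if len(input_sequence) == 1 and subchar1 is not None:
        return "\u001a"   # rule 1: single byte and subchar1 is specified
    return "\ufffd"       # rule 2: Unicode Replacement Character


print(repr(decode_replacement(b"\x80", subchar1=b"\x1a")))  # '\x1a'
print(repr(decode_replacement(b"\x81\x40", subchar1=b"\x1a")))  # '\ufffd'
```

So a description of 'replace' that mentions only U+FFFD would not cover the first case.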
Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
dfdl-wg mailing list
dfdl-wg(a)ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
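The decode-side substitution rule quoted from the ICU page in the message above can be written down as a tiny Python sketch. This is only an illustration of the two-case rule as quoted; the function name and the has_subchar1 flag are hypothetical and are not part of ICU or of any DFDL implementation.

```python
def substitution_for_unassigned(input_bytes: bytes, has_subchar1: bool) -> str:
    """Pick the character emitted when decoding hits an unassigned sequence.

    Mirrors the two cases quoted from the ICU page:
    1. a single-byte input with a subchar1 defined for the codepage -> U+001A
    2. anything else -> U+FFFD
    """
    if len(input_bytes) == 1 and has_subchar1:
        return "\u001a"
    return "\ufffd"
```

Under this reading, the current 'replace' wording (always U+FFFD) matches only the second case.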


Re: [DFDL-WG] Action 283: Provision for fallback mappings
by Steve Hanson 14 Sep '15


How about

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "errorWithFallback"
4) Replace unmappable characters; fallbacks required => "replaceWithFallback"

> As I understand it, fallback is only applicable when unparsing (from Unicode to codepage). I assume that in this case "fallbackOrError" behaves like "error" and "fallbackOrReplace" behaves like "replace" and that we'd explicitly state in the spec that this is the case.

Correct.

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Cc: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>, DFDL-WG <dfdl-wg(a)ogf.org>
Date: 08/09/2015 16:51
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings

I'm in favour of extra enumerations on dfdl:encodingErrorPolicy. Could we be more verbose on the fallback cases? So we'd have:

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallbackOrError"
4) Replace unmappable characters; fallbacks required => "fallbackOrReplace"

As I understand it, fallback is only applicable when unparsing (from Unicode to codepage). I assume that in this case "fallbackOrError" behaves like "error" and "fallbackOrReplace" behaves like "replace" and that we'd explicitly state in the spec that this is the case.
Cheers, Andy

Andy Edwards - IBM Integration Bus - DFDL
Email: andy.edwards(a)uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222 Tel ext: +44 (0)1962 817222
Desk: DE3 V17

The Feynman problem solving Algorithm 1) Write down the problem 2) Think real hard 3) Write down the answer -- Murray Gell-Mann in the NY Times

From: Steve Hanson/UK/IBM@IBMGB
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: DFDL-WG <dfdl-wg(a)ogf.org>
Date: 27/08/2015 09:51
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings
Sent by: dfdl-wg-bounces(a)ogf.org

It's obviously less disruptive to the DFDL spec to add extra enums to dfdl:encodingErrorPolicy. My concern in doing that is the orthogonality of substitution characters (an error has occurred) and fallbacks (defined mappings for a purpose). So let's look at the scenarios we need to support and see if that can generate a set of reasonably natural enums:

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallback"
4) Replace unmappable characters; fallbacks required => "fallbackOrReplace"

I think two new enums are needed, as one IBM product that uses IBM DFDL said it wanted fallback but not substitution.

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: DFDL-WG <dfdl-wg(a)ogf.org>
Date: 26/08/2015 14:32
Subject: Re: [DFDL-WG] Action 283: Provision for fallback mappings

Or... perhaps dfdl:encodingErrorPolicy="replaceOrFallback", that is, perhaps we can just add another enum value to reflect this policy rather than adding more properties.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy

On Tue, Aug 25, 2015 at 10:56 AM, Mike Beckerle <mbeckerle.dfdl(a)gmail.com> wrote:

How about an IBM-specific property, to be proposed for future inclusion in DFDL? E.g., something like ibmdfdl:encodingErrorFallbackPolicy="never" or "fallback", with other enums reserved for the future. I would like to pave a path for these sorts of proposed features. It would be good to see if this alone is sufficient to meet the needs of your customers who are asking for this, or whether they will need even a bit more control than this.

It looks like we just missed some unparse behavior in dfdl:encodingErrorPolicy="replace", as clearly when a Unicode character has no mapping, and the target encoding is SBCS and ASCII-derived, then the 0x1A character is the right thing. However, I know that what will happen in Daffodil is what the standard ICU library does with its default mapping definitions, and I don't know that this 0x1A substitution character is properly used in those mappings.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com

On Tue, Aug 25, 2015 at 9:29 AM, Steve Hanson <smh(a)uk.ibm.com> wrote:

Today the DFDL 1.0 spec has property dfdl:encodingErrorPolicy to control what happens when an unmappable or malformed character is encountered - 'error' or 'replace'. When 'replace' is set, the appropriate substitution character is used.

There is also the orthogonal question of fallback mappings, which are mappings specified by an encoding that are not normal round-trip mappings. DFDL does not currently provide for switching on fallback mappings. Here's what ICU says about this at http://userguide.icu-project.org/conversion/data.
In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage character byte sequence (each byte like \xhh (2 hexadecimal digits) ), and an optional "precision" or "fallback" indicator. The precision indicator either must be present in all mappings or in none of them. The indicator is a pipe symbol '|' followed by a 0, 1, 2, 3, or 4 that has the following meaning:

|0 - A "normal", roundtrip mapping from a Unicode code point and back.
|1 - A "fallback" mapping only from Unicode to the codepage, but not back.
|2 - A subchar1 mapping. The code point is unmappable, and if a substitution is performed, then the subchar1 should be used rather than the subchar. Otherwise, such mappings are ignored.
|3 - A "reverse fallback" mapping only from the codepage to Unicode, but not back to the codepage.
|4 - A "good one-way" mapping only from Unicode to the codepage, but not back.

Fallback mappings from Unicode typically do not map codes for the same character, but for "similar" ones. This mapping is sometimes done if a character exists in Unicode but not in the codepage. To replace it, ICU maps a codepage code to a similar-looking code for human-readable output. This mapping feature is not useful for text data transmission especially in markup languages where a Unicode code point can be escaped with its code point value. The ICU application programming interface (API) ucnv_setFallback() controls this fallback behavior. "Reverse fallbacks" are technically similar, but the same Unicode character can be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime. A subset of the fallback mappings from Unicode is always used at runtime: Those that map private-use Unicode code points. Fallbacks from private-use code points are often introduced as replacements for previous roundtrip mappings for the same pair of codes.
These replacements are used when a Unicode version assigns a new character that was previously mapped to that private-use code point. The mapping table is then changed to map the same codepage byte sequence to the new Unicode code point (as a new roundtrip) and the mapping from the old private-use code point to the same codepage code is preserved as a fallback. A "good one-way" mapping is like a fallback, but ICU always uses "good one-way" mappings at runtime, regardless of the fallback API flag. The idea is that fallbacks normally lose information, such as mapping from a compatibility variant of a letter to the ASCII version; however, fallbacks from PUA and reverse fallbacks are assumed to be for "the same character", just an older code for it. So the default behaviour for ICU is to use "good one-way" mappings, "reverse fallback" mappings, and "fallback" mappings from private-use-area code points, but only to use normal "fallback" mappings if the setFallback API has been used. IBM customers have requested the ability to use normal "fallback" mappings. At the current time, the only solution open to them is to change the .ucm file (or create a variant) and change the "|1" mappings to "|4" so that "fallback" mappings become "good one-way" mappings. A proposal to support fallbacks was submitted a few years ago by Mike. https://www.ogf.org/pipermail/dfdl-wg/2011-November/001631.html. It proposed adding new DFDL annotations to allow replacement characters and fallback mappings to be specified. This was rejected as ICU already provides this via the .ucm file. But no simpler alternative materialised, and the resulting erratum only added dfdl:encodingErrorPolicy, which does not handle fallbacks. 
Given a) the precedent of existing IBM DFDL and Daffodil behaviour which (should) match the ICU default, b) the orthogonality of substitution characters (an error has occurred) and fallbacks (defined mappings for a purpose), and c) an IBM recommendation not to switch on fallbacks by default, it feels like we need a new property, e.g. dfdl:useEncodingFallbacks 'yes' | 'no'. Alternatives welcome. The names dfdl:encodingFallbackPolicy or dfdl:encodingPrecisionPolicy are better, but then comes the problem of finding meaningful enum values...

Also noted: the wording for dfdl:encodingErrorPolicy 'replace' says:

If 'replace' then any error when decoding characters results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.

That is not strictly true, as the same ICU page says:

Conversion from a codepage to Unicode occurs and an unassigned codepoint is found
1. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage [in the .ucm file], output U+001A
2. Otherwise output U+FFFD

There is then the question of how the two properties interact. Specifically, if fallbacks are not being used, does encountering a code point with a fallback result in dfdl:encodingErrorPolicy coming into play? I suspect so, but this needs verifying.

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
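For readers following the enum discussion at the top of this reply, here is a minimal Python sketch of how the four proposed dfdl:encodingErrorPolicy values might behave when unparsing a single character. It is an illustration only: the enum names are the ones proposed in this thread (not DFDL 1.0), and the function, the exception class, the two mapping tables, and the 0x1A default substitution byte are hypothetical stand-ins rather than anything defined by IBM DFDL, Daffodil, or ICU.

```python
from typing import Mapping

# Enum values as proposed in this thread; they are not part of DFDL 1.0.
POLICIES = ("error", "replace", "errorWithFallback", "replaceWithFallback")


class UnmappableCharacterError(Exception):
    """Hypothetical error raised when a character cannot be encoded."""


def unparse_char(
    ch: str,
    roundtrip: Mapping[str, bytes],
    fallback: Mapping[str, bytes],
    policy: str,
    subchar: bytes = b"\x1a",  # assumed substitution byte for an SBCS codepage
) -> bytes:
    """Encode one character according to the proposed policy values."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    # Normal roundtrip mappings always apply.
    if ch in roundtrip:
        return roundtrip[ch]
    # Fallback mappings are consulted only by the "...WithFallback" values.
    if policy.endswith("WithFallback") and ch in fallback:
        return fallback[ch]
    # Still unmappable: substitute or raise, depending on the base policy.
    if policy.startswith("replace"):
        return subchar
    raise UnmappableCharacterError(f"no mapping for U+{ord(ch):04X}")
```

With roundtrip = {"A": b"A"} and fallback = {"\uff21": b"A"} (a fullwidth A falling back to ASCII A), "replaceWithFallback" yields b"A" for the fullwidth letter, while plain "replace" yields the substitution byte. When parsing, the thread agrees that the fallback variants would behave like their base values, so no parse-side counterpart is sketched.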


Action 283: Provision for fallback mappings
by Steve Hanson 08 Sep '15


Today the DFDL 1.0 spec has property dfdl:encodingErrorPolicy to control what happens when an unmappable or malformed character is encountered - 'error' or 'replace'. When 'replace' is set, the appropriate substitution character is used.

There is also the orthogonal question of fallback mappings, which are mappings specified by an encoding that are not normal round-trip mappings. DFDL does not currently provide for switching on fallback mappings. Here's what ICU says about this at http://userguide.icu-project.org/conversion/data.

In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage character byte sequence (each byte like \xhh (2 hexadecimal digits) ), and an optional "precision" or "fallback" indicator. The precision indicator either must be present in all mappings or in none of them. The indicator is a pipe symbol '|' followed by a 0, 1, 2, 3, or 4 that has the following meaning:

|0 - A "normal", roundtrip mapping from a Unicode code point and back.
|1 - A "fallback" mapping only from Unicode to the codepage, but not back.
|2 - A subchar1 mapping. The code point is unmappable, and if a substitution is performed, then the subchar1 should be used rather than the subchar. Otherwise, such mappings are ignored.
|3 - A "reverse fallback" mapping only from the codepage to Unicode, but not back to the codepage.
|4 - A "good one-way" mapping only from Unicode to the codepage, but not back.

Fallback mappings from Unicode typically do not map codes for the same character, but for "similar" ones. This mapping is sometimes done if a character exists in Unicode but not in the codepage. To replace it, ICU maps a codepage code to a similar-looking code for human-readable output. This mapping feature is not useful for text data transmission especially in markup languages where a Unicode code point can be escaped with its code point value.
The ICU application programming interface (API) ucnv_setFallback() controls this fallback behavior. "Reverse fallbacks" are technically similar, but the same Unicode character can be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime. A subset of the fallback mappings from Unicode is always used at runtime: Those that map private-use Unicode code points. Fallbacks from private-use code points are often introduced as replacements for previous roundtrip mappings for the same pair of codes. These replacements are used when a Unicode version assigns a new character that was previously mapped to that private-use code point. The mapping table is then changed to map the same codepage byte sequence to the new Unicode code point (as a new roundtrip) and the mapping from the old private-use code point to the same codepage code is preserved as a fallback. A "good one-way" mapping is like a fallback, but ICU always uses "good one-way" mappings at runtime, regardless of the fallback API flag. The idea is that fallbacks normally lose information, such as mapping from a compatibility variant of a letter to the ASCII version; however, fallbacks from PUA and reverse fallbacks are assumed to be for "the same character", just an older code for it. So the default behaviour for ICU is to use "good one-way" mappings, "reverse fallback" mappings, and "fallback" mappings from private-use-area code points, but only to use normal "fallback" mappings if the setFallback API has been used. IBM customers have requested the ability to use normal "fallback" mappings. At the current time, the only solution open to them is to change the .ucm file (or create a variant) and change the "|1" mappings to "|4" so that "fallback" mappings become "good one-way" mappings. A proposal to support fallbacks was submitted a few years ago by Mike. https://www.ogf.org/pipermail/dfdl-wg/2011-November/001631.html. 
It proposed adding new DFDL annotations to allow replacement characters and fallback mappings to be specified. This was rejected as ICU already provides this via the .ucm file. But no simpler alternative materialised, and the resulting erratum only added dfdl:encodingErrorPolicy, which does not handle fallbacks.

Given a) the precedent of existing IBM DFDL and Daffodil behaviour which (should) match the ICU default, b) the orthogonality of substitution characters (an error has occurred) and fallbacks (defined mappings for a purpose), and c) an IBM recommendation not to switch on fallbacks by default, it feels like we need a new property, e.g. dfdl:useEncodingFallbacks 'yes' | 'no'. Alternatives welcome. The names dfdl:encodingFallbackPolicy or dfdl:encodingPrecisionPolicy are better, but then comes the problem of finding meaningful enum values...

Also noted: the wording for dfdl:encodingErrorPolicy 'replace' says:

If 'replace' then any error when decoding characters results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.

That is not strictly true, as the same ICU page says:

Conversion from a codepage to Unicode occurs and an unassigned codepoint is found
1. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage [in the .ucm file], output U+001A
2. Otherwise output U+FFFD

There is then the question of how the two properties interact. Specifically, if fallbacks are not being used, does encountering a code point with a fallback result in dfdl:encodingErrorPolicy coming into play? I suspect so, but this needs verifying.

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
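To make the ICU description quoted in this message concrete, the following Python sketch parses one CHARMAP line and applies the runtime rules as summarised above (roundtrip and "good one-way" mappings always used from Unicode, "fallback" mappings used only when the fallback flag is set or the code point is private-use, "reverse fallback" mappings always used to Unicode). It is a sketch under stated assumptions, not ICU code: all function names are hypothetical, and treating a line without an indicator as |0 is an assumption.

```python
import re

# <Uxxxx>  \xhh[\xhh...]  [|n]   as described for the CHARMAP section
_CHARMAP_LINE = re.compile(
    r"<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-4]))?"
)


def parse_charmap_line(line: str):
    """Return (code_point, codepage_bytes, precision) for one CHARMAP line."""
    m = _CHARMAP_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a CHARMAP mapping line: {line!r}")
    code_point = int(m.group(1), 16)
    hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
    raw = bytes(int(h, 16) for h in hex_pairs)
    # Assumption: a missing precision indicator means a roundtrip mapping.
    precision = int(m.group(3)) if m.group(3) else 0
    return code_point, raw, precision


def is_private_use(code_point: int) -> bool:
    """True for the BMP private-use area and the two private-use planes."""
    return (
        0xE000 <= code_point <= 0xF8FF
        or 0xF0000 <= code_point <= 0xFFFFD
        or 0x100000 <= code_point <= 0x10FFFD
    )


def used_when_encoding(code_point: int, precision: int,
                       fallback_enabled: bool = False) -> bool:
    """Is this mapping applied from Unicode to the codepage?"""
    if precision in (0, 4):      # roundtrip and "good one-way"
        return True
    if precision == 1:           # "fallback": flag, or private-use source
        return fallback_enabled or is_private_use(code_point)
    return False                 # |2 (subchar1) and |3 (reverse fallback)


def used_when_decoding(precision: int) -> bool:
    """Is this mapping applied from the codepage to Unicode?"""
    return precision in (0, 3)   # roundtrip and "reverse fallback"
```

The workaround mentioned in the message, changing "|1" to "|4" in the .ucm file, corresponds here to moving a mapping from the flag-dependent branch to the always-used branch of used_when_encoding.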
