September 2010 - dfdl-wg

Re: [DFDL-WG] Fw: Hidden elements - summary of approaches
by Alan Powell 15 Sep '10

All

Proposed section on hidden sequences. The current dfdl:hidden annotation section is removed.

14.5 Hidden Sequence Groups

Some fields in the physical stream provide information about other fields in the stream and are not really part of the data. For example, a field could give the number of repeats in a following array. These fields may not be of interest to an application, so they may be removed from the Infoset on parsing by marking them as hidden.

A hidden sequence group allows fields to be defined that will not be added to the Infoset on parsing and will not be expected in the Infoset on unparsing.

<xs:element name="root">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstElement" type="xs:int" />
      <xs:sequence>
        <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:sequence hiddenGroupRef="tns:hiddenRepeatCount" />
        </xs:appinfo></xs:annotation>
      </xs:sequence>
      <xs:element name="arrayElement" type="xs:int"
                  minOccurs="0" maxOccurs="unbounded"
                  dfdl:occursCountKind="expression"
                  dfdl:occursCount="{./repeatCount}" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:group name="hiddenRepeatCount">
  <xs:sequence>
    <xs:element name="repeatCount" type="xs:int"
                dfdl:outputValueCalc="{count(./arrayElement)}"
                dfdl:representation="binary"
                dfdl:lengthKind="implicit" />
  </xs:sequence>
</xs:group>

Elements within a hidden sequence can be referenced in path expressions using the same DFDL expression that would be used if they were not hidden.

Hidden elements can (and typically will) contain the regular DFDL annotations to define their physical properties and, on unparsing, to set their value. They are processed using the same behavior as non-hidden elements.

When the dfdl:hiddenGroupRef property is specified, all other DFDL properties are ignored. It is a schema definition error if the sequence is not empty.

A hidden sequence may appear within another hidden sequence.

Property Name     Description
hiddenGroupRef    QName
                  Reference to a global model group definition that defines the hidden element or elements.
                  The model group within the model group definition must be a sequence.
                  Annotation: dfdl:sequence
Table 11 Hidden sequence properties

Regards
Alan Powell
Development - MQSeries, Message Broker, ESB
IBM Software Group, Application and Integration Middleware Software
------------------------------------------------------------
IBM MP211, Hursley Park
Hursley, SO21 2JN
United Kingdom
Phone: +44-1962-815073
e-mail: alan_powell(a)uk.ibm.com

From: Steve Hanson/UK/IBM
To: remcgrat(a)illinois.edu
Cc: alejandr(a)ncsa.illinois.edu, Suman Kalia/Toronto/IBM@IBMCA, Alan Powell/UK/IBM@IBMGB, Stephanie Fetzer/Charlotte/IBM@IBMUS, Tim Kimber/UK/IBM@IBMGB, Sandy Gao/Toronto/IBM@IBMCA
Date: 08/09/2010 17:13
Subject: Fw: Hidden elements - summary of approaches

Hi Bob

Alejandro included two extensions to the DFDL hidden syntax. Here's the current spec syntax for making a local element called 'repeatCount' hidden (exactly the same syntax for element ref, sequence, choice, or group ref):

<xs:element name="root">
  <xs:complexType>
    <xs:sequence>
      <xs:sequence>
        <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:hidden groupref="tns:hiddenRepeatCount" />
        </xs:appinfo></xs:annotation>
      </xs:sequence>
      <xs:element name="array" type="xs:string" maxOccurs="unbounded"
                  dfdl:occursCountKind="expression"
                  dfdl:occursCount="{./repeatCount}" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:group name="hiddenRepeatCount">
  <xs:sequence>
    <xs:element name="repeatCount" type="xs:int"
                dfdl:representation="binary"
                dfdl:lengthKind="implicit" />
  </xs:sequence>
</xs:group>

1) Hiding a local element

Here's what I think Alejandro has added in Daffodil, as an optimised syntax for hiding a local element.

<xs:element name="root">
  <xs:complexType>
    <xs:sequence>
      <xs:sequence>
        <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:hidden elementref="tns:repeatCount" />
        </xs:appinfo></xs:annotation>
      </xs:sequence>
      <xs:element name="array" type="xs:string" maxOccurs="unbounded"
                  dfdl:occursCountKind="expression"
                  dfdl:occursCount="{./repeatCount}" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="repeatCount" type="xs:int"
            dfdl:representation="binary"
            dfdl:lengthKind="implicit" />

There's a restriction though - it can only be used when minOccurs=maxOccurs=1, because we have actually lost the XML Schema particle. There's no loss of DFDL semantics as far as I am aware, because we do not have particle-specific properties.

Applying my proposed simplified syntax to Alejandro's optimisation gives the syntax below.

<xs:element name="root">
  <xs:complexType>
    <xs:sequence>
      <xs:sequence dfdl:hiddenElementRef="tns:repeatCount" />
      <xs:element name="array" type="xs:string" maxOccurs="unbounded"
                  dfdl:occursCountKind="expression"
                  dfdl:occursCount="{./repeatCount}" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="repeatCount" type="xs:int"
            dfdl:representation="binary"
            dfdl:lengthKind="implicit" />

This is quite compact, and does make the act of hiding easier. But it can only be used under specific circumstances. It requires a new property dfdl:hiddenElementRef, or (better) we rename dfdl:hiddenGroupRef to dfdl:hiddenRef and allow it to point at groups or elements. It also allows the hiding of an element reference with minOccurs=maxOccurs=1 and no explicit DFDL properties to be achieved without creating a global group. (I can hide a group reference with no explicit DFDL properties in this manner today.)

2) Hiding a local choice

If the object to be hidden is a choice, I think Alejandro is allowing the xs:choice to be the content of the global group, instead of requiring the sequence to wrap it.

<xs:group name="hiddenRepeatCount">
  <xs:choice>
    <xs:element name="repeatCount" type="xs:int"
                dfdl:representation="binary"
                dfdl:lengthKind="implicit" />
    <xs:element name="repeatString" type="xs:int"
                dfdl:representation="text"
                dfdl:lengthKind="explicit" dfdl:length="10" />
  </xs:choice>
</xs:group>

Makes sense - if I was hiding a local sequence, I wouldn't bother to wrap the sequence in yet another sequence in the global group, so why do so with a choice?

Thoughts welcome. Personally I like both.

Regards
Steve Hanson
Strategy, Common Transformation & DFDL
Co-Chair, OGF DFDL WG
IBM SWG, Hursley, UK, smh(a)uk.ibm.com, tel +44-(0)1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 08/09/2010 15:54 -----

From: Steve Hanson/UK/IBM
To: Sandy Gao/Toronto/IBM@IBMCA
Cc: Alan Powell/UK/IBM@IBMGB, Michael Hudson/Boca Raton/IBM@IBMUS, Richard Schofield/UK/IBM@IBMGB, Stephanie Fetzer/Charlotte/IBM@IBMUS, Suman Kalia/Toronto/IBM@IBMCA, Tim Kimber/UK/IBM@IBMGB, dfdl-wg(a)ogf.org
Date: 08/09/2010 10:59
Subject: Hidden elements - summary of approaches

Let's state the two options being considered, as I said I'd do this for the wider DFDL WG for the call today:

1) Global group approach

Summary:
- Particle to hide can be a local element, element ref, local sequence, local choice or group ref
- Particle is removed from its parent into a dedicated global group of composition sequence and replaced in the parent by a new empty local sequence
- The new empty local sequence carries a dfdl:hidden annotation that has a property dfdl:groupRef, other DFDL properties are not allowed
- Alternatively, the new empty local sequence carries a dfdl:hiddenGroupRef property, other DFDL properties are not allowed

Pros:
- Removal of all DFDL annotations and use of the resultant pure XSD results in same infoset
- Global group can be reused

Cons:
- Making something hidden is a refactor operation
- Global group sequence needs DFDL properties setting correctly

2) Hidden flag approach

Summary:
- Particle to hide can be a local element, element ref
- Particle takes a dfdl:hidden property
- xs:minOccurs MUST be 0
- A dfdl:minOccurs property takes the place of xs:minOccurs.

Pros:
- Easy to make something hidden

Cons:
- Removal of all DFDL annotations and using pure XSD does not guarantee the same infoset
- Breaks validation
- Duplication of minOccurs property
- Have to wrap a local sequence, choice or group ref in a complex element in order to hide it (they can't take minOccurs = 0)

Regards
Steve Hanson

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
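
As a rough illustration of the hidden repeat-count idea discussed in this thread, the Python sketch below parses a stream laid out as firstElement, a count, then that many array items, and leaves the count out of the resulting data. It is not DFDL and is not how any DFDL processor works. The function names parse and unparse, the big-endian 4-byte signed integer layout, and the use of a dict as a stand-in for the infoset are all assumptions made for this example only.

```python
import struct


def parse(data: bytes) -> dict:
    """Read firstElement, a repeat count, then that many array items.

    Assumed layout (hypothetical): big-endian 4-byte signed integers.
    """
    first, repeat_count = struct.unpack_from(">ii", data, 0)
    items = list(struct.unpack_from(f">{repeat_count}i", data, 8))
    # The count was needed to find the end of the array, but it is
    # deliberately left out of the result, like a hidden element.
    return {"firstElement": first, "arrayElement": items}


def unparse(infoset: dict) -> bytes:
    """Write the stream back, recomputing the count from the array."""
    items = infoset["arrayElement"]
    # Plays the role of dfdl:outputValueCalc="{count(./arrayElement)}".
    repeat_count = len(items)
    return struct.pack(
        f">ii{repeat_count}i", infoset["firstElement"], repeat_count, *items
    )


if __name__ == "__main__":
    stream = struct.pack(">ii3i", 7, 3, 10, 20, 30)
    infoset = parse(stream)
    print(infoset)
    print(unparse(infoset) == stream)
```

The point of the sketch is only that the count is consumed while parsing and regenerated while unparsing, so the application never has to see or supply it.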


Minutes for OGF DFDL Working Group Call, September 8-2010
by Alan Powell 15 Sep '10

Open Grid Forum: Data Format Description Language Working Group

OGF DFDL Working Group Call, September 8-2010

Attendees
  Stephanie Fetzer (IBM)
  Steve Hanson (IBM)
  Bob McGrath (National Center for Supercomputing Applications)
  Alan Powell (IBM)
  Suman Kalia (IBM)

Apologies
  Mike Beckerle (Oco)
  Tim Kimber (IBM)
  Alejandro Rodriguez (National Center for Supercomputing Applications)

1. Current Actions

Updated below.

2. xs:minLength

The spec currently states:

  When an element declaration specifies a default value, and has type xs:string, then xs:minLength must be specified and must be 1 or greater. It is a schema definition error otherwise.

The process for defaults and nils means this restriction is no longer needed.

Agreed. Closed.

3. Is UTF-16 a fixed width or variable width encoding

Appendix A: About UTF-16 and Unicode Character Codes above 0xFFFF

When we define UTF-16 to be a fixed-width double-byte wide character set we say that each UTF-16 codepoint is represented by 2 bytes. Notice the careful use of the term 'codepoint' here. Unicode/ISO10646 characters can have character codes as large as 0x10FFFF which requires 3 bytes to store (21 bits actually); however in UTF-16 characters with more than 2 bytes of code are encoded as two codepoints, called a surrogate pair; hence, UTF-16 is fixed-width, 2 bytes per codepoint. It is not 2 bytes per Unicode character. UTF-16 is really a variable-width encoding, but the characters that require the surrogate-pair treatment are so infrequently used that UTF-16 is most often treated like a 16-bit fixed-width character set. It is the acknowledgement of the existence of surrogate pairs that leads to the 'codepoint' vs. 'character code' distinction.

UTF-32 is a fixed width encoding with a full 4 bytes per character code. It represents all of Unicode with the same width per character.

Hence, when we refer to lengths in character strings we will often refer to length in characters, but we qualify that it means 2-byte codepoints when the character set encoding is UTF-16. Hence, when the property lengthUnitKind is 'characters' and the charset is 'UTF-16', then the units are actually 16-bit codepoints, not Unicode characters.

Proposal
- UCS-2 is a fixed length encoding
- UTF-16 is a variable width encoding.
- A new property dfdl:UTF16Fixed 'yes | no' to treat UTF-16 as a fixed width encoding

Action raised.

Meeting closed, 16:35

Next call Wednesday 15 September 2010 15:00 UK (10:00 ET)

Next action: 117

Actions raised at this meeting

No   Action
117  Is UTF-16 a fixed width or variable width encoding (Appendix A text and proposal as in item 3 above)

Current Actions:

No   Action
066  Investigate format for defining test cases
     25/11: IBM to see if it is possible to publish its test case format.
     04/12: no update
     ...
     17/02: IBM is willing in principle to publish the test case format and some of the test cases. May need some time to build a 'compliance suite'
     24/03: No progress
     03/03: Discussions have been taking place on the subset of tests that will be provided.
     10/03: work is progressing
     17/03: work is progressing
     31/03: work is progressing
     14/04: An XML test case format has been defined and is being tested.
     21/04: Schema for TDML defined. Need to define how this and the test cases will be made public
     05/05: Work still progressing
     12/05: Work still progressing
     02/06: Work still progressing on technical and legal considerations
     ...
     25/08: Will chase to allow Daffodil access to test cases. The WG should define how implementations confirm that they 'conform to DFDL v1'
     01/09: IBM still progressing the legal aspect. Intends to publish 100 or so tests as soon as it can, ahead of a full compliance suite.
     08/09: IBM still progressing

085  ALL: publicise Public comments phase to ensure a good review.
     14/04: see minutes
     21/04: Press release, OMG and other standards bodies.
     05/05: Alan and Steve H have contacted other standards bodies. Will ask them to add comments on spec
     15/05: still no public comments
     02/06: No public comments
     16/06: Public comments period has ended with no external comments. Alan had posted changes made in draft 041. Steve suggested sending a note to the WG highlighting these changes. Steve also suggested requesting an extension as other IBM groups may review.
     We discussed whether this was necessary as changes will need to be made during the implementation phase anyway. Alan to ask OGF what the process is for changes post public comment.
     23/06: Still no comments. Alan will contact OGF to understand the rest of the process.
     30/06: Alan has emailed Joel asking what the process is now that the public comment period is over and whether we can update the published version with WG updates. No response yet.
     07/07: No response. Alan will chase up
     14/07: No response from Joel. Sent email to Greg Newby but no response.
     21/07: Still no response.
     04/08: Joel has responded that it is up to the WG to decide if the changes are significant enough to need additional review. Alan to contact David Martin and Erwin Laure for guidance if we split the specification.
     11/08: Received a response from Joel that the WG can decide if a public re-review is necessary before becoming a 'proposed recommendation'. Alan responded that the WG agreed that a re-review was not necessary. The next stage is for the OGF review committee to approve publication.
     11/08: Specification is now 'awaiting author changes' before being submitted to the OGF technical committee for approval as a 'proposed specification'. Alan would like to have the updated specification complete by Sept 10th. The WG needs to complete all actions by then or decide that they do not need to be included in this phase of the process.
     01/09: Alan and Steve have discussed and propose Sept 30th for completion of draft 43 and closure of all actions.
     08/09: Target for completion September 30.

099  Splitting the specification in simpler sections.
     07/07: Steve sent a proposal but not discussed. Alan will arrange a separate call.
     14/07: Discussed Steve's proposal and Suman's and Alan's comments. Need to add choice, validation, facets. Also how does an implementation declare which subsets it supports? Suggested levels and/or profiles.
     Steve highlighted a problem: when a DFDL schema from an implementation of just the core functions is moved to a full DFDL implementation, what should happen about the missing properties? Does the full implementation need to be aware of subsets of functions? Should it raise a schema definition error for use of a function not in the subset?
     21/07: no progress
     04/08: Steve had updated the proposed groups of function (Subset_proposal_v2.ppt). We discussed whether it is better to have discrete sets of functions or expanding levels of function. Purpose of subsetting is:
       1. Allow simpler implementations (main purpose)
       2. Simplify tooling
       3. Simplify specification
     Steve to contact previous members of WG to check if we have the correct subsets
     11/08: Steve sent an email to previous members of the WG asking for opinions on splitting the specification. Bob McGrath from the National Center for Supercomputing Applications responded that they had implemented about 80% of the function. Alejandro will send a description of the function they have implemented. Action will be raised to track the Daffodil implementation
     11/08: not discussed
     01/09: NCSA implementation description received. Making the unparser optional is a good idea (NCSA do not need one). Work will progress on the subsets.
     08/09: No progress

101  Semantics of 'fixed'
     21/07: Discussed whether not matching the 'fixed' value should be a validation error or processing error. Decided that for consistency it should be a validation error. It would be useful however to avoid having to duplicate facet information in an assert, which could become unwieldy for, say, a large enumeration. Suggestions:
       - a parser option that 'converted all validation errors to processing errors'
       - a dfdl expression function that 'applied all facets' or 'applied specific facet' to a particular element.
     Stephanie will produce some examples of how this could be used.
     04/08: Stephanie had produced examples but they were not discussed due to lack of time
     11/08: We started to discuss Stephanie's HIPAA example but ran out of time.
     25/08: Not discussed
     01/09: Discuss next week
     08/09: Stephanie sent an example of an X12 document showing how an element with the same name was defined in different groups with different enumerations.
     Proposal:
       - xs:fixed will not be used for parsing but only for validation and for providing a default value on unparsing.
       - A new dfdl function will be defined that applies only to simple elements and tests whether the element exists, including applying all the schema facets (need to check with Tim why he wanted to only apply enumerations).
           dfdl:exists( xpath , true | false)
         true means apply facets, false means don't apply facets.

     <xs:element ref="REF_BillingProviderTaxIdentification_2010AA">
       <xs:annotation>
         <xs:documentation>Discrimination needed to distinguish REF segments</xs:documentation>
         <xs:appinfo source="http://www.ogf.org/dfdl/">
           <dfdl:discriminator test="{dfdl:exists(./REF01__ReferenceIdentificationQualifier, true)}"/>
         </xs:appinfo>
       </xs:annotation>
     </xs:element>

107  teston/testoff dfdl expression functions.
     Are these functions still needed? They were introduced to allow individual bits to be set in a byte. Steve to look at TLog and ISO 8583 formats that use existence flags to see if they are still required.
     04/08: Not discussed
     11/08: Not discussed
     25/08: Not discussed
     01/09: Steve to progress by Sept 30th
     08/09: Steve to progress by Sept 30th

108  dfdl:hidden
     There has been some discussion on whether the 'hidden' global group should be indicated in some way.
     04/08: A lively discussion. The specification works as currently defined, so the question is whether changes need to be made to make tooling easier. There shouldn't be 'conventions' in particular tooling as they must be able to properly deal with schema from other tools that would not obey those conventions.
     Steve stated that it is often dangerous to hide too much from users when they can see the underlying schema. To be continued.
     25/08: There have been some offline discussions about simplifying how hidden elements are implemented. The proposal is:
       - dfdl:hidden property on xs:element only
       - xs:minOccurs and xs:maxOccurs MUST be 0 when hidden
       - dfdl:minOccurs and dfdl:maxOccurs for hidden elements only. An element is 'required' when dfdl:minOccurs > 0 and normal default processing occurs.
       - The schema, without dfdl annotations, must match the infoset, so the assumption is that non-DFDL tools, such as mappers, will ignore/not show elements with xs:minOccurs and xs:maxOccurs = '0'
     01/09: The above proposal is flawed due to use of maxOccurs = 0 (this was identified back in 2008, hence the current spec). Bob confirmed that NCSA models use hidden in a big way, so punting hidden beyond 1.0 is not an option. Two candidates:
       - As per spec but with syntactic improvements to make it clear that the two xs:sequences do not take any dfdl:sequence properties
       - Place a flag directly on a local element and force minOccurs to be 0. Simpler syntax but the semantics change, as the element *could* be legally in the infoset, although a DFDL parser would never put it there.
     Steve will circulate the two proposals for next week. Bob to talk to Alejandro as the NCSA implementation is currently more flexible than the spec, allowing the groupref to point to a choice, and an elementref. Are these really needed?
     08/09: Discussed the Global Group and Hidden Flag approaches. Decided to stay with Global Group with dfdl:sequence properties rather than the dfdl:hidden annotation. It was agreed that there would be no extra properties on the 'hidden' global group, as the syntax was messy (it should really be on the sequence) and there are currently no dfdl properties on global groups.
     Global group approach
     Summary:
       - Particle to hide can be a local element, element ref, local sequence, local choice or group ref
       - Particle is removed from its parent into a dedicated global group of composition sequence and replaced in the parent by a new empty local sequence
       - The new empty local sequence carries a dfdl:hiddenGroupRef property, other DFDL properties are not allowed
     Pros:
       - Removal of all DFDL annotations and use of the resultant pure XSD results in same infoset
       - Global group can be reused
     Cons:
       - Making something hidden is a refactor operation
       - Global group sequence needs DFDL properties setting correctly
     The Daffodil parser allows the hidden annotation to reference global elements in addition to global groups. It was noted that this lost the particle properties but we need to discuss with Alejandro.

111  Daffodil DFDL parser
     11/08: Bob and Alejandro described the new implementation that they have developed. It is a new code base and is not based on the Defuddle prototype. It is written in Scala and implements approximately 80% of the features in the public comments draft of DFDL V1. Alejandro will send a list of the features not implemented. We discussed the scenarios that motivated the development, which was to extract data from various sources and transform it into canonical formats. Bob offered to make Daffodil available for the WG to assess the functionality. IBM WG members will get approval from the company to allow them to receive Daffodil. Bob raised the question that if Daffodil becomes the public implementation of DFDL then we will need to work out how that would be funded and managed. It would be helpful if IBM test cases were available to Daffodil. IBM will investigate
     25/08: Alejandro had sent a list of the functions that he has implemented and Steve had responded indicating the extra functions he thought were essential.
     Since then Alejandro has implemented some of the missing functions, such as escape schemes, pre-defined variables, binary decimal numbers, etc, and will update his list. Bob is planning to make the parser available on the internet to allow testing. His organisation is being reorganised and he doesn't know what the priority of Daffodil will be, so it is essential that we move quickly. It would help if IBM could indicate its support for Daffodil in some semi-formal way.
     01/09: Alejandro updating Daffodil to include escape schemes, unordered sequences and ignoreCase. Daffodil being placed under formal source control in anticipation of external release. Bob has a start October deadline to create a report on what has been done for his sponsors. It would be great if we could get Daffodil on the web and have run some IBM tests so it could be highlighted at OGF 30 at end October.
     08/09: Alejandro is marking up Spec draft 42 to indicate which features Daffodil implements. Bob expects Daffodil to be available on the web soon.

112  DFDL certification process
     25/08: Discussed how to certify DFDL implementations. Alan to investigate if OGF have a defined process.
     01/09: In progress, spec needs to state what conformance means, as part of this work
     08/09: Discussed what needs to be said in the spec and agreed that details of a conformance test suite should be in another document. Alan to draft conformance section.

113  Regular Expressions.
     25/08: The DFDL regular expressions should provide lookahead and backreferences. Is the current regular expression language sufficient?
       a. Is the XML regular expression language the correct one to use? Tim asked if DFDL needs to specify a language at all or should leave it to implementers to pick one. That would inhibit portability of schema.
     01/09: There are many variations of regexp language; it seems wise to specify one that we know contains functions like lookaround, which makes it easy to say things like 'give me everything up to but not including x'. This rules out XML Schema and POSIX; it needs Perl 5 or Java.
     08/09: Agreed that the specification should define the regular expression language (if only by referring to other specifications). Should allow a common subset of the Perl and Java expression languages. Alan to update regular expression section.

113b Regular Expressions for Assert/Discriminator.
     25/08: The DFDL regular expressions should provide lookahead and backreferences. Is the current regular expression language sufficient?
       b. A regular expression property on an assert/discriminator as an alternative to the test expression. Either a DFDL expression or a regular expression could be specified but not both.
     01/09: Tim to convince Steve (via example) that use of regexp in asserts is needed in 1.0.
     08/09: Agreed that this is a useful function
       - Allowed as alternative to expression on dfdl:assert and dfdl:discriminator
       - Pattern may be specified as attribute or element value
       - Attribute: new testPattern attribute
       - Element value: braces ( ) indicate pattern instead of expression

114  OGF 30
     25/08: OGF30 takes place on October 25-29 in Brussels. Should we have a WG session?
     01/09: Given emergence of NCSA implementation and spec completion target of 30th Sept it makes sense to host a session at OGF 30.
     08/09: Steve to request permission to go

115  Clarify allowed lengths for signed integer types when rep is binary integer (ie, two's complement)
     01/09: No technical reason to restrict lengths to 2^x bytes, could be odd, could be bits. But rare in practice so if we do relax, limit any core subset to 2^x bytes.
     08/09: not discussed

117  Is UTF-16 a fixed width or variable width encoding (Appendix A text and proposal as in item 3 above)

Closed actions

No   Action

Work items:

No   Item
005  Improvements on property descriptions
     Status: not started
012  Reordering the properties discussion: move representation earlier, improve flow of topics
     Status: not started
036  Update dfdl schema with change properties
     Status: ongoing
042  Mapping of the DFDL infoset to XDM
     Target version: none. Status: not required for V1 specification
070  Write DFDL primer
071  Write test cases.
083  Implement RFC2116
109  Add 'message' attribute to dfdl:discriminator
     01/09: Closed: Conclusion was that this is genuinely useful, and has low implementation cost. Will add a 'message' attribute to dfdl:discriminator.
     Target version: 43. Status: not started
110  Clarify expression limitations for defineVariable, newVariableInstance and setVariable
     01/09: Closed: Spec should distinguish newVariableInstance defaultValue from setVariable value. For newVariableInstance defaultValue, disallow downward references and references to self (must be usable from the point of declaration). For setVariable allow downward references and references to self, and always evaluate at end of component. (defineVariable defaultValue should be same as newVariableInstance.)
     Target version: 43. Status: not started
113  Be specific about regular expression syntax
     Target version: 43. Status: not started
108  Updates to hidden mechanism
     Target version: 43. Status: not started
99   Updates to reflect subsetting and unparser optionality
     Target version: 43. Status: not started
112  Define what conformance to spec means
     Target version: 43. Status: not started
115  Clarify allowed lengths for signed binary integers
     Target version: 43. Status: not started
116  xs:minLength
     The spec currently states: When an element declaration specifies a default value, and has type xs:string, then xs:minLength must be specified and must be 1 or greater. It is a schema definition error otherwise. The process for defaults and nils means this restriction is no longer needed.
     Agreed

Regards
Alan Powell
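
The codepoint versus character distinction from item 3 of these minutes can be checked with a short Python sketch. It counts the 16-bit units (which the Unicode Standard calls code units, and the Appendix A text calls codepoints) needed for a string in UTF-16 and compares that with the number of Unicode characters. The helper names and the two sample characters are assumptions chosen for illustration; nothing here comes from the DFDL specification itself.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit units needed to store text in UTF-16."""
    # utf-16-be writes no byte order mark, so bytes / 2 is the unit count.
    return len(text.encode("utf-16-be")) // 2


def utf32_bytes(text: str) -> int:
    """Count the bytes needed to store text in UTF-32."""
    return len(text.encode("utf-32-be"))


BMP_CHAR = "A"                     # U+0041, fits in one 16-bit unit
SUPPLEMENTARY_CHAR = "\U0001D11E"  # U+1D11E, above 0xFFFF


if __name__ == "__main__":
    for ch in (BMP_CHAR, SUPPLEMENTARY_CHAR):
        print(f"U+{ord(ch):04X}: characters={len(ch)}, "
              f"UTF-16 units={utf16_units(ch)}, UTF-32 bytes={utf32_bytes(ch)}")
```

A string of one character above 0xFFFF therefore has length 1 in characters but length 2 in 16-bit units, which is the case a property such as the proposed dfdl:UTF16Fixed would have to decide how to count.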


Agenda for OGF DFDL WG call 15 September 2010 15:00 UK (10:00 ET)
by Alan Powell 14 Sep '10

1. Current Actions

2. Document that an empty sequence that is the content of a complex type is ignored even when it has annotations

One thing to point out is that authors should avoid

<xs:complexType>
  <xs:sequence dfdl:hiddenGroupRef="..."/>
</xs:complexType>

(The same applies to other annotations on sequences, long- or short-form.) The schema spec will discard that sequence (see [1], definition of "effective content", clause 2.1.2). The following works:

<xs:complexType>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="..."/>
  </xs:sequence>
</xs:complexType>

[1] http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/#key-exg

Current Actions:

No   Action
066  Investigate format for defining test cases
     25/11: IBM to see if it is possible to publish its test case format.
     04/12: no update
     ...
     17/02: IBM is willing in principle to publish the test case format and some of the test cases. May need some time to build a 'compliance suite'
     24/03: No progress
     03/03: Discussions have been taking place on the subset of tests that will be provided.
     10/03: work is progressing
     17/03: work is progressing
     31/03: work is progressing
     14/04: An XML test case format has been defined and is being tested.
     21/04: Schema for TDML defined. Need to define how this and the test cases will be made public
     05/05: Work still progressing
     12/05: Work still progressing
     02/06: Work still progressing on technical and legal considerations
     ...
     25/08: Will chase to allow Daffodil access to test cases. The WG should define how implementations confirm that they 'conform to DFDL v1'
     01/09: IBM still progressing the legal aspect. Intends to publish 100 or so tests as soon as it can, ahead of a full compliance suite.
     08/09: IBM still progressing

085  ALL: publicise Public comments phase to ensure a good review.
     14/04: see minutes
     21/04: Press release, OMG and other standards bodies.
     05/05: Alan and Steve H have contacted other standards bodies.
Will ask them to add comments on spec 15/05: still no public comments 02/06: No public comments 16/06: Public comments period has ended with no external comments. Alan had posted changes made in draft 041. Steve suggested send a note to the WG highlighting these changes. Steve also suggested requesting an extension as other IBM groups may review. We discussed whether this was necessary as changes will need to be made during the implementation phase anyway. Alan to ask OGF what the process is for changes post public comment. 23/06: Still no comments. Alan will contact OGF to understand the rest of the process. 30/06: Alan has emailed Joel asking what the process is now public comment period is over and can we update the published version with WG updates. No response yet. 07/07: No response. Alan will chase up 14/07: No response from Joel. Sent email to Greg Newby by no response. 21/07: Still no response. 04/08: Joel has responded that it is up to the WG to decide if the changes are significant enough to need additional review. Alan to contact David Martin and Erwin Laure for guidance if we split the specification. 11/08: Received a response from Joel that the WG can decide if a re- public review is necessary before becoming a 'proposed recommendation'. Alan responded that the WG agreed that a re-review was not necessary. The next stage is for OGF review committee to approve publication. 11/08: Specification is now 'awaiting author changes' before being submitted to the OGF technical committee for approval as a 'proposed specification'. Alan would like to have the updated specification complete by Sept 10th. The WG needs to complete all actions by then or decide that they do not need to be included in this phase of the process. 01/09: Alan and Steve have discussed and propose Sept 30th for completion of draft 43 and closure of all actions. 08/09: Target for completion September 30. 099 Splitting the specification in simpler sections. 
07/07: Steve sent a proposal but not discussed. Alan will arrange a separate call. 14/07:Discussed Steve's proposal and Suman's and Alan's comments. Need to add choice, validation, facets. Also how does an implementation declare which subsets it supports. Suggested levels and/or profiles. Steve highlighted a problem when a DFDL schema from an implementation of just the core functions was moved to a full DFDL implementation what should happen about the missing properties. Does the full implementation need to be aware of subsets of functions? Should it raise a schema definition error for use of a function not in the subset. 21/07: no progress 04/08: Steve had updated proposed groups of function. (Subset_proposal_v2.ppt). We discussed whether its is better to have discrete sets of functions or expanding levels of function. Purpose of subsetting is: 1. Allow simpler implementations. (main purpose) 2. Simplify tooling 3. Simplify specification. Steve to contact previous members of WG to check if we have the correct subsets 11/08: Steve sent an email to previous members of the WG asking for opinions on splitting the specification. Bob McGrath from National Center For Supercomputing responded that they had implemented about 80% of the function. Alejandro will send a description of the function they have implemented. Action will be raised to track the Daffodil implementation 11/08: not discussed 01/09: NCSA implementation description received. Making the unparser optional is a good idea (NCSA do not need one) . Work will progress on the subsets. 08/09: No progress 101 Semantics of 'fixed' 21/07: Discussed whether not matching the 'fixed' value should be a validation error or processing error. Decided that for consistency it should be a validation error. It would be useful however to avoid having to duplication of facet information in an assert which could become unwieldy for, say, a large enumeration. 
Suggestions:
- a parser option that 'converted all validation errors to processing errors'
- a DFDL expression function that 'applied all facets' or 'applied specific facet' to a particular element.
Stephanie will produce some examples of how this could be used.
04/08: Stephanie had produced examples but they were not discussed due to lack of time
11/08: We started to discuss Stephanie's HIPAA example but ran out of time.
25/08: Not discussed
01/09: Discuss next week
08/09: Stephanie sent an example of an X12 document showing how an element with the same name was defined in different groups with different enumerations.
Proposal:
- xs:fixed will not be used for parsing but only for validation and for providing a default value on unparsing.
- A new DFDL function will be defined that applies only to simple elements and tests whether the element exists, including applying all the schema facets (need to check with Tim why he wanted to only apply enumerations).

    dfdl:exists( xpath , true | false )

  true means apply facets, false means don't apply facets.

    <xs:element ref="REF_BillingProviderTaxIdentification_2010AA">
      <xs:annotation>
        <xs:documentation>Discrimination needed to distinguish REF segments</xs:documentation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:discriminator test="{dfdl:exists(./REF01__ReferenceIdentificationQualifier, true)}"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>

107 teston/testoff dfdl expression functions.
Are these functions still needed? They were introduced to allow individual bits to be set in a byte. Steve to look at TLog and ISO 8583 formats that use existence flags to see if they are still required.
04/08: Not discussed
11/08: Not discussed
25/08: Not discussed
01/09: Steve to progress by Sept 30th
08/09: Steve to progress by Sept 30th

108 dfdl:hidden
There has been some discussion on whether the 'hidden' global group should be indicated in some way.
04/08: A lively discussion.
The specification is works as currently defined so whether changes need to be made to make tooling easier. There shouldn't be 'conventions' in particular tooling as they must be able to properly deal with schema from other tools that would not obey those conventions. Steve stated that it is often dangerous to hide too much from users when they can see they underlying schema. To be continued. 25/08: there has been some offline discussions about simplifying how hidden elements are implemented. The proposal is dfdl:hidden property on xs:element only xs:minOccurs and xs:maxOccurs MUST be 0 when hidden dfdl:minOccurs and dfdl:maxOccurs for hidden elements only. An element is 'required' when dfdl:minOccurs >0 and normal default processing occurs. The schema, without dfdl annotations, must match the infoset so assumption is that non-DFDL tools, such as mappers, will ignore/not show elements with xs:minOccurs and xs:maxOccurs = '0' 01/09: The above proposal is flawed due to use of maxOccurs = 0 (this was identified back in 2008 hence current spec). Bob confirmed that NCSA models use hidden in a big way, so punting hidden beyond 1.0 is not an option. Two candidates: - As per spec but with syntactic improvements to make it clear that the two xs:sequences do not take any dfdl:sequence properties - Place a flag directly on a local element and force minOccurs to be 0. Simpler syntax but the semantic changes, as the element *could* be legally in the infoset, although a DFDL parser would never put it there. Steve will circulate the two proposals for next week. Bob to talk to Alejandro as the NCSA implementation is currently more flexible than the spec, allowing the groupref to point to a choice, and an elementref. Are these really needed? 08/09: Discussed the Global Group and Hidden Flag approaches. Decided to stay with Global Group with dfdl:sequence properties rather than the dfdl:hidden annotation. 
It was agreed that there would be no extra properties on the 'hidden' global group as the syntax was messy as it should really be on the sequence and there are currently no dfdl properties on global groups. Global group approach Summary: Particle to hide can be a local element, element ref, local sequence, local choice or group ref Particle is removed from its parent into a dedicated global group of composition sequence and replaced in the parent by a new empty local sequence The new empty local sequence carries a dfdl:hiddenGroupRef property, other DFDL properties are not allowed Pros: Removal of all DFDL annotations and use of the resultant pure XSD results in same infoset Global group can be reused Cons: Making something hidden is a refactor operation Global group sequence needs DFDL properties setting correctly The Daffodil parser allows the hidden annotation to reference global elements in addition to global groups. It was noted that this lost the particle properties but we need to discuss with Alejandro. 111 Daffodil DFDL parser 11/08: Bob and Alejandro described the new implementation that they have developed. It is a new code base and is not based on the Deffudle prototype. It is written in scala and implements approximately 80% of the features in the public comments draft of DFDL V1. Alejandro will send a list of the features not implemented. We discussed the scenarios that motivated the development which was to extract data from various sources and transform into canonical formats. Bob offered to make Daffodil available for the WG to assess the functionality. IBM WG members will get approval the company to allow them to receive Daffodil. Bob raised the question that if Daffodil becomes the public implementation of DFDL then we will need to work out how that would be funded and managed. It would be helpful if IBM test cases were available to Daffodil. 
IBM will investigate 25/08: Alejandro had sent a list of the functions that he has implemented and Steve ahd responding indicating the extra functions he thought were essential. Since then Alejandro has implemented some of the missing functions, such as escape schemes, pre-defined variables, binary decimal numbers, etc, and will update his list. Bob is planning to make the parser available on the internet to allow testing. His organisation is being reorganised and he doesn't know what the priority of Daffodill will be so it is essential that we move quickly. It would help if IBM could indicate its support for Daffodil in some semi-formal way. 01/09: Alejandro updating Daffodil to include escape schemes, unordered sequences and ignoreCase. Daffodil being placed under formal source control in anticipation of external release. Bob has a start October deadline to create a report on what has been done for his sponsors. It would be great if we could get Daffodil on the web and have run some IBM tests so it could be highlighted at OGF 30 at end October. 08/09: Alejandro is marking up Spec draft 42 to indicate which features Daffodil implement. Bob expects Daffodil to be available on the web soon. 112 DFDL certification process 25/08: Discussed how to certify DFDL implementations. Alan to investigate if OGF have a defined process. 01/09: In progress, spec needs to state what conformance means, as part of this work 08/09: Discussed what needs to be said in the spec and agreed that details of a conformance test suite should be in another document. Alan to draft conformance section. 113 Regular Expressions. 25/08: The DFDL regular expressions should provide lookahead and backreferences. Is the current regular expression language sufficient? a. Is the XML regular expression language the correct one to use. Tim asked if DFDL needs to specify an language at all and should leave it to implementers to pick one. That would inhibit portability of schema. 
01/09: There are many variations of regexp language; it seems wise to specify one that we know contains functions like lookaround, which makes it easy to say things like 'give me everything up to but not including x'. This rules out XML Schema and POSIX; it needs Perl 5 or Java.
08/09: Agreed that the specification should define the regular expression language (if only by referring to other specifications). Should allow a common subset of the Perl and Java expression languages. Alan to update regular expression section.

113b Regular Expressions for Assert/Discriminator.
25/08: The DFDL regular expressions should provide lookahead and backreferences. Is the current regular expression language sufficient?
b. A regular expression property on an assert/discriminator as an alternative to the test expression. Either a DFDL expression or a regular expression could be specified but not both.
01/09: Tim to convince Steve (via example) that use of regexp in asserts is needed in 1.0.
08/09: Agreed that this is a useful function
- Allowed as alternative to expression on dfdl:assert and dfdl:discriminator
- Pattern may be specified as attribute or element value
- Attribute: new testPattern attribute
- Element value: braces ( ) indicate pattern instead of expression

114 OGF 30
25/08: OGF30 takes place on October 25-29 in Brussels. Should we have a WG session?
01/09: Given emergence of NCSA implementation and spec completion target of 30th Sept it makes sense to host a session at OGF 30.
08/09: Steve to request permission to go

115 Clarify allowed lengths for signed integer types when rep is binary integer (i.e., two's complement)
01/09: No technical reason to restrict lengths to 2^x bytes, could be odd, could be bits. But rare in practice so if we do relax, limit any core subset to 2^x bytes.
08/09: not discussed

117 3. Is UTF-16 a fixed width or variable width encoding

Appendix A: About UTF-16 and Unicode Character Codes above 0xFFFF

When we define UTF-16 to be a fixed-width double-byte wide character set we say that each UTF-16 codepoint is represented by 2 bytes. Notice the careful use of the term 'codepoint' here. Unicode/ISO10646 characters can have character codes as large as 0x10FFFF which requires 3 bytes to store (21 bits actually); however in UTF-16 characters with more than 2 bytes of code are encoded as two codepoints, called a surrogate pair; hence, UTF-16 is fixed-width, 2 bytes per codepoint. It is not 2 bytes per Unicode character.

UTF-16 is really a variable-width encoding, but the characters that require the surrogate-pair treatment are so infrequently used that UTF-16 is most often treated like a 16-bit fixed-width character set. It is the acknowledgement of the existence of surrogate pairs that leads to the 'codepoint' vs. 'character code' distinction.

UTF-32 is a fixed width encoding with a full 4 bytes per character code. It represents all of Unicode with the same width per character.

Hence, when we refer to lengths in character strings we will often refer to length in characters, but we qualify that it means 2-byte codepoints when the character set encoding is UTF-16. Hence, when the property lengthUnitKind is 'characters' and the charset is 'UTF-16', then the units are actually 16-bit codepoints, not Unicode characters.

Proposal
- UCS2 is a fixed length encoding
- UTF-16 is a variable width encoding.
- A new property dfdl:UTF16Fixed 'yes | no' to treat UTF-16 as a fixed width encoding

Regards

Alan Powell
Development - MQSeries, Message Broker, ESB
IBM Software Group, Application and Integration Middleware Software
IBM MP211, Hursley Park, Hursley, SO21 2JN, United Kingdom
Phone: +44-1962-815073
e-mail: alan_powell(a)uk.ibm.com

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
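
The distinction drawn in the UTF-16 item above (16-bit units versus Unicode characters) can be checked with a few lines of Python. This is only an illustrative sketch and not part of the proposal: the helper name utf16_code_units is made up for the example, and what the appendix text calls a 'codepoint' is counted here as a 16-bit code unit.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to encode text as UTF-16 (no BOM)."""
    return len(text.encode("utf-16-be")) // 2


bmp = "A\u00e9"        # two characters, both below 0x10000
astral = "\U0001D11E"  # one character above 0xFFFF (MUSICAL SYMBOL G CLEF)

# len() on a Python str counts Unicode characters, not encoded units.
print(len(bmp), utf16_code_units(bmp))        # 2 2
print(len(astral), utf16_code_units(astral))  # 1 2  (surrogate pair)
print(len(astral.encode("utf-32-be")) // 4)   # 1    (UTF-32 stays fixed width)
```

A length of 1 in 'characters' therefore means 2 or 4 bytes of UTF-16 data depending on the character, which is the case a fixed-width treatment of UTF-16 has to rule out or ignore.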

Hidden elements - summary of approaches
by Steve Hanson 08 Sep '10

Let's state the two options being considered, as I said I'd do this for the wider DFDL WG for the call today:

1) Global group approach

Summary:
- Particle to hide can be a local element, element ref, local sequence, local choice or group ref
- Particle is removed from its parent into a dedicated global group of composition sequence and replaced in the parent by a new empty local sequence
- The new empty local sequence carries a dfdl:hidden annotation that has a property dfdl:groupRef, other DFDL properties are not allowed
- Alternatively, the new empty local sequence carries a dfdl:hiddenGroupRef property, other DFDL properties are not allowed

Pros:
- Removal of all DFDL annotations and use of the resultant pure XSD results in same infoset
- Global group can be reused

Cons:
- Making something hidden is a refactor operation
- Global group sequence needs DFDL properties setting correctly

2) Hidden flag approach

Summary:
- Particle to hide can be a local element, element ref
- Particle takes a dfdl:hidden property
- xs:minOccurs MUST be 0
- A dfdl:minOccurs property takes the place of xs:minOccurs.

Pros:
- Easy to make something hidden

Cons:
- Removal of all DFDL annotations and using pure XSD does not guarantee the same infoset
- Breaks validation
- Duplication of minOccurs property
- Have to wrap a local sequence, choice or group ref in a complex element in order to hide it (they can't take minOccurs = 0)

Regards

Steve Hanson
Strategy, Common Transformation & DFDL
Co-Chair, OGF DFDL WG
IBM SWG, Hursley, UK, smh(a)uk.ibm.com, tel +44-(0)1962-815848

Agenda for OGF DFDL WG call 8 September 2010 15:00 UK (10:00 ET)
by Alan Powell 07 Sep '10

1. Current Actions

2. xs:minLength

The spec currently states:

    When an element declaration specifies a default value, and has type xs:string, then xs:minLength must be specified and must be 1 or greater. It is a schema definition error otherwise.

The process for defaults and nils means this restriction is no longer needed.

3. Is UTF-16 a fixed width or variable width encoding

Appendix A: About UTF-16 and Unicode Character Codes above 0xFFFF

When we define UTF-16 to be a fixed-width double-byte wide character set we say that each UTF-16 codepoint is represented by 2 bytes. Notice the careful use of the term 'codepoint' here. Unicode/ISO10646 characters can have character codes as large as 0x10FFFF which requires 3 bytes to store (21 bits actually); however in UTF-16 characters with more than 2 bytes of code are encoded as two codepoints, called a surrogate pair; hence, UTF-16 is fixed-width, 2 bytes per codepoint. It is not 2 bytes per Unicode character.

UTF-16 is really a variable-width encoding, but the characters that require the surrogate-pair treatment are so infrequently used that UTF-16 is most often treated like a 16-bit fixed-width character set. It is the acknowledgement of the existence of surrogate pairs that leads to the 'codepoint' vs. 'character code' distinction.

UTF-32 is a fixed width encoding with a full 4 bytes per character code. It represents all of Unicode with the same width per character.

Hence, when we refer to lengths in character strings we will often refer to length in characters, but we qualify that it means 2-byte codepoints when the character set encoding is UTF-16. Hence, when the property lengthUnitKind is 'characters' and the charset is 'UTF-16', then the units are actually 16-bit codepoints, not Unicode characters.

Current Actions:

No   Action

066 Investigate format for defining test cases
25/11: IBM to see if it is possible to publish its test case format.
04/12: no update
...
17/02: IBM is willing in principle to publish the test case format and some of the test cases. May need some time to build a 'compliance suite' 24/03: No progress 03/03: Discussions have been taking place on the subset of tests that will be provided. 10/03: work is progressing 17/03: work is progressing 31/03: work is progressing 14/04: And XML test case format has been defined and is being tested. 21/04. Schema for TDML defined. Need to define how this and the test cases will be made public 05/05: Work still progressing 12/05: Work still progressing 02/06: Work still progressing on technical and legal considerations ... 25/08: Will chase to allow Daffodil access to test cases. The WG should define how implementation confirm that they 'conform to DFDL v1' 01/09: IBM still progressing the legal aspect. Intends to publish 100 or so tests as soon as it can, ahead of a full compliance suite. 085 ALL: publicise Public comments phase to ensure a good review.. 14/04: see minutes 21/04: Press release, OMG and other standards bodies. 05/05: Alan and Steve H have contacted other standards bodies. Will ask them to add comments on spec 15/05: still no public comments 02/06: No public comments 16/06: Public comments period has ended with no external comments. Alan had posted changes made in draft 041. Steve suggested send a note to the WG highlighting these changes. Steve also suggested requesting an extension as other IBM groups may review. We discussed whether this was necessary as changes will need to be made during the implementation phase anyway. Alan to ask OGF what the process is for changes post public comment. 23/06: Still no comments. Alan will contact OGF to understand the rest of the process. 30/06: Alan has emailed Joel asking what the process is now public comment period is over and can we update the published version with WG updates. No response yet. 07/07: No response. Alan will chase up 14/07: No response from Joel. Sent email to Greg Newby by no response. 
21/07: Still no response. 04/08: Joel has responded that it is up to the WG to decide if the changes are significant enough to need additional review. Alan to contact David Martin and Erwin Laure for guidance if we split the specification. 11/08: Received a response from Joel that the WG can decide if a re- public review is necessary before becoming a 'proposed recommendation'. Alan responded that the WG agreed that a re-review was not necessary. The next stage is for OGF review committee to approve publication. 11/08: Specification is now 'awaiting author changes' before being submitted to the OGF technical committee for approval as a 'proposed specification'. Alan would like to have the updated specification complete by Sept 10th. The WG needs to complete all actions by then or decide that they do not need to be included in this phase of the process. 01/09: Alan and Steve have discussed and propose Sept 30th for completion of draft 43 and closure of all actions. 099 Splitting the specification in simpler sections. 07/07: Steve sent a proposal but not discussed. Alan will arrange a separate call. 14/07:Discussed Steve's proposal and Suman's and Alan's comments. Need to add choice, validation, facets. Also how does an implementation declare which subsets it supports. Suggested levels and/or profiles. Steve highlighted a problem when a DFDL schema from an implementation of just the core functions was moved to a full DFDL implementation what should happen about the missing properties. Does the full implementation need to be aware of subsets of functions? Should it raise a schema definition error for use of a function not in the subset. 21/07: no progress 04/08: Steve had updated proposed groups of function. (Subset_proposal_v2.ppt). We discussed whether its is better to have discrete sets of functions or expanding levels of function. Purpose of subsetting is: 1. Allow simpler implementations. (main purpose) 2. Simplify tooling 3. Simplify specification. 
Steve to contact previous members of WG to check if we have the correct subsets 11/08: Steve sent an email to previous members of the WG asking for opinions on splitting the specification. Bob McGrath from National Center For Supercomputing responded that they had implemented about 80% of the function. Alejandro will send a description of the function they have implemented. Action will be raised to track the Daffodil implementation 11/08: not discussed 01/09: NCSA implementation description received. Making the unparser optional is a good idea (NCSA do not need one) . Work will progress on the subsets. 101 Semantics of 'fixed' 21/07: Discussed whether not matching the 'fixed' value should be a validation error or processing error. Decided that for consistency it should be a validation error. It would be useful however to avoid having to duplication of facet information in an assert which could become unwieldy for, say, a large enumeration. Suggestions - a parser option that 'converted all validation errors to processing errors' - a dfdl expression function that 'applied all facets' or 'applied specific facet' to a particular element. Stephanie will produce some examples of how this could be used.. 04/08: Stephanie had produced examples but they were not discussed due to lack of time 11/08: We started to discuss Stephanie's HIPPA example but ran out of time. 25/08: Not discussed 01/09: Discuss next week 107 teston/testoff dfdl expression functions. Are these functions still needed. They were introduced to allow individual bits to be set in a byte. Steve to look at TLog and ISO 8583 formats that use existence flags to see if they are still required. 04/08: Not discussed 11/08: Not discussed 25/08: Not discussed 01/09: Steve to progress by Sept 30th 108 dfdl:hidden There has been some discussion on whether the 'hidden' global group should be indicated in some way. 04/08: A lively discussion. 
The specification is works as currently defined so whether changes need to be made to make tooling easier. There shouldn't be 'conventions' in particular tooling as they must be able to properly deal with schema from other tools that would not obey those conventions. Steve stated that it is often dangerous to hide too much from users when they can see they underlying schema. To be continued. 25/08: there has been some offline discussions about simplifying how hidden elements are implemented. The proposal is dfdl:hidden property on xs:element only xs:minOccurs and xs:maxOccurs MUST be 0 when hidden dfdl:minOccurs and dfdl:maxOccurs for hidden elements only. An element is 'required' when dfdl:minOccurs >0 and normal default processing occurs. The schema, without dfdl annotations, must match the infoset so assumption is that non-DFDL tools, such as mappers, will ignore/not show elements with xs:minOccurs and xs:maxOccurs = '0' 01/09: The above proposal is flawed due to use of maxOccurs = 0 (this was identified back in 2008 hence current spec). Bob confirmed that NCSA models use hidden in a big way, so punting hidden beyond 1.0 is not an option. Two candidates: - As per spec but with syntactic improvements to make it clear that the two xs:sequences do not take any dfdl:sequence properties - Place a flag directly on a local element and force minOccurs to be 0. Simpler syntax but the semantic changes, as the element *could* be legally in the infoset, although a DFDL parser would never put it there. Steve will circulate the two proposals for next week. Bob to talk to Alejandro as the NCSA implementation is currently more flexible than the spec, allowing the groupref to point to a choice, and an elementref. Are these really needed? 111 Daffodil DFDL parser 11/08: Bob and Alejandro described the new implementation that they have developed. It is a new code base and is not based on the Deffudle prototype. 
It is written in scala and implements approximately 80% of the features in the public comments draft of DFDL V1. Alejandro will send a list of the features not implemented. We discussed the scenarios that motivated the development which was to extract data from various sources and transform into canonical formats. Bob offered to make Daffodil available for the WG to assess the functionality. IBM WG members will get approval the company to allow them to receive Daffodil. Bob raised the question that if Daffodil becomes the public implementation of DFDL then we will need to work out how that would be funded and managed. It would be helpful if IBM test cases were available to Daffodil. IBM will investigate 25/08: Alejandro had sent a list of the functions that he has implemented and Steve ahd responding indicating the extra functions he thought were essential. Since then Alejandro has implemented some of the missing functions, such as escape schemes, pre-defined variables, binary decimal numbers, etc, and will update his list. Bob is planning to make the parser available on the internet to allow testing. His organisation is being reorganised and he doesn't know what the priority of Daffodill will be so it is essential that we move quickly. It would help if IBM could indicate its support for Daffodil in some semi-formal way. 01/09: Alejandro updating Daffodil to include escape schemes, unordered sequences and ignoreCase. Daffodil being placed under formal source control in anticipation of external release. Bob has a start October deadline to create a report on what has been done for his sponsors. It would be great if we could get Daffodil on the web and have run some IBM tests so it could be highlighted at OGF 30 at end October. 112 DFDL certification process 25/08: Discussed how to certify DFDL implementations. Alan to investigate if OGF have a defined process. 01/09: In progress, spec needs to state what conformance means, as part of this work 113 2. 
Regular Expressions.
25/08: The DFDL regular expressions should provide lookahead and backreferences. Is the current regular expression language sufficient? Discussed two aspects:
a. Is the XML regular expression language the correct one to use? Tim asked if DFDL needs to specify a language at all or should leave it to implementers to pick one. That would inhibit portability of schema.
b. A regular expression property on an assert/discriminator as an alternative to the test expression. Either a DFDL expression or a regular expression could be specified but not both.
01/09: There are many variations of regexp language; it seems wise to specify one that we know contains functions like lookaround, which makes it easy to say things like 'give me everything up to but not including x'. This rules out XML Schema and POSIX; it needs Perl 5 or Java. Tim to convince Steve (via example) that use of regexp in asserts is needed in 1.0.

114 3. OGF 30
25/08: OGF30 takes place on October 25-29 in Brussels. Should we have a WG session?
01/09: Given emergence of NCSA implementation and spec completion target of 30th Sept it makes sense to host a session at OGF 30.

Regards

Alan Powell
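
For readers unfamiliar with the two regular expression features named under action 113, here is a small sketch of lookahead and backreferences. It uses Python's re module purely for illustration. The minutes name Perl 5 and Java as the candidate languages, and this example makes no claim about which syntax DFDL will adopt. The sample strings are made up.

```python
import re

# Lookahead: "give me everything up to but not including x".
# (?=x) checks that an x follows without consuming it.
up_to_x = re.match(r".*?(?=x)", "abcxdefx").group(0)

# Backreference: \1 must repeat whatever group 1 captured (a doubled character).
doubled = re.search(r"(\w)\1", "hello").group(0)

print(up_to_x)  # abc
print(doubled)  # ll
```

Neither construct is available in the XML Schema regular expression language, which is the reason given in the 01/09 entry for looking at Perl 5 or Java.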

Definition of 'missing element' - some edge cases
by Tim Kimber 07 Sep '10

I'm going to send this and then duck - we've discussed the subject of missing-ness and defaulting at considerable length already. However, I genuinely do have some new information for your consideration so please hear me out.

I'm seeking the opinion of the working group on the following questions:
a) can an element reliably be categorised as 'missing' when separatorPolicy='suppressed'?
b) is it possible for an element to be 'missing' if it has lengthKind='explicit' and its length is a static, non-zero value?
c) is it possible for an element to be 'missing' if it has a discriminator that has already evaluated to 'true'?

For reference, the specification (v0.42) says this concerning missing elements:

    Definition 'missing element'
    On parsing, an element is missing if its content region in the data stream is empty. The initiator and terminator regions of a missing element may, or may not, also be empty as controlled by the dfdl:emptyValueDelimiterPolicy property (simple and complex element), or dfdl:nilValueDelimiterPolicy property (simple element).

Question a)
Compare the following data streams. In both cases, assume that
- separator is comma and separatorPosition is 'infix'
- missingValueDelimiterPolicy is set to 'none' so a 'missing' value should not have an initiator.
- the initiators are A:, B: and C:
- values are a,b,c.

    separatorPolicy='required'   : A:a,,C:c
    separatorPolicy='suppressed' : A:a,C:c

In the 'required' case, the parser detects that the initiator is missing, then looks to see whether the content region is zero-length. It is, so the element is 'missing'.

In the 'suppressed' case, the parser detects that the initiator is missing, then looks to see whether the content region is zero-length. It looks for a delimiter at the current position and finds 'C'. 'C' is not a delimiter, so the content region is not zero-length. So the parser throws a processing error - "initiator for element B was not found in the data".
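
The two data streams above can be walked through with a toy parser. This is a sketch for illustration only, not how any DFDL processor is implemented. It hard-codes the comma separator and the three initiators from the example, and the flag missing_if_no_initiator is a hypothetical switch (not a DFDL property) that models treating an absent initiator as a missing element.

```python
SEPARATOR = ","
INITIATORS = {"A": "A:", "B": "B:", "C": "C:"}  # element name -> initiator


class ProcessingError(Exception):
    """Raised when the data cannot be parsed under the chosen rule."""


def parse(data, missing_if_no_initiator=False):
    """Parse the sequence A, B, C; a value of None marks a missing element."""
    pos = 0
    result = {}
    for name, initiator in INITIATORS.items():
        if data.startswith(initiator, pos):
            # Initiator found: content runs to the next separator or end of data.
            pos += len(initiator)
            end = data.find(SEPARATOR, pos)
            if end == -1:
                end = len(data)
            result[name] = data[pos:end]
            pos = end
        else:
            # No initiator: the content region is provably empty only if the
            # parser is already sitting on a separator or at end of data.
            at_delimiter = pos == len(data) or data.startswith(SEPARATOR, pos)
            if at_delimiter or missing_if_no_initiator:
                result[name] = None
            else:
                raise ProcessingError(
                    f"initiator for element {name} was not found in the data"
                )
        # Consume an infix separator if one follows.
        if data.startswith(SEPARATOR, pos):
            pos += len(SEPARATOR)
    return result


print(parse("A:a,,C:c"))                               # 'required' stream: B is missing
print(parse("A:a,C:c", missing_if_no_initiator=True))  # 'suppressed' stream with the extra rule
try:
    parse("A:a,C:c")                                   # 'suppressed' stream, rule as written
except ProcessingError as err:
    print("processing error:", err)
```

With the rule as written, only the 'required' stream yields a missing B. The 'suppressed' stream fails unless the parser is allowed to infer an empty content region from the absent initiator.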
I don't think the 'suppressed' behaviour is what a user will expect, nor what the WG intended when these rules were drawn up. The problem is that the parser cannot reliably determine the length of the content region when separatorPolicy='suppressed'. It can, however, reliably detect whether the element is present - the initiator gives a strong hint about that.

Somebody may say "well duh! Of course the content region is empty if the initiator is not present". That may be a reasonable rule, but it is not the rule currently given in the specification. Note that the content region has not been looked at, so that rule relies on the parser speculatively parsing the element and then backtracking because the initiator is not found. If we allow that, then why not allow default values to be applied after other types of processing error (even for cases where no initiator was defined)? There are good reasons for not applying defaults after normal backtracking (hence the current rule), so any such 'missing initiator implies empty content' rule would have to be made explicit in the specification.

Possible refinements of the rules:

a) IF the length of the content region cannot reliably be determined (lengthKind='delimited' and separatorPolicy='suppressed') AND emptyValueDelimiterPolicy does not include the initiator AND the element has an initiator AND the initiator was not found THEN assume that the content length is zero and treat the element as missing.

or

b) IF (the element has an initiator AND the initiator was not found) THEN IF the parent group has initiatedContent='yes' THEN the element is missing ELSE apply the existing rules.

b) would provide a way to get defaults applied in situations where the content region's length is either fixed or undefined. Quite a lot of users might assume this behaviour anyway.

Question b)
A similar situation can arise when lengthKind='explicit' and the length is fixed (i.e. is not a DFDL expression).
Unless the missing field occurs at the end of a known-length structure, the length of the content region will never be zero. I think a similar rule is required for this case also:

- IF the length of the content region is fixed (lengthKind='explicit' and length is a static, non-zero value) AND emptyValueDelimiterPolicy does not include the initiator AND the element has an initiator AND the initiator was not found THEN assume that the content length is zero and treat the element as missing.

...or apply suggestion b) above.

Question c)
Suppose that an element has a discriminator, and it has already evaluated to 'true' (it must have been a backward reference to some previously-parsed field). The discriminator has unambiguously stated that the element *is* present in the data. If it is subsequently found to have a zero-length content region, should the parser treat it as 'missing' and attempt to apply a default? I don't think so.

Please tell me that I'm missing something obvious here - it's starting to sound complicated again.

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com Tel. 01962-816742 Internal tel. 246742

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
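As a rough illustration of the refinements proposed in the message above, the sketch below encodes rule a) (together with its fixed-length analogue from question b)) and rule b) as two predicates. Everything here is hypothetical: ElementState and its field names are invented for the example, and the set of emptyValueDelimiterPolicy values ('none', 'initiator', 'terminator', 'both') is an assumption rather than something stated in the message. "Apply existing rules" simply means the proposal does not decide the outcome.

```python
from dataclasses import dataclass
from typing import Optional

MISSING = "missing"
EXISTING_RULES = "apply existing rules"


@dataclass
class ElementState:
    """Hypothetical snapshot of what the parser knows at an element."""
    initiator: Optional[str]                     # None if no initiator is defined
    initiator_found: bool
    length_kind: str                             # e.g. "delimited" or "explicit"
    separator_policy: str = "required"           # policy of the enclosing sequence
    empty_value_delimiter_policy: str = "none"   # assumed: none/initiator/terminator/both
    static_length: Optional[int] = None          # set only for a literal, non-expression length
    parent_initiated_content: str = "no"


def _initiator_absent(e: ElementState) -> bool:
    return e.initiator is not None and not e.initiator_found


def refinement_a(e: ElementState) -> str:
    """Rule a), extended with the fixed-length variant from question b)."""
    length_undeterminable = (
        e.length_kind == "delimited" and e.separator_policy == "suppressed"
    )
    length_fixed_nonzero = (
        e.length_kind == "explicit"
        and e.static_length is not None
        and e.static_length > 0
    )
    policy_excludes_initiator = e.empty_value_delimiter_policy not in (
        "initiator", "both"
    )
    if ((length_undeterminable or length_fixed_nonzero)
            and policy_excludes_initiator
            and _initiator_absent(e)):
        return MISSING
    return EXISTING_RULES


def refinement_b(e: ElementState) -> str:
    """Rule b): a missing initiator under initiatedContent='yes' means missing."""
    if _initiator_absent(e) and e.parent_initiated_content == "yes":
        return MISSING
    return EXISTING_RULES


# Element B from question a): delimited, separators suppressed, initiator not found.
b = ElementState("B:", False, "delimited", separator_policy="suppressed")
print(refinement_a(b), "/", refinement_b(b))
```

The design difference shows up in what each rule keys on: rule a) depends on the element's length and delimiter-policy properties, whereas rule b) depends only on the absent initiator and the parent group's initiatedContent setting, which is why it also covers fixed-length and undefined-length content.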

Minutes of DFDL WG call September 1st 2010
by Steve Hanson 01 Sep '10

Regards Steve Hanson Strategy, Common Transformation & DFDL Co-Chair, OGF DFDL WG IBM SWG, Hursley, UK, smh(a)uk.ibm.com, tel +44-(0)1962-815848 Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

Publication and notation of DFDL
by Jakob Voss 01 Sep '10

Hi,

I stumbled upon DFDL while searching for information about data archaeology - very interesting and relevant work! Is it already applied in practice for information preservation in libraries and archives?

Unfortunately DFDL is not documented very well, compared to standards of W3C and similar institutions. Do you plan to set up a website with a more readable description of DFDL, like other popular standards have? json.org is a good example because it describes the JSON standard in an easy-to-understand way, with links to implementations.

My second question is about the notation of DFDL. Has anyone tried to create a notation that is not based on XML? For instance, Notation 3 is much more readable than RDF/XML, and Backus-Naur Form is more readable than a grammar formally defined in mathematical formulas. Especially if you describe non-XML formats, it is a barrier to have to set up the whole XML framework stack in order to use DFDL.

I think that DFDL has strong potential, but in its current form (both the way it is documented and its notation) it does not encourage potential users to adopt it.

Cheers
Jakob Voss

--
Verbundzentrale des GBV (VZG)
Digitale Bibliothek - Jakob Voß
Platz der Goettinger Sieben 1
37073 Goettingen - Germany
+49 (0)551 39-10242
http://www.gbv.de
jakob.voss(a)gbv.de
