November 2004 - dfdl-wg - lists.ogf.org

Fw: [dfdl-wg] Annotation complexity
by Suman Kalia 19 Nov '04

You are beginning to see the complexities of overrides in this very simple example. Consider a complex type hierarchy that is four or five levels deep, and then try to determine which override is applicable and where in the tree.

I briefly mentioned in the call on Wednesday that we should carefully determine which annotations are applicable to which constructs of XML Schema and try to avoid the override mechanism.

Annotations that exist on the element belong only to the element; there is no inheritance or override issue here.

Annotations that exist on structural constructs (groups, complex types, etc.) truly belong to the structure only. Examples are data element separators and delimiters, which cannot be associated with elements because elements could be reused via element ref in other structures, where they could have a different delimiter.

Once we have separated the types and elements, annotations defined on derived types can follow the well-established rules of inheritance, i.e. inherit the annotation from the parent unless explicitly overridden.

Then comes the issue of defaults: where to locate and apply them. Possible options are:

a) the top-level type (which, for example, could be the type corresponding to an 01-level COBOL structure);
b) a separate structure, available at tooling time and at runtime, which contains the defaults.

We used the latter in our implementation.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 03:04 PM -----

"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 02:25 PM
To: dfdl-wg(a)gridforum.org
cc:
Subject: RE: [dfdl-wg] Annotation complexity

I guess I'm not sure how restricting annotations to elements solves things. I think I can recreate the problems in Martin's examples without putting annotations on types. The issue of it being hard to understand that triple overrides the dfdlfromstrings param would seem to be the same whether the triple type has an annotation or some subelements within it get annotations (either first, second and third, or consider a triple type that specifies an annotated element containing those three). In all these cases, it is clear that you have to walk down the logical hierarchy, which is broken into parts in the dfdl/xsd file, and keep a stack of contexts if we allow any default/scoped annotations.

If annotations are allowed on both types and elements, what I find even more difficult are situations where the triple type has one default, and the element in "data" with that type has an annotation specifying the opposite param value. Do we consider the element to be above the type in the scope hierarchy?

For more fun, what if triple is derived from some other type where the annotation is defined? Would an annotation on the "data" element be inherited by the sub-element of type triple, or would the inheritance from the triple base type win (i.e. neither the element of type triple nor the triple type itself is directly annotated)? (Or consider an annotation on the "first" element defined in the base type for triple rather than on the base type directly: does the element annotation inherited from the type hierarchy trump the one from the element hierarchy?)

An attempt at a picture where only elements have annotations:

Element A : param=littleendian
   SubElement B: type ST

Type T:
   SubElement C: param: bigendian

Type ST: subtype of T

What is the param value of element C at A/B/C?

I guess I see a need to keep some hierarchically scoped defaults (a file that has some ascii info and then a base64 encoded section of littleendian stuff), but xsd makes it hard to define a single hierarchy. Perhaps some rule of precedence - resolve annotations from type to subtype first, then push those onto the stack of element scopes - would make things unambiguous, if not user friendly.

Jim

> -----Original Message-----
> From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Martin Westhead
> Sent: Friday, November 19, 2004 1:38 PM
> To: Martin Westhead
> Cc: dfdl-wg(a)gridforum.org
> Subject: Re: [dfdl-wg] Annotation complexity
>
> Sorry, the elements in the triple were all supposed to be of a simple type, e.g.:
>
> <xs:complexType name="triple">
>   <xs:annotation>
>     <xs:appinfo>
>       <dfdlFromBinary/>
>     </xs:appinfo>
>   </xs:annotation>
>   <xs:sequence>
>     <xs:element name="first" type="xs:int"/>
>     <xs:element name="second" type="xs:int"/>
>     <xs:element name="third" type="xs:int"/>
>   </xs:sequence>
> </xs:complexType>
>
> <xs:complexType name="data">
>   <xs:annotation>
>     <xs:appinfo>
>       <dfdlFromStrings/>
>     </xs:appinfo>
>   </xs:annotation>
>   <xs:sequence>
>     <xs:element name="triple"/>
>   </xs:sequence>
> </xs:complexType>
>
> Martin Westhead wrote:
> > Hi,
> >
> > I think I understand Suman's issue with annotations on the Schema tree.
> > (Please Suman tell me if I am right here). The problem is that lexically
> > there are many trees in an XSD. Whilst in practice these can clearly be
> > considered as a single tree (including, I think, even the simple type
> > hierarchies) by placing all the type definitions inline, this is not the
> > way they appear to the user. So for example if I have a file with
> > conflicting annotations looking like:
> >
> > <xs:complexType name="triple">
> >   <xs:annotation>
> >     <xs:appinfo>
> >       <dfdlFromBinary/>
> >     </xs:appinfo>
> >   </xs:annotation>
> >   <xs:sequence>
> >     <xs:element name="first" type="xs:int"/>
> >     <xs:element name="second"/>
> >     <xs:element name="third"/>
> >   </xs:sequence>
> > </xs:complexType>
> >
> > <xs:complexType name="data">
> >   <xs:annotation>
> >     <xs:appinfo>
> >       <dfdlFromStrings/>
> >     </xs:appinfo>
> >   </xs:annotation>
> >   <xs:sequence>
> >     <xs:element name="triple"/>
> >   </xs:sequence>
> > </xs:complexType>
> >
> > So what I imagined is that we would assume that the "triple" type is
> > considered _inside_ the scope of the "data" type and so the
> > "dfdlFromBinary" tag wins.
> >
> > On the other hand the user sees two trees of equal depth with
> > conflicting annotations. The examples can obviously get much more
> > intricate.
> >
> > The issue is really that the scope of the annotations is not lexically
> > defined. At some level this is just like having globally included
> > variables in a programming language. On the other hand we have arbitrary
> > levels of these.
> >
> > Suman, is this the problem?
> >
> > If this is the problem, and we agree that it is too confusing to the
> > user (my opinion is still out on this), then I see that the conclusion
> > is to adopt an approach similar to IBM's, in which annotations can appear
> > only on <element> and <attribute> tags. Even the top level of the file
> > is confusing since there may be many files involved. I guess we can also
> > have runtime defaults and default settings set in the standard. I don't
> > like this conclusion, incidentally; can someone convince me it is the
> > wrong one?
> >
> > Martin
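The separation Suman describes at the top of this message (element annotations stay on the element, structural annotations follow type derivation, and defaults live in a separate structure) could be sketched roughly as follows. This is only an illustration under stated assumptions: the property names `byteOrder` and `separator`, their default values, and the `TypeDef` class are hypothetical and are not taken from any DFDL draft or from the IBM implementation.

```python
from typing import Dict, Optional

# Option (b) from the message: a separate table of defaults, consulted last.
# The property names and values here are hypothetical.
DEFAULTS: Dict[str, str] = {"byteOrder": "bigEndian", "separator": ","}


class TypeDef:
    """A structural construct (e.g. a complex type) with an optional base type."""

    def __init__(self, name: str, base: Optional["TypeDef"] = None,
                 annotations: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.base = base
        self.annotations = dict(annotations or {})


def structure_property(type_def: Optional[TypeDef], prop: str) -> str:
    """Structural annotations: inherit from the parent type unless overridden."""
    t = type_def
    while t is not None:
        if prop in t.annotations:
            return t.annotations[prop]
        t = t.base
    return DEFAULTS[prop]


def element_property(element_annotations: Dict[str, str], prop: str) -> str:
    """Element annotations: only the element itself is consulted, then defaults.

    The containing structure is deliberately never looked at.
    """
    return element_annotations.get(prop, DEFAULTS[prop])


record = TypeDef("Record", annotations={"separator": ";"})
derived = TypeDef("DerivedRecord", base=record)
piped = TypeDef("PipeRecord", base=record, annotations={"separator": "|"})

print(structure_property(derived, "separator"))   # inherited from Record
print(structure_property(piped, "separator"))     # explicitly overridden
print(element_property({}, "byteOrder"))          # falls back to DEFAULTS
```

The point of the sketch is that neither lookup ever needs a stack of enclosing scopes, which is the simplification the message argues for.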

RE: [dfdl-wg] Annotation complexity
by mike.beckerle＠ascentialsoftware.com 19 Nov '04

I don't see the problem here.

> An attempt at a picture where only elements have annotations:
>
> Element A : param=littleendian
>    SubElement B: type ST
> Type T:
>    SubElement C: param:bigendian
>
> Type ST: subtype of T
>
> What is the param value of element C at A/B/C?

I propose this reasoning:

1. Local element definition has highest precedence.
2. Then the type of the element, including subtype-based inheritance.
3. Then lexical scope.

So the param value of element C is bigEndian. To me this is clearly correct.

However, I think the type names T and ST should better reflect that these types include representation information; otherwise the user writing the XSD for A with subelement B will be surprised. E.g., type T could be MainframeCobolComp3 and type ST could be MainframeCobolComp3_8_2, meaning with restrictions to 8 digits and 2 fractional digits.

The point is that these type names should reflect that these types include representation matters; otherwise, why wouldn't someone use them in a context where they think they still have control of the byteOrder?

...mikeb
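The three-tier reasoning in this message can be tried out on the A/B/C picture with a small sketch. It is a rough illustration, not a proposed implementation: the `Element` and `TypeDef` classes and the type name `AType` are hypothetical, and the third tier ("lexical scope") is modelled here as the chain of enclosing elements on the path, which is an assumption because the message does not define it further.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TypeDef:
    """A named type; `base` models 'subtype of', `elements` are its children."""
    name: str
    base: Optional["TypeDef"] = None
    annotations: Dict[str, str] = field(default_factory=dict)
    elements: Dict[str, "Element"] = field(default_factory=dict)


@dataclass
class Element:
    name: str
    type: Optional[TypeDef] = None
    annotations: Dict[str, str] = field(default_factory=dict)


def find_child(t: Optional[TypeDef], name: str) -> Optional[Element]:
    """Find a child element declared on a type or on one of its base types."""
    while t is not None:
        if name in t.elements:
            return t.elements[name]
        t = t.base
    return None


def type_annotation(t: Optional[TypeDef], param: str) -> Optional[str]:
    """Subtype-based inheritance: the most derived type carrying `param` wins."""
    while t is not None:
        if param in t.annotations:
            return t.annotations[param]
        t = t.base
    return None


def resolve(root: Element, path: List[str], param: str) -> Optional[str]:
    chain = [root]
    for name in path:
        child = find_child(chain[-1].type, name)
        if child is None:
            raise KeyError(name)
        chain.append(child)
    target = chain[-1]
    # 1. Local element definition has highest precedence.
    if param in target.annotations:
        return target.annotations[param]
    # 2. Then the type of the element, including inherited annotations.
    value = type_annotation(target.type, param)
    if value is not None:
        return value
    # 3. Then scope (assumption: nearest enclosing element, or its type).
    for enclosing in reversed(chain[:-1]):
        if param in enclosing.annotations:
            return enclosing.annotations[param]
        value = type_annotation(enclosing.type, param)
        if value is not None:
            return value
    return None


c = Element("C", annotations={"param": "bigendian"})
t = TypeDef("T", elements={"C": c})
st = TypeDef("ST", base=t)
b = Element("B", type=st)
a = Element("A", type=TypeDef("AType", elements={"B": b}),
            annotations={"param": "littleendian"})

print(resolve(a, ["B", "C"], "param"))  # C carries its own annotation
print(resolve(a, ["B"], "param"))       # B has none, so the scope of A applies
```

Under these assumptions C resolves to its own bigendian annotation, matching the conclusion in the message, while B (which has no annotation of its own or on its types) picks up littleendian from A.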

Fw: [dfdl-wg] Annotation complexity
by Suman Kalia 19 Nov '04

Yes -- the annotations defined on the element stay on the element and do not follow the 'contains' hierarchy. This is not ideal, but it reduces the complexity of overrides between elements and types.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 04:22 PM -----

"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 04:12 PM
To: dfdl-wg(a)gridforum.org
cc:
Subject: RE: [dfdl-wg] Annotation complexity

I think I get the issue, but not necessarily the proposed solution. I can have an element with a complex type that contains other elements with complex types, and I might want params such as littleendian to follow that hierarchy, independent of the type derivation hierarchy. Are you saying that inheritance should never flow down the 'contains' hierarchy?

Jim

PS. After two days of this in December, I may need help in getting myself to the airport ... ;-)

RE: [dfdl-wg] Annotation complexity
by Myers, James D 19 Nov '04

I think I get the issue, but not necessarily the proposed solution. I can have an element with a complex type that contains other elements with complex types, and I might want params such as littleendian to follow that hierarchy, independent of the type derivation hierarchy. Are you saying that inheritance should never flow down the 'contains' hierarchy?

Jim

PS. After two days of this in December, I may need help in getting myself to the airport ... ;-)

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Suman Kalia
Sent: Friday, November 19, 2004 3:47 PM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] Annotation complexity

You are beginning to see the complexities of overrides in this very simple example. Consider a complex type hierarchy that is four or five levels deep, and then try to determine which override is applicable and where in the tree.

I briefly mentioned in the call on Wednesday that we should carefully determine which annotations are applicable to which constructs of XML Schema and try to avoid the override mechanism.

Annotations that exist on the element belong only to the element; there is no inheritance or override issue here.

Annotations that exist on structural constructs (groups, complex types, etc.) truly belong to the structure only. Examples are data element separators and delimiters, which cannot be associated with elements because elements could be reused via element ref in other structures, where they could have a different delimiter.

Once we have separated the types and elements, annotations defined on derived types can follow the well-established rules of inheritance, i.e. inherit the annotation from the parent unless explicitly overridden.

Then comes the issue of defaults: where to locate and apply them. Possible options are:

a) the top-level type (which, for example, could be the type corresponding to an 01-level COBOL structure);
b) a separate structure, available at tooling time and at runtime, which contains the defaults.

We used the latter in our implementation.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

RE: [dfdl-wg] Annotation complexity
by Myers, James D 19 Nov '04

I guess I'm not sure how restricting annotations to elements solves things. I think I can recreate the problems in Martin's examples without putting annotations on types. The issue of it being hard to understand that triple overrides the dfdlfromstrings param would seem to be the same whether the triple type has an annotation or some subelements within it get annotations (either first, second and third, or consider a triple type that specifies an annotated element containing those three). In all these cases, it is clear that you have to walk down the logical hierarchy, which is broken into parts in the dfdl/xsd file, and keep a stack of contexts if we allow any default/scoped annotations.

If annotations are allowed on both types and elements, what I find even more difficult are situations where the triple type has one default, and the element in "data" with that type has an annotation specifying the opposite param value. Do we consider the element to be above the type in the scope hierarchy?

For more fun, what if triple is derived from some other type where the annotation is defined? Would an annotation on the "data" element be inherited by the sub-element of type triple, or would the inheritance from the triple base type win (i.e. neither the element of type triple nor the triple type itself is directly annotated)? (Or consider an annotation on the "first" element defined in the base type for triple rather than on the base type directly: does the element annotation inherited from the type hierarchy trump the one from the element hierarchy?)

An attempt at a picture where only elements have annotations:

Element A : param=littleendian
   SubElement B: type ST

Type T:
   SubElement C: param: bigendian

Type ST: subtype of T

What is the param value of element C at A/B/C?

I guess I see a need to keep some hierarchically scoped defaults (a file that has some ascii info and then a base64 encoded section of littleendian stuff), but xsd makes it hard to define a single hierarchy. Perhaps some rule of precedence - resolve annotations from type to subtype first, then push those onto the stack of element scopes - would make things unambiguous, if not user friendly.

Jim
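The rule of precedence floated at the end of Jim's message (resolve annotations from type to subtype first, then push the result onto a stack of element scopes) might look something like the sketch below. Everything about the data model is an assumption made for illustration: the dictionary layout, the type name `AType`, and in particular the choice that an element's own annotation overrides the annotations resolved from its type, which is exactly the open question the message raises.

```python
from collections import ChainMap

# Hypothetical in-memory model of the A/B/C picture from the message.
TYPES = {
    "T": {"base": None, "annotations": {},
          "children": {"C": {"type": None,
                             "annotations": {"param": "bigendian"}}}},
    "ST": {"base": "T", "annotations": {}, "children": {}},
    "AType": {"base": None, "annotations": {},
              "children": {"B": {"type": "ST", "annotations": {}}}},
}
ROOT = {"type": "AType", "annotations": {"param": "littleendian"}}


def type_chain(name):
    """Return type names ordered from the root base type down to `name`."""
    chain = []
    while name is not None:
        chain.append(name)
        name = TYPES[name]["base"]
    return list(reversed(chain))


def resolved_type_annotations(name):
    """Step 1: fold annotations from type to subtype; the subtype overrides."""
    merged = {}
    for t in type_chain(name):
        merged.update(TYPES[t]["annotations"])
    return merged


def children_of(name):
    """Children visible on a type, including those inherited from base types."""
    merged = {}
    for t in type_chain(name):
        merged.update(TYPES[t]["children"])
    return merged


def effective_annotations(root, path):
    """Step 2: walk the element path, pushing one scope per element visited."""
    scopes = ChainMap()
    element = root
    remaining = list(path)
    while True:
        scope = resolved_type_annotations(element["type"])
        scope.update(element["annotations"])  # assumption: element beats its type
        scopes = scopes.new_child(scope)      # innermost scope is searched first
        if not remaining:
            return scopes
        element = children_of(element["type"])[remaining.pop(0)]


print(effective_annotations(ROOT, ["B", "C"])["param"])
print(effective_annotations(ROOT, ["B"])["param"])
```

With these choices the lookup is unambiguous, but a reader of the schema still has to reconstruct the scope stack by hand, which is the "if not user friendly" part of the remark.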

Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML
by Suman Kalia 19 Nov '04

Jim -- I agree with most of your assertions, and you have phrased it right: "relatively compliant with physical structure". Some examples from programming languages would be the COBOL "OCCURS DEPENDING ON" clause and, as you mentioned in the example, "a previous value in the structure indicating which field in the choice will be present or how many occurrences a subsequent field will have". These are among the most common constructs in programming structures. I think the DFDL standard is addressing a very critical requirement, "rendering a logical structure to a relatively compliant physical format and vice versa", which to my knowledge no other public standard has addressed so far, and this work is/will be very complementary to other standards.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 02:05 PM -----

"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 12:34 PM
To dfdl-wg(a)gridforum.org
cc
Subject RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

Unfortunately, there's a slippery slope here - there are no ints on the disk, just logical ones and zeros that you can transform into a second logical structure composed of ints, assuming you specify byte order. I think we have a whole stream of examples beyond that - removing delimiters, using a length prefix to define the length of a subsequent structure, etc. - that we see as minor transformations to something still relatively "compliant" with the physical structure, but that, I believe, require the same machinery as things I think we will all agree are beyond the scope of what DFDL should aim for.

In practice, I think people should get out of DFDL as soon as possible, just as you say - use other technologies once you get an initial structure. But I think there are cases where you have to stay in DFDL - anything where I have to transform the initial physically-compliant structure to interpret subsequent fields - x and y ints tell me how many pixel repeats there are, an int greater than another int read previously implies a different subsequent structure, etc. And again, the minimal machinery to do that lets you go farther than you'd want people to go in practice.

There may also be reasonable use cases where the ability to stay in DFDL is important. For example, take digital preservation, where I might want to map all document files to a standardized schema, regardless of whether the original was Word, PDF, etc. Being able to specify the full descriptions in one file that then requires only one parser to interpret all formats *might* be worth the cost of doing complex things in DFDL. I don't think our goal for a version 1 should be to support such use, but I don't think we can meet our simple goals without 'accidentally' making it possible.

I'd be happy to be proved wrong - seems like a deep point that would be cool to understand. I'm not sure how we get to a 'proof' though - we're trying to prove that there exists something DFDL as currently formulated can't describe. So - we either need to find that example or turn to some sort of logic formalism to discover what primitive(s) we're missing that keep us from emulating some class of parser/programming. (Or find something in DFDL that we don't need to support the examples we do want to target...).
Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Suman Kalia
Sent: Friday, November 19, 2004 11:50 AM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

I tend to agree that there are 2 inherent logical structures in this scenario. DFDL scope, in my opinion, should be restricted to parsing the physical stream and populating the logical structure which is compliant with the structure of the physical stream, and vice versa. We have numerous options and technologies (XSLT, XSD<->XSD mappers, good old programming languages, XQuery) which do a pretty good job of transforming one logical structure to another logical structure. Building some kinds of annotations which would allow a physical stream to map to a completely different logical structure will make the DFDL language very complex.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM -----

"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 11:05 AM
To dfdl-wg(a)gridforum.org
cc
Subject RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

I was thinking that step 1 involved recognizing the <first/> and <data> elements and creating a sequence of <myfirst>here's the data</myfirst>, <mymiddle>more data</mymiddle> and <mylast>... elements and then assembling the new layer by some sort of choice to concatenate the relevant myfirst, optional mymiddle, and mylast elements for each item. I think that requires a way to make a choice based on the <first/>, <middle/>, <last/> elements and populate either a <myfirst>, <mymiddle>, or <mylast> element (all subtypes of string?) with the contents of the following data element, which I think we can do in DFDL. This is just our standard choice flag that decides which of several options exist.

Then, I think you'd need logic to decide how many elements represent one item, which I think we have, followed by a way to concatenate these elements to produce a string source, which again I think we have (same as saying a complex can be built from two floats referenced from another layer instead of from a float stream). This part is the same problem as having a text file where one <CR> separates lines and <CR><CR> separates paragraphs and you want to create single strings (from a variable number of lines) for each paragraph. Again, I won't argue that this is simple and fun, but I think the machinery exists and is the same as that from our simple examples.

Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

You are thinking along the lines I was; however, the challenge is that I cannot find a way to do this using multilayer, so I'm uncomfortable suggesting that it's possible at all anymore. Here's some reasoning why.

In particular, it's the intersection of the induction across the items with the first, middle*, last thing, and the spanning that seems to defy my efforts to cut it up into progressive transformation layer by layer. In some conversations I've referred to this problem as the "non-conforming trees" problem. The fundamental shapes of the trees are not compatible, and expressing the transformation between them isn't easily done via induction of any kind on one or the other of the trees.

To me the First, Middle*, Last thing is very problematic. It's effectively a little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs interesting and complex is always the way you diagnose malformed data in addition to recognizing correct data. Now, a finite-state-machine is, to my mind, the ultimate procedural abstraction, the quintessential opposite of "declarative" expression. To be declarative about a FSM you end up saying "recognize this regular language", and providing a description of the regular language, which is, of course, just begging the question of how it actually works. (And for us, we're not really talking about a regular language of character text, but a pattern of usage in the binary data layout that obeys the pattern of a regular language. So it's not like having a little regular expression thing for validating text strings helps with this problem.)

I guess I'm arguing that a black box approach to this is not only acceptable, but is highly likely to be the only "good" way to do it. In light of this I've suggested a rep property called "streamFormat" (perhaps should be renamed "recordFormat"), which takes values from the set VS, V, VBS, FB, FBS, etc.: all these well-defined legacy data formats (there are 19 of them, I think). In addition, one should be able to extend this by introduction of a blackbox transformation.

And ... here's the rub...if that's true for this case, then other "hard" examples like run-length encoding seem also in this category.

There are several "leaps of faith" just made in these arguments, so I'd still like people to take this "XML challenge" and see if there's some magic I'm overlooking.

...mikeb

From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Without digging too much into the details, I'd say this is an example where multi-layer comes in. The DFDL would describe a hidden layer in which the first, middle, last data elements would be identified and put into a list, and then that hidden list would be used as the input to create items in the output layer. I think this is conceptually similar to one of our run-length encoding examples (more complex of course). If you read a sequence of ints and then a sequence of floats and need to output a sequence of floats with int[i] repeats of float[i], it would be easiest to create a hidden layer representing the int and float sequences and to then produce output from that. If you don't think about a layer, even this example gets painful - I need to read an int, skip forward somewhere to find a float, skip back to get the next int, etc.

Mike's full example, not starting with the XML-ized version, might be something that requires more than one layer - read the original into something with the XML schema Mike defines, then a layer making a sequence of data elements, and then something that has the desired logical output. I guess I would claim that this would not be too bad a way to describe a fairly complex format in terms of a fairly different logical structure. Whether one *should* do this in DFDL, or whether it would make more sense to a) write a black box parser to get to items, or b) use DFDL to get to the initial schema Mike wrote and use XSLT afterwards to convert to the desired logical structure, is a separate question. I think there are enough cases where we need the multilayer functionality in DFDL that are relatively simple that we have to have it, which means it will then be possible to deal with complex transformations in DFDL even if not simple/practical.
Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

I've come up with a way to articulate the difficulties I'm having with DFDL for complex file formats. This problem may not be that hard for someone with more XML, XPath or XQuery experience, so I'd appreciate it if you could look it over and if necessary even run it by your resident XML experts. In case the emailer mangles all the line lengths, I've also attached the below as a file.

<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>

<sequence>
  <element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence>

<BLOCK>
  <SEGMENT>
    <WHOLE/>
    <DATA>The first item</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <FIRST/>
    <DATA>Thi</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>s is t</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>he sec</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <LAST/>
    <DATA>ond item</DATA>
  </SEGMENT>
  <SEGMENT>
    <WHOLE/>
    <DATA>Third item</DATA>
  </SEGMENT>
</BLOCK>

<complexType name="Format_VS_t">
  <sequence>
    <element name="BLOCK" type="Block_t" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
</complexType>
<complexType name="Block_t">
  <sequence>
    <element name="SEGMENT" type="Segment_t" minOccurs="1" maxOccurs="2"/>
  </sequence>
</complexType>
<complexType name="Segment_t">
  <sequence>
    <choice>
      <element name="WHOLE"> </element>
      <element name="FIRST"> </element>
      <element name="LAST"> </element>
      <element name="MIDDLE"> </element>
    </choice>
    <element name="DATA" type="string"/>
  </sequence>
</complexType>
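To make the First, Middle*, Last recognition discussed in this thread concrete, here is a minimal Python sketch of the finite-state machine, assuming the BLOCK/SEGMENT layer has already been parsed into (flag, data) pairs. The function name reassemble_items and the pair representation are assumptions made for this illustration only; nothing here is proposed DFDL syntax or anyone's actual implementation.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Segment flags, named after the elements in the <choice> of Segment_t.
WHOLE, FIRST, MIDDLE, LAST = "WHOLE", "FIRST", "MIDDLE", "LAST"


def reassemble_items(segments: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield one logical ITEM per WHOLE segment or FIRST MIDDLE* LAST run.

    Two states: 'between items' (parts is None) and 'inside a spanned
    item' (parts is a list of the pieces collected so far).
    """
    parts: Optional[List[str]] = None
    for flag, data in segments:
        if parts is None:
            if flag == WHOLE:
                yield data
            elif flag == FIRST:
                parts = [data]
            else:
                # Diagnosing malformed input is part of the recognizer.
                raise ValueError(f"{flag} segment with no open item")
        else:
            if flag == MIDDLE:
                parts.append(data)
            elif flag == LAST:
                parts.append(data)
                yield "".join(parts)
                parts = None
            else:
                raise ValueError(f"{flag} segment inside an open item")
    if parts is not None:
        raise ValueError("input ended inside a spanned item")


# The segments of the BLOCK/SEGMENT instance above, flattened across blocks.
segments = [
    (WHOLE, "The first item"),
    (FIRST, "Thi"),
    (MIDDLE, "s is t"),
    (MIDDLE, "he sec"),
    (LAST, "ond item"),
    (WHOLE, "Third item"),
]
items = list(reassemble_items(segments))
print(items)
```

Note that block boundaries play no role once the segments are flattened, which is one way of seeing the "non-conforming trees" point: the item structure is an induction over the segment stream, not over the block tree.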


Annotation complexity
by Martin Westhead 19 Nov '04

19 Nov '04

Hi,

I think I understand Suman's issue with annotations on the Schema tree. (Please, Suman, tell me if I am right here.)

The problem is that lexically there are many trees in an XSD. Whilst in practice these can clearly be considered as a single tree (including, I think, even the simple type hierarchies) by placing all the type definitions inline, this is not the way they appear to the user. So, for example, if I have a file with conflicting annotations looking like:

<xs:complexType name="triple">
  <xs:annotation>
    <xs:appinfo>
      <dfdlFromBinary/>
    </xs:appinfo>
  </xs:annotation>
  <xs:sequence>
    <xs:element name="first"/>
    <xs:element name="second"/>
    <xs:element name="third"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="data">
  <xs:annotation>
    <xs:appinfo>
      <dfdlFromStrings/>
    </xs:appinfo>
  </xs:annotation>
  <xs:sequence>
    <xs:element name="triple"/>
  </xs:sequence>
</xs:complexType>

What I imagined is that we would assume that the "triple" type is considered _inside_ the scope of the "data" type, and so the "dfdlFromBinary" tag wins. On the other hand, the user sees two trees of equal depth with conflicting annotations. The examples can obviously get much more intricate.

The issue is really that the scope of the annotations is not lexically defined. At some level this is just like having globally included variables in a programming language. On the other hand, we have arbitrary levels of these.

Suman, is this the problem?

If this is the problem, and we agree that it is too confusing to the user (I am still undecided on this), then I see that the conclusion is to adopt an approach similar to IBM's: annotations can appear only on <element> and <attribute> tags. Even the top level of the file is confusing, since there may be many files involved. I guess we can also have runtime defaults and default settings set in the standard.

I don't like this conclusion, incidentally; can someone convince me it is the wrong one?

Martin
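The "inner scope wins" reading described in this message can be sketched as a walk over the inlined type tree that carries the annotation currently in force. The following Python is only an illustration under stated assumptions: the TYPES dictionary is a hypothetical in-memory model of the two complex types, effective_annotations is a made-up helper name, and the element "triple" inside "data" is assumed to have type "triple" (the example schema gives it no type attribute).

```python
from typing import Dict, Optional

# Hypothetical model of the two complex types in the example.
# Each type has an optional annotation and a list of
# (element name, type name) children; a type name of None marks a
# leaf element with no further structure.
# Assumption: the element "triple" inside "data" has type "triple".
TYPES = {
    "triple": {
        "annotation": "dfdlFromBinary",
        "children": [("first", None), ("second", None), ("third", None)],
    },
    "data": {
        "annotation": "dfdlFromStrings",
        "children": [("triple", "triple")],
    },
}


def effective_annotations(
    type_name: str, inherited: Optional[str] = None, path: str = ""
) -> Dict[str, Optional[str]]:
    """Map each leaf element path to the annotation in force there.

    The innermost annotated type on the path wins; a type without its
    own annotation passes the inherited one down unchanged.
    """
    type_def = TYPES[type_name]
    in_force = type_def["annotation"] or inherited
    result: Dict[str, Optional[str]] = {}
    for elem_name, elem_type in type_def["children"]:
        elem_path = f"{path}/{elem_name}"
        if elem_type is None:
            result[elem_path] = in_force
        else:
            result.update(effective_annotations(elem_type, in_force, elem_path))
    return result


resolved = effective_annotations("data")
for elem_path, annotation in resolved.items():
    print(elem_path, annotation)
```

Under this rule every leaf below /triple resolves to dfdlFromBinary, even though nothing in the lexical layout of the file places "triple" inside "data"; the scope only becomes visible once the types are inlined, which is the source of the confusion.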


RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML
by Myers, James D 19 Nov '04

19 Nov '04

Unfortunately, there's a slippery slope here - there are no ints on the disk, just logical ones and zeros that you can transform into a second logical structure composed of ints, assuming you specify byte order. I think we have a whole stream of examples beyond that - removing delimiters, using a length prefix to define the length of a subsequent structure, etc. - that we see as minor transformations to something still relatively "compliant" with the physical structure, but that, I believe, require the same machinery as things I think we will all agree are beyond the scope of what DFDL should aim for.

In practice, I think people should get out of DFDL as soon as possible, just as you say - use other technologies once you get an initial structure. But I think there are cases where you have to stay in DFDL - anything where I have to transform the initial physically-compliant structure to interpret subsequent fields - x and y ints tell me how many pixel repeats there are, an int greater than another int read previously implies a different subsequent structure, etc. And again, the minimal machinery to do that lets you go farther than you'd want people to go in practice.

There may also be reasonable use cases where the ability to stay in DFDL is important. For example, take digital preservation, where I might want to map all document files to a standardized schema, regardless of whether the original was Word, PDF, etc. Being able to specify the full descriptions in one file that then requires only one parser to interpret all formats *might* be worth the cost of doing complex things in DFDL. I don't think our goal for a version 1 should be to support such use, but I don't think we can meet our simple goals without 'accidentally' making it possible.

I'd be happy to be proved wrong - seems like a deep point that would be cool to understand. I'm not sure how we get to a 'proof' though - we're trying to prove that there exists something DFDL as currently formulated can't describe. So - we either need to find that example or turn to some sort of logic formalism to discover what primitive(s) we're missing that keep us from emulating some class of parser/programming. (Or find something in DFDL that we don't need to support the examples we do want to target...).

Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Suman Kalia
Sent: Friday, November 19, 2004 11:50 AM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

I tend to agree that there are 2 inherent logical structures in this scenario. DFDL scope, in my opinion, should be restricted to parsing the physical stream and populating the logical structure which is compliant with the structure of the physical stream, and vice versa. We have numerous options and technologies (XSLT, XSD<->XSD mappers, good old programming languages, XQuery) which do a pretty good job of transforming one logical structure to another logical structure. Building some kinds of annotations which would allow a physical stream to map to a completely different logical structure will make the DFDL language very complex.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM -----

"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 11:05 AM
To dfdl-wg(a)gridforum.org
cc
Subject RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

I was thinking that step 1 involved recognizing the <first/> and <data> elements and creating a sequence of <myfirst>here's the data</myfirst>, <mymiddle>more data</mymiddle> and <mylast>...
elements and then assembling the new layer by some sort of choice to concatenate the relevant myfirst, optional mymiddle, and mylast elements for each item. I think that requires a way to make a choice based on the <first/>, <middle/>, <last/> elements and populate either a <myfirst>, <mymiddle>, or <mylast> element (all subtypes of string?) with the contents of the following data element, which I think we can do in DFDL. This is just our standard choice flag that decides which of several options exist.

Then, I think you'd need logic to decide how many elements represent one item, which I think we have, followed by a way to concatenate these elements to produce a string source, which again I think we have (same as saying a complex can be built from two floats referenced from another layer instead of from a float stream). This part is the same problem as having a text file where one <CR> separates lines and <CR><CR> separates paragraphs and you want to create single strings (from a variable number of lines) for each paragraph. Again, I won't argue that this is simple and fun, but I think the machinery exists and is the same as that from our simple examples.

Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

You are thinking along the lines I was; however, the challenge is that I cannot find a way to do this using multilayer, so I'm uncomfortable suggesting that it's possible at all anymore. Here's some reasoning why.

In particular, it's the intersection of the induction across the items with the first, middle*, last thing, and the spanning that seems to defy my efforts to cut it up into progressive transformation layer by layer. In some conversations I've referred to this problem as the "non-conforming trees" problem. The fundamental shapes of the trees are not compatible, and expressing the transformation between them isn't easily done via induction of any kind on one or the other of the trees.

To me the First, Middle*, Last thing is very problematic. It's effectively a little regular language (in the formal sense) that has to be recognized. Generally this requires a finite-state-machine, and what makes FSMs interesting and complex is always the way you diagnose malformed data in addition to recognizing correct data. Now, a finite-state-machine is, to my mind, the ultimate procedural abstraction, the quintessential opposite of "declarative" expression. To be declarative about a FSM you end up saying "recognize this regular language", and providing a description of the regular language, which is, of course, just begging the question of how it actually works. (And for us, we're not really talking about a regular language of character text, but a pattern of usage in the binary data layout that obeys the pattern of a regular language. So it's not like having a little regular expression thing for validating text strings helps with this problem.)

I guess I'm arguing that a black box approach to this is not only acceptable, but is highly likely to be the only "good" way to do it. In light of this I've suggested a rep property called "streamFormat" (perhaps should be renamed "recordFormat"), which takes values from the set VS, V, VBS, FB, FBS, etc.: all these well-defined legacy data formats (there are 19 of them, I think). In addition, one should be able to extend this by introduction of a blackbox transformation.

And ... here's the rub...if that's true for this case, then other "hard" examples like run-length encoding seem also in this category.
There are several "leaps of faith" just made in these arguments, so I'd still like people to take this "XML challenge" and see if there's some magic I'm overlooking.

...mikeb

________________________________
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Without digging too much into the details, I'd say this is an example where multi-layer comes in. The DFDL would describe a hidden layer in which the first, middle, last data elements would be identified and put into a list, and then that hidden list would be used as the input to create items in the output layer.

I think this is conceptually similar to one of our run-length encoding examples (more complex of course). If you read a sequence of ints and then a sequence of floats and need to output a sequence of floats with int[i] repeats of float[i], it would be easiest to create a hidden layer representing the int and float sequences and to then produce output from that. If you don't think about a layer, even this example gets painful - I need to read an int, skip forward somewhere to find a float, skip back to get the next int, etc.

Mike's full example, not starting with the XML-ized version, might be something that requires more than one layer - read the original into something with the XML schema Mike defines, then a layer making a sequence of data elements, and then something that has the desired logical output. I guess I would claim that this would not be too bad a way to describe a fairly complex format in terms of a fairly different logical structure.

Whether one *should* do this in DFDL is a separate question: it might make more sense to a) write a black box parser to get to items, or b) use DFDL to get to the initial schema Mike wrote and use XSLT afterwards to convert to the desired logical structure.
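As a rough illustration of the hidden-layer idea for the run-length case above, the following Python sketch first parses the int and float sequences into an intermediate ("hidden") pair of lists and only then builds the output. The binary layout is an assumption made for the example (a known count, then that many little-endian 32-bit ints followed by that many 32-bit floats); the thread does not specify one, and the function names are hypothetical.

```python
import struct


def read_hidden_layer(data, count):
    """Parse `count` ints followed by `count` floats into two lists.

    Assumed layout (not from the thread): little-endian, 4 bytes per value,
    all ints first, then all floats.
    """
    ints = struct.unpack_from(f"<{count}i", data, 0)
    floats = struct.unpack_from(f"<{count}f", data, 4 * count)
    return list(ints), list(floats)


def expand(ints, floats):
    """Build the output layer: floats[i] repeated ints[i] times, in order."""
    out = []
    for repeat, value in zip(ints, floats):
        out.extend([value] * repeat)
    return out
```

With the two sequences held in the hidden layer, the output step is a simple pairwise walk, with no skipping back and forth in the input.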
I think there are enough cases where we need the multilayer functionality in DFDL that are relatively simple that we have to have it, which means it will then be possible to deal with complex transformations in DFDL even if not simple/practical.

Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

I've come up with a way to articulate the difficulties I'm having with DFDL for complex file formats. This problem may not be that hard for someone with more XML, XPath or XQuery experience, so I'd appreciate it if you could look it over and if necessary even run it by your resident XML experts. In case the emailer mangles all the line lengths, I've also attached the below as a file.

<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>

<sequence>
  <element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence>

<BLOCK>
  <SEGMENT>
    <WHOLE/>
    <DATA>The first item</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <FIRST/>
    <DATA>Thi</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>s is t</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>he sec</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <LAST/>
    <DATA>ond item</DATA>
  </SEGMENT>
  <SEGMENT>
    <WHOLE/>
    <DATA>Third item</DATA>
  </SEGMENT>
</BLOCK>

<complexType name="Format_VS_t">
  <sequence>
    <element name="BLOCK" type="Block_t" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
</complexType>
<complexType name="Block_t">
  <sequence>
    <element name="SEGMENT" type="Segment_t" minOccurs="1" maxOccurs="2"/>
  </sequence>
</complexType>
<complexType name="Segment_t">
  <sequence>
    <choice>
      <element name="WHOLE"> </element>
      <element name="FIRST"> </element>
      <element name="LAST"> </element>
      <element name="MIDDLE"> </element>
    </choice>
    <element name="DATA" type="string"/>
  </sequence>
</complexType>
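For comparison with the declarative approaches discussed in this thread, here is a minimal procedural sketch in Python of the BLOCK/SEGMENT to ITEM transformation, i.e. the "black box" option. It assumes the BLOCK elements are wrapped in a single root element (the name FORMAT_VS is invented here so the fragment parses as one document), and the function name is hypothetical. It is not a DFDL solution, only a statement of what the transformation has to do.

```python
import xml.etree.ElementTree as ET

FLAGS = ("WHOLE", "FIRST", "MIDDLE", "LAST")

# The sample data from the message, wrapped in an assumed FORMAT_VS root.
SAMPLE = (
    "<FORMAT_VS>"
    "<BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>"
    "<BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>"
    "<BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>"
    "<BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>"
    "<BLOCK><SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>"
    "<SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT></BLOCK>"
    "</FORMAT_VS>"
)


def reassemble(xml_text):
    """Return the logical ITEM strings encoded by a BLOCK/SEGMENT document."""
    root = ET.fromstring(xml_text)
    items = []
    parts = None  # None: between items; a list: inside a FIRST..LAST run
    # Block boundaries do not matter for the logical items, so walk all
    # SEGMENT elements in document order.
    for seg in root.iter("SEGMENT"):
        flag = next((f for f in FLAGS if seg.find(f) is not None), None)
        data = seg.findtext("DATA", default="")
        if flag == "WHOLE" and parts is None:
            items.append(data)
        elif flag == "FIRST" and parts is None:
            parts = [data]
        elif flag == "MIDDLE" and parts is not None:
            parts.append(data)
        elif flag == "LAST" and parts is not None:
            parts.append(data)
            items.append("".join(parts))
            parts = None
        else:
            raise ValueError(f"unexpected segment flag {flag!r}")
    if parts is not None:
        raise ValueError("stream ended inside a spanned item")
    return items
```

Calling reassemble(SAMPLE) yields one string per logical item; the single `parts` variable is the whole of the state machine, and the two ValueError branches are where malformed data gets diagnosed.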


