- dfdl-wg - lists.ogf.org

RE: [dfdl-wg] Transform examples
by Myers, James D 22 Nov '04


Just a high-level comment (not sure I'd split things up the same way you have, but I won't comment on that now ...):

I'm not averse to putting something in DFDL about the expectations about length of fields, but is this useful in practice versus simply coding this in the parser you build? The parser is really the thing that will use it to calculate offsets, so some methods related to getting offsets on the transform classes are really what's needed - is that made simpler if there's info in the DFDL transform description? (And can I safely ignore this info in a dumb reference implementation that can only calculate offsets by actually parsing and then counting?)

Jim

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Monday, November 22, 2004 11:01 AM
To: Chappell, Alan R; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] Transform examples

Alan,

I looked at these examples. There's one thing I think you've overlooked in the way transforms are specified here. This is the fact that intFromBinary knows that it will pick exactly 4 bytes off the input stream, and could advertise that property to the DFDL "system" in some way, whereas intFromAscii might take anything from 1 to however many characters. E.g., it might be able to tolerate whitespace of any size, leading zeros, etc. So as a transform it needs to advertise that determining the length of the data consumed requires running the transform.

Where I'm coming from is this. It is very important that a DFDL description of data enable processing the data efficiently. To me that means that if data is all fixed width, then one should be able to randomly access fields in the data in constant time. Even if the data is variable width, one should be able to efficiently skip through it to find the boundaries without necessarily having to process all the data, convert to common format, etc.

To achieve this, transformations must support determining length and determining value separately when possible. There are these things I call "length protocols":

1) FIXED_LENGTH: the length is static in the meaning of the type. E.g., a 4-byte length is implicit in the type "int".
2) STATIC_LENGTH: the length is static as part of the element definition. E.g., a 12-digit packed decimal known from the COBOL FD. Or a string with exactly 12 characters. (Note that we ignore the implications of variable-width character encodings like UTF-8 here on purpose; more on that below.)
3) OUTSIDE_LENGTH: the length is dynamic, and comes from elsewhere. I.e., consider a stored length prefix field. We probably don't have to touch the data to skip past it, for example, though we did have to read the length field someplace to know how far to skip.
4) PARSE_LENGTH: the length is dynamic, and computing the length of the element is as hard as computing the value, so you might as well do them both simultaneously (e.g., the delimited text situation).

Now the character set issue. If the character set is fixed width, like ASCII, EBCDIC, or UTF-16, then the above apply as defined. If the data format is text and the character set is variable width, like UTF-8 or Shift-JIS, then 1, 2, and 3 all collapse into 4. I.e., all lengths require you to parse the characters one by one. However, I'd like this detail to be pushed down into the DFDL implementation because there are different ways to do it. E.g., you could do as Java does and convert everything to UTF-16 first and eliminate the whole issue, or you can try to be more clever.

I think transforms must advertise the protocols they support. E.g., intFromBinary in your example supports only the FIXED_LENGTH protocol, and it should say the length is exactly 4. intFromAscii should support protocols 2, 3, and 4. Only protocol 4 supports delimiters and their attendant complexities, like how embedded delimiters might be quoted or escaped.

This "transform" function must compute both an integer value, and also compute the length of consumed data in the underlying stream, or by side effect advance the stream to the new position. The point is not to take a position on whether we manage lengths or have a stateful cursor on the stream; the point is that there are 3 functions to provide. One is parameterized by a static length, one is parameterized by a dynamic length, and the third is parameterized by delimiters, escape sequence specifications, etc. All share the numbase parameter.

This all adds baggage, but I think it is necessary or things just can't be efficient.

...mikeb

________________________________
From: Chappell, Alan R [mailto:chappella@BATTELLE.ORG]
Sent: Friday, November 19, 2004 4:44 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] Transform examples

Third try... No zip, just the 3 files important to the simple transform example....

________________________________
From: Chappell, Alan R
Sent: Friday, November 19, 2004 1:39 PM
To: dfdl-wg(a)gridforum.org
Subject: *MJ-REJECTED* Transform examples

Second try on sending these examples. I've cut the set down to the 3 important files so hopefully it will get through this time.

________________________________
From: Chappell, Alan R
Sent: Thursday, November 18, 2004 8:47 AM
To: dfdl-wg(a)gridforum.org
Subject: *MJ-REJECTED* Transform examples

Here is the example I mentioned yesterday. Look particularly at dfdltransforms.xsd, BasicAsciiIntExp.xsd, and BasicBinIntExp.xsd. Note the "Exp" on those last two files indicates that they are expansions of the information in the original versions of those files. These make a first stab at giving a fully verbose description of the structure and the transforms, i.e., it's working towards the canonical representation Martin talked about yesterday. The "dfdltransforms" gives the definitions of transforms and their components. There are lots of things that can be improved here.

<<dfdl-examples.zip>>

Alan R. Chappell
chappella(a)battelle.org
Pacific Northwest National Laboratory
Battelle Seattle Research Center
(206) 528-3228
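
The trade-off raised in this exchange (offsets computed from advertised lengths versus offsets found by parsing and counting) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the class names, the `fixed_length` attribute, and `offset_of_field` are hypothetical and come from no DFDL draft, the binary integer is assumed to be big-endian, and the ASCII integer is assumed to be comma-terminated.

```python
import struct
from typing import Optional, Sequence, Tuple


class IntFromBinary:
    """4-byte integer (big-endian assumed); advertises its length up front."""
    fixed_length: Optional[int] = 4

    def parse(self, data: bytes, pos: int) -> Tuple[int, int]:
        (value,) = struct.unpack_from(">i", data, pos)
        return value, pos + 4


class IntFromAscii:
    """Comma-terminated decimal text; its length is known only after parsing."""
    fixed_length: Optional[int] = None

    def parse(self, data: bytes, pos: int) -> Tuple[int, int]:
        end = data.index(b",", pos)
        return int(data[pos:end].decode("ascii")), end + 1


def offset_of_field(fields: Sequence, data: bytes, index: int) -> int:
    """Return the byte offset of fields[index] within data."""
    pos = 0
    for transform in fields[:index]:
        if transform.fixed_length is not None:
            # Advertised length: skip ahead without touching the data.
            pos += transform.fixed_length
        else:
            # No advertised length: parse the field and count what it consumed.
            _, pos = transform.parse(data, pos)
    return pos


binary_record = struct.pack(">iii", 7, 8, 9)
ascii_record = b"7,0008,9,"
print(offset_of_field([IntFromBinary()] * 3, binary_record, 2))
print(offset_of_field([IntFromAscii()] * 3, ascii_record, 2))
```

A reference implementation that ignores the advertised length would simply always take the second branch, which gives the same offsets at a higher cost.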


RE: [dfdl-wg] Transform examples
by mike.beckerle＠ascentialsoftware.com 22 Nov '04


Alan,

I looked at these examples. There's one thing I think you've overlooked in the way transforms are specified here. This is the fact that intFromBinary knows that it will pick exactly 4 bytes off the input stream, and could advertise that property to the DFDL "system" in some way, whereas intFromAscii might take anything from 1 to however many characters. E.g., it might be able to tolerate whitespace of any size, leading zeros, etc. So as a transform it needs to advertise that determining the length of the data consumed requires running the transform.

Where I'm coming from is this. It is very important that a DFDL description of data enable processing the data efficiently. To me that means that if data is all fixed width, then one should be able to randomly access fields in the data in constant time. Even if the data is variable width, one should be able to efficiently skip through it to find the boundaries without necessarily having to process all the data, convert to common format, etc.

To achieve this, transformations must support determining length and determining value separately when possible. There are these things I call "length protocols":

1) FIXED_LENGTH: the length is static in the meaning of the type. E.g., a 4-byte length is implicit in the type "int".
2) STATIC_LENGTH: the length is static as part of the element definition. E.g., a 12-digit packed decimal known from the COBOL FD. Or a string with exactly 12 characters. (Note that we ignore the implications of variable-width character encodings like UTF-8 here on purpose; more on that below.)
3) OUTSIDE_LENGTH: the length is dynamic, and comes from elsewhere. I.e., consider a stored length prefix field. We probably don't have to touch the data to skip past it, for example, though we did have to read the length field someplace to know how far to skip.
4) PARSE_LENGTH: the length is dynamic, and computing the length of the element is as hard as computing the value, so you might as well do them both simultaneously (e.g., the delimited text situation).

Now the character set issue. If the character set is fixed width, like ASCII, EBCDIC, or UTF-16, then the above apply as defined. If the data format is text and the character set is variable width, like UTF-8 or Shift-JIS, then 1, 2, and 3 all collapse into 4. I.e., all lengths require you to parse the characters one by one. However, I'd like this detail to be pushed down into the DFDL implementation because there are different ways to do it. E.g., you could do as Java does and convert everything to UTF-16 first and eliminate the whole issue, or you can try to be more clever.

I think transforms must advertise the protocols they support. E.g., intFromBinary in your example supports only the FIXED_LENGTH protocol, and it should say the length is exactly 4. intFromAscii should support protocols 2, 3, and 4. Only protocol 4 supports delimiters and their attendant complexities, like how embedded delimiters might be quoted or escaped.

This "transform" function must compute both an integer value, and also compute the length of consumed data in the underlying stream, or by side effect advance the stream to the new position. The point is not to take a position on whether we manage lengths or have a stateful cursor on the stream; the point is that there are 3 functions to provide. One is parameterized by a static length, one is parameterized by a dynamic length, and the third is parameterized by delimiters, escape sequence specifications, etc. All share the numbase parameter.

This all adds baggage, but I think it is necessary or things just can't be efficient.

...mikeb

_____
From: Chappell, Alan R [mailto:chappella@BATTELLE.ORG]
Sent: Friday, November 19, 2004 4:44 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] Transform examples

Third try... No zip, just the 3 files important to the simple transform example....

_____
From: Chappell, Alan R
Sent: Friday, November 19, 2004 1:39 PM
To: dfdl-wg(a)gridforum.org
Subject: *MJ-REJECTED* Transform examples

Second try on sending these examples. I've cut the set down to the 3 important files so hopefully it will get through this time.

_____
From: Chappell, Alan R
Sent: Thursday, November 18, 2004 8:47 AM
To: dfdl-wg(a)gridforum.org
Subject: *MJ-REJECTED* Transform examples

Here is the example I mentioned yesterday. Look particularly at dfdltransforms.xsd, BasicAsciiIntExp.xsd, and BasicBinIntExp.xsd. Note the "Exp" on those last two files indicates that they are expansions of the information in the original versions of those files. These make a first stab at giving a fully verbose description of the structure and the transforms, i.e., it's working towards the canonical representation Martin talked about yesterday. The "dfdltransforms" gives the definitions of transforms and their components. There are lots of things that can be improved here.

<<dfdl-examples.zip>>

Alan R. Chappell
chappella(a)battelle.org
Pacific Northwest National Laboratory
Battelle Seattle Research Center
(206) 528-3228
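
The four length protocols and the "3 functions to provide" for intFromAscii in the message above might look roughly like the following Python sketch. This is one possible reading, not a proposed API: the method names, the `supported` attribute, the one-byte length prefix in the usage, and the big-endian byte order are all assumptions made for illustration, and quoting or escaping of delimiters is left out.

```python
import struct
from enum import Enum, auto
from typing import Optional, Tuple


class LengthProtocol(Enum):
    FIXED_LENGTH = auto()    # length implied by the type itself
    STATIC_LENGTH = auto()   # length fixed in the element definition
    OUTSIDE_LENGTH = auto()  # length supplied at run time from elsewhere
    PARSE_LENGTH = auto()    # length only discoverable by parsing


class IntFromBinary:
    supported = frozenset({LengthProtocol.FIXED_LENGTH})
    length = 4  # advertised: exactly 4 bytes

    def parse(self, data: bytes, pos: int) -> Tuple[int, int]:
        (value,) = struct.unpack_from(">i", data, pos)  # byte order assumed
        return value, self.length


class IntFromAscii:
    supported = frozenset({LengthProtocol.STATIC_LENGTH,
                           LengthProtocol.OUTSIDE_LENGTH,
                           LengthProtocol.PARSE_LENGTH})

    def __init__(self, numbase: int = 10, static_length: Optional[int] = None):
        self.numbase = numbase            # shared by all three functions
        self.static_length = static_length

    def _convert(self, text: bytes) -> int:
        return int(text.decode("ascii").strip(), self.numbase)

    def from_static_length(self, data: bytes, pos: int) -> Tuple[int, int]:
        """STATIC_LENGTH: the length comes from the element definition."""
        if self.static_length is None:
            raise ValueError("element definition carries no static length")
        n = self.static_length
        return self._convert(data[pos:pos + n]), n

    def from_outside_length(self, data: bytes, pos: int,
                            length: int) -> Tuple[int, int]:
        """OUTSIDE_LENGTH: the caller read the length somewhere else."""
        return self._convert(data[pos:pos + length]), length

    def from_delimited(self, data: bytes, pos: int,
                       delimiter: bytes = b",") -> Tuple[int, int]:
        """PARSE_LENGTH: scan for the delimiter; value and length together."""
        end = data.find(delimiter, pos)
        if end == -1:
            end, consumed = len(data), len(data) - pos
        else:
            consumed = end - pos + len(delimiter)
        return self._convert(data[pos:end]), consumed


# Each call returns (value, number of bytes consumed).
print(IntFromAscii(static_length=4).from_static_length(b"0012xyz", 0))
prefixed = b"\x03 42rest"   # hypothetical one-byte length prefix
print(IntFromAscii().from_outside_length(prefixed, 1, prefixed[0]))
print(IntFromAscii(numbase=16).from_delimited(b"ff;", 0, b";"))
```

Returning the consumed length rather than advancing a cursor is just one of the two options the message leaves open; a stateful stream would work equally well.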


RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by mike.beckerle＠ascentialsoftware.com 22 Nov '04


I agree that an extensible set of black-box stream-decoders (and complementary stream encoders) that handle: zip, encryption, VB, VBS, VS, and the other 19 or so complicated legacy formats, and so forth, is a good and completely acceptable solution to this problem.

I think DFDL should be helpful to the person who has to write such a stream-decoder/encoder for describing the physical stream format.

This is a pragmatic decision, and for something like zip/unzip I think there is little one could do which would be more efficient than this.

The rub with VS format is that most of the data is sitting there in what is very, very close to the correct logical data layout. This makes copying it all to remove that little bit of excess physical structure feel unnecessary, but I agree it is probably the right thing to do given the complexity involved in trying to avoid copying it.

...mikeb

> -----Original Message-----
> From: Steve Hanson [mailto:smh@uk.ibm.com]
> Sent: Monday, November 22, 2004 5:58 AM
> To: dfdl-wg(a)gridforum.org
> Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
>
> I wrote my previous mail fairly quickly just before I left on Friday to get something on the table. I've been thinking about this problem over the weekend and have some more thoughts which might help me get across where I am coming from.
>
> The way I view physical rep information is as functions that can be applied to types and fields. Writing the data out to a blocked/segmented format does not fall into this category. It is an orthogonal operation that applies to the whole data and as such is much more akin to encryption and compression. For example, I have a COBOL structure that ends up in an MQSeries queue and in a QSAM file. It has a logical structure, it has a physical representation. In the QSAM case a further transform has taken place to block/segment the structure. I would not expect to see the physical rep properties of the types and elements change.
>
> Mike's idea of a schema level 'stream' rep property sounds ok in principle for parsing, but what other metadata is needed when serialising? How are we informed of the rules for VB blocking or for IMS segmentation? Are they fixed or user-defined? If these rules end up requiring extra metadata at the type/element level then I am not comfortable with this, because we are mixing two sets of physical information.
>
> I think that whatever principles we apply to DFDL including/excluding encryption and compression we should also apply to these formats. What is the current proposal in this area? The cheapest option would be to provide a flexible user-defined transform capability.
>
> We can discuss more on this week's call, but it sounds like this is another of the high-level design issues to be included in the F2F agenda.
>
> Finally a correction. When I said that the broker does not support these 19 or whatever formats, I should have been more specific and said that the broker's message model does not support these. That is, we do not provide physical rep annotation support for such formats, for the reason stated above. The expectation is that the decryption/decompression/deblocking has all taken place as a separate transformation elsewhere in the broker.
>
> Regards, Steve
>
> Steve Hanson
> WebSphere Business Integration Brokers,
> IBM Hursley, England
> Internet: smh(a)uk.ibm.com
> Phone (+44)/(0) 1962-815848
>
> "Myers, James D" <jim.myers(a)pnl.gov>
> Sent by: owner-dfdl-wg@ggf.org
> 19/11/2004 17:04
> To: dfdl-wg(a)gridforum.org
> cc:
> Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
>
> I think we at least agree in practice that there's a limit on how complex a transform you'd want to code in DFDL logic. Not sure if we agree on whether it is possible.
>
> As for LR parsers - I'm not a parser guy, but I just looked at the wikipedia entry :-) :
>
> Seems like a simple enough concept - if you let me have layers, and I can use information in those layers to select choices for further processing, can you stop me from making an LR parser (or doing what an LR parser does)? I've got a stack, and choices let me specify an action table... In the same way that if you give me layers (or variables), addition, and for loops, you can't stop me from doing multiplication. And if you require those things for other reasons but don't need multiplication, you can't really talk about excluding multiplication from the language design. You can say that we won't worry about multiplication examples or how easy it is to write them down or what performance you'll get trying to run them and suggest that you plug something in to handle them directly though, and this is probably what we need to do in DFDL.
>
> I may still be missing something and there is a piece of functionality that we haven't identified a need for that would be needed for an LR parser/our pathological examples, but I guess I'm getting more convinced that our primitives are sufficiently powerful that they can be used/abused to do all of the complex things that have come up. I'm not sure how we can close the issue - specify the map from DFDL primitives to LR parser as I started to above, or find an example known to require LR parsing and work it? Or?
>
> Jim
>
> > -----Original Message-----
> > From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
> > Sent: Friday, November 19, 2004 11:36 AM
> > To: smh(a)uk.ibm.com; dfdl-wg(a)gridforum.org
> > Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
> >
> > I believe you and Jim are actually disagreeing. Jim is saying he's still optimistic that this transformation, even though complex, can be expressed directly in DFDL. You are saying this would require XSLT or a Java program or whatever to do it.
> >
> > > Mike you say you are aware of 19 such legacy formats, and I bet there are more. Well IBM's broker has no specific support for any of these, nor have we been asked to incorporate them into our message model. Maybe we should play the percentages game - if we see enough different subsystems that use the same cryptic format then it becomes worth building the support into DFDL.
> >
> > Ascential supports 6 or 7 of these formats today. Batch systems will encounter this more than online. You get them when a mainframe job writes out a tape on a mainframe, and then you read that tape on a unix tape drive either directly or first into a file. Alternatively, you pick up a mainframe file via FTP or some such and directly operate on it on other systems.
> >
> > Mainframe software handles all the VS block and such stuff in the lower layers as you know (not to mention the tape label); unix software does none of this, you just get the raw bytes.
> >
> > My point is not as much about these 19 or more particular formats, but the issue of how much complexity we go after.
> >
> > In the past we've looked at things like logical arrays with run-length-encoded representations and the suggestion has been there that DFDL might be able to directly express this transformation without need to go outside DFDL.
> >
> > I've come to believe there are certain limits to this complexity and I think perhaps tree-shape compatibility is at the core of them. Building a DFDL description for data that ultimately requires an LR(k) sophistication parser to correctly interpret the data is clearly a non-starter, it seems. Where this line is drawn is important.
> >
> > ...mikeb
> >
> > ...mikeb


RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by Myers, James D 22 Nov '04


> The way I view physical rep information is as functions that can be applied to types and fields. Writing the data out to a blocked/segmented format does not fall into this category. It is an orthogonal operation that applies to the whole data and as such is much more akin to encryption and compression. For example, I have a COBOL structure that ends up in an MQSeries queue and in a QSAM file. It has a logical structure, it has a physical representation. In the QSAM case a further transform has taken place to block/segment the structure. I would not expect to see the physical rep properties of the types and elements change.

I think we've been talking about DFDL as always going TO the XML schema and have considered the process of going FROM the XML to a new serialization as 'inverse DFDL'. Towards that end, we've discussed being able to mark transforms as invertible and/or allowing an inverse method to be registered as part of the transform definition. We also talked about the potential requirement of having multiple output streams: if I read x and y dimensions and then pixels, but my output XML model is just the pixel sequence, I will need to record x and y somewhere to allow inversion, so the user (or DFDL) might want to specify x and y in some separate 'provenance' file that could be used during inversion. I'm not sure that this is the best model, but I don't think we've come up with a good way to describe going from the XML model except as the inverse of the 'to' process.

> Mike's idea of a schema level 'stream' rep property sounds ok in principle for parsing, but what other metadata is needed when serialising? How are we informed of the rules for VB blocking or for IMS segmentation? Are they fixed or user-defined? If these rules end up requiring extra metadata at the type/element level then I am not comfortable with this, because we are mixing two sets of physical information.
>
> I think that whatever principles we apply to DFDL including/excluding encryption and compression we should also apply to these formats. What is the current proposal in this area? The cheapest option would be to provide a flexible user-defined transform capability.

We planned to have a user-defined transform capability that would appear in the same way as DFDL-standard transforms. I think one can easily put something like zip into the same format as Alan has done for the basic int from ascii, int from binary transforms, as a byte sequence to byte sequence transform. I think I'd vote for just including zip since it will be used in a number of formats, but one could imagine a user adding a de-pig-latinizer as needed. (Pig latin and things like run-length encoding are examples we've used to point out that not all compression/encryption type algorithms will run on the raw input stream - both of these require some level of parsing before you can use them - to find words or to get the <value, # of repeats> pairs from the initial bytes.)
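
As a rough illustration of a byte-sequence-to-byte-sequence transform with a registered inverse, here is a Python sketch. Everything in it is an assumption made for the example: the `ByteTransform` record and `REGISTRY` are hypothetical names, zlib's deflate stream stands in for "zip" (it is not the ZIP archive format), and the run-length layout is assumed to be one value byte followed by one repeat-count byte.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ByteTransform:
    name: str
    decode: Callable[[bytes], bytes]  # physical bytes -> logical bytes
    encode: Callable[[bytes], bytes]  # registered inverse


REGISTRY: Dict[str, ByteTransform] = {}


def register(transform: ByteTransform) -> None:
    REGISTRY[transform.name] = transform


def rle_decode(data: bytes) -> bytes:
    """Expand <value, # of repeats> pairs; the pairs must be parsed first."""
    if len(data) % 2:
        raise ValueError("truncated <value, repeats> pair")
    out = bytearray()
    for i in range(0, len(data), 2):
        value, repeats = data[i], data[i + 1]
        out.extend(bytes([value]) * repeats)
    return bytes(out)


def rle_encode(data: bytes) -> bytes:
    """Collapse runs into <value, # of repeats> pairs (at most 255 per pair)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)


# A standard transform and a user-defined one appear in the same way.
register(ByteTransform("deflate", zlib.decompress, zlib.compress))
register(ByteTransform("rle", rle_decode, rle_encode))

rle = REGISTRY["rle"]
print(rle.decode(b"a\x03b\x01"))
print(rle.encode(b"aaab"))
```

Note that decoding `b"a\x01a\x02"` and re-encoding the result gives `b"a\x03"`, not the original bytes. That is a small instance of the inversion problem described above: the original run boundaries would have to be recorded somewhere to reproduce the input exactly.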


RE: [dfdl-wg] Annotation complexity
by Myers, James D 22 Nov '04


> Myers, James D wrote:
> > I guess I'm not sure how restricting annotations to elements solves things.
>
> Sorry, I am finding it more difficult to be precise than I thought with this. What I mean here is _leaf_ elements/attributes - which I think can be defined as elements with simple type descriptions. I think if you stick to these it's unambiguous...no?

I agree, I think - if annotations are only on simple types. That would eliminate the element hierarchy, though - does it eliminate the type hierarchy? A simple element in a base type could have an annotation that gets inherited by the same simple element in a derived type. I think this would be unambiguous, but perhaps hard to find (surprise: the global DFDL default for endianness doesn't apply because some subtype three levels back had an annotation).

> I think Mike has examples of annotations that he would like higher in the tree but with this extreme position there would be no inheritance.

Except through type derivation...?

> Two other thoughts occur:
> - I think definitions applied to a data source are orthogonal to this scoping issue. We still need to understand precedence but that seems relatively easy to resolve.

Definitions on sources: I know we've talked about this (!!! - the whole outside-in or inside-out representations...), but I think the only mechanism for this in the examples is to assign a source to an element and annotate that element with, for example, endianness. For some types of sources, e.g. other layers, that have an internal structure, any annotation of the source itself would essentially become an inheritable annotation on the elements within it. So - we could try to define such a mechanism to apply to sources, but I'm not sure it would be applicable to all without recreating some of the inheritance issues.

> - Another approach would be to allow annotation locations to be specified using XPath. Since XPath can specify both specific locations and large groups of locations this could be quite powerful.

Hmmm. Yes, as long as XPaths can only be defined from a single root, i.e. not from within type definitions relative to the type (not talking about layers here). Otherwise we don't get rid of inheritance/precedence issues. At a minimum, this approach might help collect all of the defaults together at the top level for readability, but it might separate defaults from reusable types.

Jim
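
The XPath idea discussed above, with every path evaluated from a single root, could be sketched in Python as follows. This is only a sketch under assumptions: the `RULES` table, the sample data model, and the "later rule wins" ordering are invented for the example, and it uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree` rather than full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical model, written as one tree with all types inlined.
DATA_MODEL = """
<data>
  <header><count/></header>
  <triple><first/><second/><third/></triple>
</data>
"""

# (path from the single root, property, value); later rules override earlier.
RULES = [
    (".//*", "byteOrder", "bigEndian"),            # default for everything
    ("./triple/*", "byteOrder", "littleEndian"),   # a group of locations
    ("./triple/third", "byteOrder", "bigEndian"),  # one specific location
]


def resolve(root, rules):
    """Map each matched element to its resolved properties."""
    resolved = {}
    for path, prop, value in rules:
        for node in root.findall(path):
            resolved.setdefault(node, {})[prop] = value
    return resolved


root = ET.fromstring(DATA_MODEL)
for node, props in resolve(root, RULES).items():
    print(node.tag, props)
```

Because every path starts at the one root and the rules form a flat, ordered list, there is no containment or type-derivation inheritance to reason about. The cost, as noted above, is that the defaults live apart from any reusable type definitions.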


RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by Myers, James D 22 Nov '04


I think we at least agree in practice that there's a limit on how complex a transform you'd want to code in DFDL logic. Not sure if we agree on whether it is possible.

As for LR parsers - I'm not a parser guy, but I just looked at the wikipedia entry :-) :

Seems like a simple enough concept - if you let me have layers, and I can use information in those layers to select choices for further processing, can you stop me from making an LR parser (or doing what an LR parser does)? I've got a stack, and choices let me specify an action table... In the same way that if you give me layers (or variables), addition, and for loops, you can't stop me from doing multiplication. And if you require those things for other reasons but don't need multiplication, you can't really talk about excluding multiplication from the language design. You can say that we won't worry about multiplication examples or how easy it is to write them down or what performance you'll get trying to run them and suggest that you plug something in to handle them directly though, and this is probably what we need to do in DFDL.

I may still be missing something and there is a piece of functionality that we haven't identified a need for that would be needed for an LR parser/our pathological examples, but I guess I'm getting more convinced that our primitives are sufficiently powerful that they can be used/abused to do all of the complex things that have come up. I'm not sure how we can close the issue - specify the map from DFDL primitives to LR parser as I started to above, or find an example known to require LR parsing and work it? Or?

Jim

> -----Original Message-----
> From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle(a)ascentialsoftware.com
> Sent: Friday, November 19, 2004 11:36 AM
> To: smh(a)uk.ibm.com; dfdl-wg(a)gridforum.org
> Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
>
> I believe you and Jim are actually disagreeing. Jim is saying he's still optimistic that this transformation, even though complex, can be expressed directly in DFDL. You are saying this would require XSLT or a Java program or whatever to do it.
>
> > Mike you say you are aware of 19 such legacy formats, and I bet there are more. Well IBM's broker has no specific support for any of these, nor have we been asked to incorporate them into our message model. Maybe we should play the percentages game - if we see enough different subsystems that use the same cryptic format then it becomes worth building the support into DFDL.
>
> Ascential supports 6 or 7 of these formats today. Batch systems will encounter this more than online. You get them when a mainframe job writes out a tape on a mainframe, and then you read that tape on a unix tape drive either directly or first into a file. Alternatively, you pick up a mainframe file via FTP or some such and directly operate on it on other systems.
>
> Mainframe software handles all the VS block and such stuff in the lower layers as you know (not to mention the tape label); unix software does none of this, you just get the raw bytes.
>
> My point is not as much about these 19 or more particular formats, but the issue of how much complexity we go after.
>
> In the past we've looked at things like logical arrays with run-length-encoded representations and the suggestion has been there that DFDL might be able to directly express this transformation without need to go outside DFDL.
>
> I've come to believe there are certain limits to this complexity and I think perhaps tree-shape compatibility is at the core of them. Building a DFDL description for data that ultimately requires an LR(k) sophistication parser to correctly interpret the data is clearly a non-starter, it seems. Where this line is drawn is important.
>
> ...mikeb
>
> ...mikeb


RE: [dfdl-wg] Annotation complexity
by mike.beckerle＠ascentialsoftware.com 19 Nov '04


In my experience the rep properties fall in two classes with respect to being inherited by contained elements.

Type 1: e.g., byteOrder - applies to and is inherited by the contained scope. Can be overridden by a more local definition.

Type 2: e.g., delimiters (terminators and separators), or alignment - applies only to the element or grouping structure to which it is attached.

Example: Consider a CSV file. Each row is terminated by a CRLF. A comma separates the fields of a row. These fields are "contained" within the row. But the fields are NOT terminated by a CRLF, so we don't want to inherit that terminator def down the containment hierarchy.

At some point we'll have to categorize all the rep properties into these two classes.

...mikeb

_____
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 4:13 PM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] Annotation complexity

I think I get the issue, but not necessarily the proposed solution. I can have an element with a complex type that contains other elements with complex types and I might want params such as littleendian to follow that hierarchy, independent of the type derivation hierarchy. Are you saying that inheritance should never flow down the 'contains' hierarchy?

Jim

PS. After two days of this in December, I may need help in getting myself to the airport ... ;-)

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Suman Kalia
Sent: Friday, November 19, 2004 3:47 PM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] Annotation complexity

You are beginning to see the complexities of overrides in this very simple example. Consider a complex type hierarchy which is four/five levels deep and then determine which override is applicable and where in the tree.

I briefly mentioned in the call on Wednesday that we should carefully determine which annotations are applicable for which constructs of XML schema and try to avoid the override mechanism.

Annotations that exist on the element belong only to the element; there is no inheritance or override issue here. Annotations that exist on structural constructs (group/complex types etc.) truly belong to the structure only (such as data element separators and delimiters, which cannot be associated with elements because elements could be reused via element Ref in other structures where they could have a different delimiter in that structure etc.).

Once we have separated the types and elements, annotations defined on derived types can follow the well-established rules of inheritance, i.e., inherit the annotation from the parent unless explicitly overridden etc.

Then comes the issue of defaults - where to locate and apply. Possible options are
a) top level type (which for example could be the type corresponding to the 01 level COBOL structure)
b) A separate structure available at tooling and runtime which contains the defaults.
We used the latter in our implementation.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 03:04 PM -----

"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 02:25 PM
To: dfdl-wg(a)gridforum.org
cc:
Subject: RE: [dfdl-wg] Annotation complexity

I guess I'm not sure how restricting annotations to elements solves things. I think I can recreate the problems in Martin's examples without putting annotations on types:

The issue of it being hard to understand that triple overrides the dfdlfromstrings param would seem to be the same whether the triple type has an annotation or if some subelements within it get annotations (either first, second and third, or consider a triple type that specifies an annotated element containing those three). In all these cases, it is clear that you have to walk down the logical hierarchy, which is broken into parts in the dfdl/xsd file, and keep a stack of contexts if we allow any default/scoped annotations.

If annotations are allowed on both types and elements, what I find even more difficult are situations where the triple type has one default, and the element in "data" with that type has an annotation specifying the opposite param value. Do we consider the element to be above the type in the scope hierarchy?

For more fun, what if triple is derived from some other type where the annotation is defined? Would an annotation on the "data" element be inherited by the sub-element of type triple, or would the inheritance from the triple base type win (i.e. neither the element of the type triple nor the triple type itself is directly annotated)? (Or consider an annotation on the "first" element defined in the base type for triple rather than on the base type directly - does the element annotation inherited from the type hierarchy trump the one from the element hierarchy?)

An attempt at a picture where only elements have annotations:

Element A: param=littleendian
    SubElement B: type ST
Type T:
    SubElement C: param=bigendian
Type ST: subtype of T

What is the param value of element C at A/B/C?

I guess I see a need to keep some hierarchically scoped defaults (a file that has some ascii info and then a base64 encoded section of littleendian stuff), but xsd makes it hard to define a single hierarchy. Perhaps some rule of precedence - resolve annotations from type to subtype first, then push those onto the stack of element scopes - would make things unambiguous, if not user friendly.
Jim > -----Original Message----- > From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On > Behalf Of Martin Westhead > Sent: Friday, November 19, 2004 1:38 PM > To: Martin Westhead > Cc: dfdl-wg(a)gridforum.org > Subject: Re: [dfdl-wg] Annotation complexity > > > Sorry the elements in the triple were all supposed to be of a simple > type e.g.: > > <xs:complexType name="triple"> > <xs:annotation> > <xs:appinfo> > <dfdlFromBinary/> > </xs:appinfo> > </xs:annotation> > <xs:sequence> > <xs:element name="first" type="xs:int"/> > <xs:element name="second" type="xs:int"/> > <xs:element name="third" type="xs:int"/> > </xs:sequence> > </xs:complexType> > > > <xs:complexType name="data"> > <xs:annotation> > <xs:appinfo> > <dfdlFromStrings/> > </xs:appinfo> > </xs:annotation> > <xs:sequence> > <xs:element name="triple"/> > </xs:sequence> > </xs:complexType> > > Martin Westhead wrote: > > Hi, > > > > I think I understand Suman's issue with annotations on the Schema > > tree. > > (Please Suman tell me if I am right here). The problem is, that > > lexically there are many trees in an XSD. Whilst in > practice these can > > clearly be considered as a single tree (including, I think, > even the > > simple type hierarchies) by placing all the type > definitions inline, > > this is not the way they appear to the user. 
So for example > if I have a > > file with conflicting annotations looking like: > > > > <xs:complexType name="triple"> > > <xs:annotation> > > <xs:appinfo> > > <dfdlFromBinary/> > > </xs:appinfo> > > </xs:annotation> > > <xs:sequence> > > <xs:element name="first" type="xs:int"/> > > <xs:element name="second"/> > > <xs:element name="third"/> > > </xs:sequence> > > </xs:complexType> > > > > > > <xs:complexType name="data"> > > <xs:annotation> > > <xs:appinfo> > > <dfdlFromStrings/> > > </xs:appinfo> > > </xs:annotation> > > <xs:sequence> > > <xs:element name="triple"/> > > </xs:sequence> > > </xs:complexType> > > > > So what I imagined is that we would assume that the "triple" type is > > considered _inside_ the scope of the "data" type and so the > > "dfdlFromBinary" tag wins. > > > > On the other hand the user sees two trees of equal depth with > > conflicting annotations. The examples can obviously get much more > > intricate. > > > > The issue is really that the scope of the annotations is > not lexically > > defined. At some level this is just like having globally included > > variables in a programming language. On the other hand we > have arbitrary > > levels of these. > > > > Suman is this the problem? > > > > If this is the problem, and we agree that it is too confusing to the > > user (my opinion is still out on this). Then I see that the > conclusion > > is to adopt an approach similar to IBM's that annotations > can appear > > only on <element> and <attribute> tags. Even the top level > of the file > > is confusing since there may be many files involved. I > guess we can also > > have runtime defaults and default settings set in the > standard. I don't > > like this conclusion incidentally, can someone convince me > it is the > > wrong one? > > > > Martin > > > > > > > > > > > > > > > > > >
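The two classes of rep property described at the top of this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `effective` function and the contents of the `INHERITED` and `LOCAL_ONLY` sets are hypothetical names chosen here, and the thread itself says the real categorization still has to be worked out.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical classification, taken only from the examples in the message.
INHERITED = {"byteOrder"}                              # Type 1: flows down containment
LOCAL_ONLY = {"terminator", "separator", "alignment"}  # Type 2: stays where attached


@dataclass
class Node:
    """An element or grouping structure in the containment hierarchy."""
    name: str
    props: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def effective(node: Node, prop: str):
    """Return the value of `prop` in effect at `node`, or None if unset."""
    if prop in node.props:
        return node.props[prop]      # a local definition always applies
    if prop not in INHERITED:
        return None                  # Type 2: never consult the containers
    current = node.parent
    while current is not None:       # Type 1: nearest enclosing definition wins
        if prop in current.props:
            return current.props[prop]
        current = current.parent
    return None


# The CSV example: rows end in CRLF, fields are separated by commas.
csv_file = Node("file", {"byteOrder": "bigEndian"})
row = Node("row", {"terminator": "\r\n", "separator": ","}, parent=csv_file)
cell = Node("field", parent=row)

print(effective(cell, "terminator"))  # None
print(effective(cell, "byteOrder"))   # bigEndian
```

The field does not pick up the row's CRLF terminator, while byteOrder set on the file still reaches it.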

Fw: [dfdl-wg] Annotation complexity
by Suman Kalia 19 Nov '04

You are beginning to see the complexities of overrides in this very simple example. Consider a complex type hierarchy which is four/five levels deep and then determine which override is applicable and where in the tree. I briefly mentioned in the call on Wednesday that we should carefully determine which annotations are applicable for which constructs of XML schema and try to avoid the override mechanism.

Annotations that exist on the element belong only to the element; there is no inheritance or override issue here. Annotations that exist on structural constructs (group/complex types etc.) truly belong to the structure only (such as data element separators and delimiters, which cannot be associated with elements because elements could be reused via element Ref in other structures where they could have a different delimiter in that structure etc.). Once we have separated the types and elements, annotations defined on derived types can follow the well established rules of inheritance, i.e. inherit the annotation from the parent unless explicitly overridden etc.

Then comes the issue of defaults - where to locate and apply. Possible options are:
a) top level type (which for example could be the type corresponding to an 01 level COBOL structure)
b) A separate structure available at tooling and runtime which contains the defaults.
We used the latter in our implementation.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 03:04 PM -----
"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 02:25 PM
To dfdl-wg(a)gridforum.org
cc
Subject RE: [dfdl-wg] Annotation complexity

I guess I'm not sure how restricting annotations to elements solves things.
I think I can recreate the problems in Martin's examples without putting annotations on types: The issue of it being hard to understand that triple overrides the dfdlfromstrings param would seem to be the same whether the triple type has an annotation or if some subelements within it get annotations (either first, second and third, or consider a triple type that specifies an annotated element containing those three). In all these cases, it is clear that you have to walk down the logical hierarchy, which is broken into parts in the dfdl/xsd file, and keep a stack of contexts if we allow any default/scoped annotations.

If annotations are allowed on both types and elements, what I find even more difficult are situations where the triple type has one default, and the element in "data" with that type has an annotation specifying the opposite param value. Do we consider the element to be above the type in the scope hierarchy? For more fun, what if triple is derived from some other type where the annotation is defined? Would an annotation on the "data" element be inherited by the sub-element of type triple, or would the inheritance from the triple base type win (i.e. neither the element of the type triple nor the triple type itself is directly annotated)? (Or consider an annotation on the "first" element defined in the base type for triple rather than on the base type directly - does the element annotation inherited from the type hierarchy trump the one from the element hierarchy?)

An attempt at a picture where only elements have annotations:

Element A : param=littleendian
    SubElement B: type ST
Type T:
    SubElement C: param: bigendian
Type ST: subtype of T

What is the param value of element C at A/B/C?

I guess I see a need to keep some hierarchically scoped defaults (a file that has some ascii info and then a base64 encoded section of littleendian stuff), but xsd makes it hard to define a single hierarchy.
Perhaps some rule of precedence - resolve annotations from type to subtype first, then push those onto the stack of element scopes - would make things unambiguous, if not user friendly.

Jim

> -----Original Message-----
> From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Martin Westhead
> Sent: Friday, November 19, 2004 1:38 PM
> To: Martin Westhead
> Cc: dfdl-wg(a)gridforum.org
> Subject: Re: [dfdl-wg] Annotation complexity
>
> Sorry the elements in the triple were all supposed to be of a simple
> type e.g.:
>
> <xs:complexType name="triple">
>   <xs:annotation>
>     <xs:appinfo>
>       <dfdlFromBinary/>
>     </xs:appinfo>
>   </xs:annotation>
>   <xs:sequence>
>     <xs:element name="first" type="xs:int"/>
>     <xs:element name="second" type="xs:int"/>
>     <xs:element name="third" type="xs:int"/>
>   </xs:sequence>
> </xs:complexType>
>
> <xs:complexType name="data">
>   <xs:annotation>
>     <xs:appinfo>
>       <dfdlFromStrings/>
>     </xs:appinfo>
>   </xs:annotation>
>   <xs:sequence>
>     <xs:element name="triple"/>
>   </xs:sequence>
> </xs:complexType>
>
> Martin Westhead wrote:
> > Hi,
> >
> > I think I understand Suman's issue with annotations on the Schema
> > tree. (Please Suman tell me if I am right here). The problem is, that
> > lexically there are many trees in an XSD. Whilst in practice these can
> > clearly be considered as a single tree (including, I think, even the
> > simple type hierarchies) by placing all the type definitions inline,
> > this is not the way they appear to the user.
> > So for example if I have a file with conflicting annotations looking like:
> >
> > <xs:complexType name="triple">
> >   <xs:annotation>
> >     <xs:appinfo>
> >       <dfdlFromBinary/>
> >     </xs:appinfo>
> >   </xs:annotation>
> >   <xs:sequence>
> >     <xs:element name="first" type="xs:int"/>
> >     <xs:element name="second"/>
> >     <xs:element name="third"/>
> >   </xs:sequence>
> > </xs:complexType>
> >
> > <xs:complexType name="data">
> >   <xs:annotation>
> >     <xs:appinfo>
> >       <dfdlFromStrings/>
> >     </xs:appinfo>
> >   </xs:annotation>
> >   <xs:sequence>
> >     <xs:element name="triple"/>
> >   </xs:sequence>
> > </xs:complexType>
> >
> > So what I imagined is that we would assume that the "triple" type is
> > considered _inside_ the scope of the "data" type and so the
> > "dfdlFromBinary" tag wins.
> >
> > On the other hand the user sees two trees of equal depth with
> > conflicting annotations. The examples can obviously get much more
> > intricate.
> >
> > The issue is really that the scope of the annotations is not lexically
> > defined. At some level this is just like having globally included
> > variables in a programming language. On the other hand we have arbitrary
> > levels of these.
> >
> > Suman is this the problem?
> >
> > If this is the problem, and we agree that it is too confusing to the
> > user (my opinion is still out on this). Then I see that the conclusion
> > is to adopt an approach similar to IBM's that annotations can appear
> > only on <element> and <attribute> tags. Even the top level of the file
> > is confusing since there may be many files involved. I guess we can also
> > have runtime defaults and default settings set in the standard. I don't
> > like this conclusion incidentally, can someone convince me it is the
> > wrong one?
> >
> > Martin

RE: [dfdl-wg] Annotation complexity
by mike.beckerle＠ascentialsoftware.com 19 Nov '04

I don't see the problem here.

> An attempt at a picture where only elements have annotations:
>
> Element A : param=littleendian
>     SubElement B: type ST
> Type T:
>     SubElement C: param:bigendian
>
> Type ST: subtype of T
>
> What is the param value of element C at A/B/C?

I propose this reasoning. Local element definition has highest precedence. Then type of element including subtype-based inheritance. Then lexical scope. So the param value of element C is bigEndian. To me this is clearly correct.

However, I think the type names T and ST better reflect that these include representation information otherwise the user writing the XSD for A with subelement B will be surprised. E.g., Type T could be MainframeCobolComp3 and type ST could be MainframeCobolComp3_8_2 meaning with restrictions to 8 digits and 2 fractional digits. The point is that these type names should reflect that these types include representation matters, otherwise why wouldn't someone use them in a context where they think they still have control of the byteOrder.

...mikeb
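The precedence order proposed in this message (local element definition, then the element's type including subtype inheritance, then lexical scope) can be tried out on the A/B/C picture with a small Python sketch. Everything named here (`TypeDef`, `Element`, `resolve`, the anonymous type given to A) is a hypothetical construction for illustration, and reading "lexical scope" as "the chain of enclosing elements, innermost first" is an assumption, not something the message spells out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TypeDef:
    name: str
    base: Optional["TypeDef"] = None                 # supertype, if any
    props: dict = field(default_factory=dict)        # annotations on the type
    children: dict = field(default_factory=dict)     # child name -> Element


@dataclass
class Element:
    name: str
    type: Optional[TypeDef] = None
    props: dict = field(default_factory=dict)        # annotations on the element


def type_chain(t):
    """Yield a type, then its base types, most derived first."""
    while t is not None:
        yield t
        t = t.base


def find_child(t, name):
    """Find a child element declared on a type or on any of its base types."""
    for td in type_chain(t):
        if name in td.children:
            return td.children[name]
    raise KeyError(name)


def resolve(root, path, prop):
    """Resolve `prop` for the element reached from `root` via `path`."""
    scopes = [root]
    for name in path:
        scopes.append(find_child(scopes[-1].type, name))
    for element in reversed(scopes):                 # innermost scope first
        if prop in element.props:                    # 1. local element definition
            return element.props[prop]
        for td in type_chain(element.type):          # 2. type, then its supertypes
            if prop in td.props:
                return td.props[prop]
    return None                                      # 3. no enclosing scope set it


# The picture from the quoted message.
c = Element("C", props={"param": "bigendian"})
t = TypeDef("T", children={"C": c})
st = TypeDef("ST", base=t)
b = Element("B", type=st)
a = Element("A", type=TypeDef("A_anonymous", children={"B": b}),
            props={"param": "littleendian"})

print(resolve(a, ["B", "C"], "param"))  # bigendian
```

If the annotation on C is removed, the same walk falls back to A's littleendian, which is the surprise the message warns about when type names hide representation details.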

Fw: [dfdl-wg] Annotation complexity
by Suman Kalia 19 Nov '04

Yes -- the annotations defined on the element stay on the element and do not follow the 'contains' hierarchy. This is not ideal but reduces the complexity of overrides between elements and types.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com

----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 04:22 PM -----
"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 04:12 PM
To dfdl-wg(a)gridforum.org
cc
Subject RE: [dfdl-wg] Annotation complexity

I think I get the issue, but not necessarily the proposed solution. I can have an element with a complex type that contains other elements with complex types and I might want params such as littleendian to follow that hierarchy, independent of the type derivation hierarchy. Are you saying that inheritance should never flow down the 'contains' hierarchy?

Jim

PS. After two days of this in December, I may need help in getting myself to the airport ... ;-)

-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Suman Kalia
Sent: Friday, November 19, 2004 3:47 PM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] Annotation complexity

You are beginning to see the complexities of overrides in this very simple example. Consider a complex type hierarchy which is four/five levels deep and then determine which override is applicable and where in the tree. I briefly mentioned in the call on Wednesday that we should carefully determine which annotations are applicable for which constructs of XML schema and try to avoid the override mechanism. Annotations that exist on the element belong only to the element; there is no inheritance or override issue here.
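For contrast, the answer at the top of this message (element annotations stay on the element, type annotations follow type derivation only, everything else comes from a separate defaults structure) might be sketched in Python as follows. The class names, the `lookup` function and the contents of `DEFAULTS` are hypothetical and chosen only for illustration; nothing here describes the actual IBM implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical defaults table, standing in for the "separate structure
# available at tooling and runtime" mentioned earlier in the thread.
DEFAULTS = {"param": "bigendian"}


@dataclass
class TypeDef:
    name: str
    base: Optional["TypeDef"] = None
    props: dict = field(default_factory=dict)


@dataclass
class Element:
    name: str
    type: Optional[TypeDef] = None
    props: dict = field(default_factory=dict)


def lookup(element, prop, defaults=DEFAULTS):
    """Resolve `prop` without ever consulting the containing elements."""
    if prop in element.props:          # annotation on the element itself
        return element.props[prop]
    t = element.type
    while t is not None:               # type derivation: subtype, then base types
        if prop in t.props:
            return t.props[prop]
        t = t.base
    return defaults.get(prop)          # otherwise the shared defaults apply


t = TypeDef("T")
st = TypeDef("ST", base=t)
a = Element("A", props={"param": "littleendian"})
b = Element("B", type=st)              # contained in A, but that is never consulted

print(lookup(a, "param"))  # littleendian
print(lookup(b, "param"))  # bigendian
```

The lookup takes no parent argument at all, so A's littleendian cannot reach B; that is the trade-off described above as "not ideal" but simpler to reason about.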
