split into multiple topics - Re: [dfdl-wg] Issues: additional data types - dfdl-wg


Mike Beckerle

2 Sep 2005

8:34 p.m.

I'd like to split this topic into several distinct ones:

Arrays - I have a placeholder for this in the doc.

Opaque and "code" types are separate. This is related also to the concept of "open content".

Enums

Bitfields

Pointers

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA

"Robert E. McGrath" <mcgrath@ncsa.uiuc.edu>
Sent by: owner-dfdl-wg@ggf.org
09/02/2005 03:13 PM
To dfdl-wg@gridforum.org
cc
Subject [dfdl-wg] Issues: additional data types

Greetings,

Here is an "issue" for the DFDL: additional data types that should be considered.

Please see attached.

--- Robert E. McGrath
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Champaign, Illinois 61820
(217)-333-6549
mcgrath@ncsa.uiuc.edu

Attachments:

attachment.html (text/html — 1.9 KB)
DT.htm (text/html — 11.9 KB)


Mike Beckerle

6 Sep 2005

3:48 p.m.

New subject: Arrays issue - Re: [dfdl-wg] Issues: additional data types

The need for an approach to arrays is clear and is acute to many DFDL constituencies.

The first step in any approach to arrays for DFDL is an XML model for array data and an XSD for describing it. Then DFDL can put properties on this.

I suggest the following model. Consider a 2-d case. This will generalize to N dimensions. Each axis is named. The array itself is represented as elements, with attributes used to identify the position of the value on each axis, conceptually like so:

    <a x="5" y="-2">51</a>

That is, you think of each array element as having attributes identifying its position in the array. Of course DFDL allows data to be processed without ever creating elements like that, so this is a conceptual model only, particularly for a dense array.

That element is of an array named 'a', at position x=5, y=-2, having value 51.

The declaration in XSD would be like this:

    <element name="a" maxOccurs="unbounded">
      <complexType>
        <simpleContent>
          <extension base="int">
            <attribute name="x">
              <simpleType>
                <restriction base="int">
                  <maxInclusive value="5"/>
                  <minInclusive value="-5"/>
                </restriction>
              </simpleType>
            </attribute>
            <attribute name="y">
              <simpleType>
                <restriction base="int">
                  <maxInclusive value="10"/>
                  <minInclusive value="-10"/>
                </restriction>
              </simpleType>
            </attribute>
          </extension>
        </simpleContent>
      </complexType>
    </element>

Notice how the ranges of the index values are captured in XSD by use of the simple type restriction, and can cover arbitrary sections of the integer space, including negative indices.

DFDL would then provide properties for

1) declaring that 'a' is an array and that 'x' and 'y' are array indices (and therefore do not have values stored anywhere in the data).

2) declaring the storage-order of the array. This can be an ordered list of the dimension names. E.g., "x y" or "y x" depending on which index changes fastest in the storage ordering.

Access to elements would be by XPath expressions like this:

    ..../a[@x='5' and @y='-2']

Processors would recognize that x and y are array indices based on DFDL annotations and would thereby recognize predicates involving the indices and treat them specially. For example, we could preclude slicing arrays like this:

    ..../a[@x='0']

that is, where the 'y' axis is unconstrained.
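A minimal sketch, in Python, of how a processor might turn the index values of this model into a position in the stored data. The function name `linear_offset` and the convention that the last-named dimension in the storage-order list varies fastest are assumptions made for illustration; the proposal itself does not fix either.

```python
from typing import Dict, Sequence, Tuple

# Index ranges taken from the XSD restrictions in the example: x in [-5, 5], y in [-10, 10].
BOUNDS: Dict[str, Tuple[int, int]] = {"x": (-5, 5), "y": (-10, 10)}


def linear_offset(index: Dict[str, int],
                  bounds: Dict[str, Tuple[int, int]],
                  storage_order: Sequence[str]) -> int:
    """Return the element position (counted in elements, from 0) of `index`.

    Assumption: the last axis named in `storage_order` changes fastest.
    """
    offset = 0
    for axis in storage_order:
        lo, hi = bounds[axis]
        value = index[axis]
        if not lo <= value <= hi:
            raise IndexError(f"{axis}={value} is outside [{lo}, {hi}]")
        extent = hi - lo + 1
        # Shift the running offset by this axis' extent, then add the
        # zero-based position along the axis.
        offset = offset * extent + (value - lo)
    return offset


position_xy = linear_offset({"x": 5, "y": -2}, BOUNDS, ["x", "y"])
position_yx = linear_offset({"x": 5, "y": -2}, BOUNDS, ["y", "x"])
```

Because the index attributes are derived from position alone, nothing but the 231 element values would need to be stored for this array.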

Robert E. McGrath

4:31 p.m.

New subject: Arrays issue - Re: [dfdl-wg] Issues: additional data types

Yes, this is one way to do arrays. This approach emphasizes the use case where it is important to access individual elements via XML.

There are two obvious downsides:

1. space: this will be >10 times the storage of the actual numbers. A big problem for many cases.

2. array algorithms (e.g., scatter-gather, transpose) do block operations which are totally ugly in this markup.

A variant of this might mark up parts of the array, e.g., each row.

Two other general approaches can be considered:

Array as blob: markup says 'this is an array, laid out like so', data is a big blob. (Probably this is what Jim is talking about)

Array as external blob: same as above, except payload is a URL, e.g., to OpenDAP server where the data is. (Ideal for "virtual datasets")

The memo I was working on tries to lay these options out with the advantages and disadvantages.

--- Robert E. McGrath
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Champaign, Illinois 61820
(217)-333-6549
mcgrath@ncsa.uiuc.edu

Jim Myers

4:44 p.m.

New subject: Arrays issue - Re: [dfdl-wg] Issues: additional data types

...

Array as blob: markup says 'this is an array, laid out like so', data is a big blob. (Probably this is what Jim is talking about)

Yes

...

Array as external blob: same as above, except payload is a URL, e.g., to OpenDAP server where the data is. (Ideal for "virtual datasets")

I definitely want to see data virtualization services based on DFDL, but I don't think we need do anything specific in the language to provide externally available access methods for parts of the described data set.

Jim

Mike Beckerle

7:48 p.m.

New subject: Arrays issue - Re: [dfdl-wg] Issues: additional data types

I need an example of this blobby array scheme. To me this concept makes no sense.

We're embedded in XSD's type system, so if you can talk about arrays at all, then there has to be a logical model for them, and that means a way to talk about them in XSD independent of DFDL annotations.

We COULD propose an extension to XSD for arrays. Is that what you mean?

...mikeb

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA

Mike Beckerle

4:57 p.m.

New subject: Arrays issue - Re: [dfdl-wg] Issues: additional data types

re: Space - a space penalty only occurs if your DFDL implementation actually converts the data into XML. My personal plans for DFDL would do none of that. You would incur zero space penalty.

I want to reemphasize here that the "index attributes" x and y in my example would take up exactly zero space. They have no representation. Their values are inferred from the position of the elements of the array.

re: algorithms - DFDL doesn't address APIs for access to data at all. There's nothing stopping someone from making array access appear in a programming language exactly the way it appears in C, Fortran, or Java or any other language today. E.g.,

    Array a = ...getArrayFromDFDL(".../a"); // establish correspondence between Java array 'a', and DFDL-described array reachable via path '..../a'.
    int value = a(5, -2); // retrieve the element at these index locations

If you really want to express transformations "in this markup", i.e., as if the data had been converted to XML, then I'm unclear why XPath/XQuery would make the algorithms particularly ugly. Use of XPath/XQuery to address elements would be very similar to basic index-oriented access in a programming language.

...mike

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA
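To make the "zero space penalty" point concrete, here is a hedged Python sketch of an accessor that reads elements directly out of packed binary data and never materialises XML. The class `ArrayView`, the little-endian 32-bit element format, and the x-major layout are all assumptions chosen for illustration; they are not a DFDL API.

```python
import struct


class ArrayView:
    """Index-based access to a dense 2-d array stored as packed integers.

    Only the element values are stored; the x and y indices are inferred
    from each element's position in the buffer.
    """

    def __init__(self, data: bytes, x_range=(-5, 5), y_range=(-10, 10), fmt="<i"):
        self.data = data
        self.x_lo, self.x_hi = x_range
        self.y_lo, self.y_hi = y_range
        self.fmt = fmt                        # assumed: little-endian 32-bit int
        self.item_size = struct.calcsize(fmt)

    def get(self, x: int, y: int) -> int:
        if not (self.x_lo <= x <= self.x_hi and self.y_lo <= y <= self.y_hi):
            raise IndexError(f"({x}, {y}) is out of range")
        row_width = self.y_hi - self.y_lo + 1
        # Assumed layout: x-major, so y changes fastest.
        position = (x - self.x_lo) * row_width + (y - self.y_lo)
        (value,) = struct.unpack_from(self.fmt, self.data, position * self.item_size)
        return value


# Demo buffer: 11 * 21 = 231 elements, where element k simply holds the value k.
demo_data = struct.pack("<231i", *range(231))
view = ArrayView(demo_data)
value = view.get(5, -2)
```

The buffer here is 924 bytes (231 elements of 4 bytes each), with no bytes spent on the indices.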

Robert E. McGrath

7:55 p.m.

New subject: Arrays issue - Re: [dfdl-wg] Issues: additional data types

On Tue, 6 Sep 2005, Mike Beckerle wrote:

In case there is any misunderstanding: I think Mike's approach is fine. It focuses on what I think is the main issue, which is how to talk about the indexing. IMO, the main point of an array is that it is a blob with an index scheme.

...

re: Space - a space penalty only occurs if your DFDL implementation actually converts the data into XML. My personal plans for DFDL would do none of that. You would incur zero space penalty. I want to reemphasize here that the "index attributes" x and y in my example would take up exactly zero space. They have no representation. Their values are inferred from the position of the elements of the array.

Sure. But sometimes people need to pass the data as XML, and if they do, it's going to be an issue.

...

re: algorithms - DFDL doesn't address APIs for access to data at all.

My earlier email had several unstated assumptions. I'm approaching this from a grid perspective, especially, service composition. The 'algorithms' I'm thinking of are what a client does to construct a request for specific elements of an array from a server, and what the server must do to provide the data in the way the requester wants it. I'm assuming DFDL is used in this protocol.

Mike Beckerle

4:08 p.m.

New subject: Pointers - was: Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types

There are two topics mixed together under "pointers" in my mind.

1) data contains a "pointer" or "address" in a given location, and we want to abstractly describe things like how big it is. For example, if my file format comes from a C language struct containing:

    int x;
    char* y;

how big is the char* element 'y' anyway? Could be 4 bytes, could be 8 bytes.

2) reconstructing the pointer relationships within data. That is, the data is conceptually a graph of objects with pointers to each other. We want not only to access these pointers within the data but be able to traverse them in order to reference other objects within the data.

Now (1) in the absence of (2) is a matter of just expressing how big the data we'll be ignoring and skipping over is. This is pretty easy to resolve.

(2) is trickier. I suggest a proposal for pointers in DFDL should begin with an analysis of the approaches to pointers in XML and XSD, in particular

ID, IDREF - in basic XML

unique, key, keyref - that is, what XSD calls "Identity Constraints"

I'd like to see annotations to unique, key, and keyref allowing these logical XSD concepts to be mapped into addresses and pointers within the data.

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA

Robert E. McGrath

8:50 p.m.

New subject: Pointers - was: Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types

On Tue, 6 Sep 2005, Mike Beckerle wrote:

...

(2) is trickier. I suggest a proposal for pointers in DFDL should begin with an analysis of the approaches to pointers in XML and XSD, in particular

...

ID, IDREF - in basic XML

These are only used in attributes; they point to named entities.

...

unique, key, keyref - that is, what XSD calls "Identity Constraints"

These operate on XPaths.

...

I'd like to see annotations to unique, key, and keyref allowing these logical XSD concepts to be mapped into addresses and pointers within the data.

So how would this work?

I'm thinking of something like a list (index). A C data structure is something like:

    struct thing {
        struct thing * next;
        struct blob payload;
    }

In the file, there are a bunch of records (not necessarily contiguous!), each with a pointer to the next one and a blob.

    next -> next -> NULL
    blob1   blob2

How would you want to describe this data?
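A minimal sketch, in Python, of the data this question is about: records scattered through a file, each holding a "next" pointer and a payload. The concrete layout is an assumption made only for illustration: the pointer is stored as a 4-byte little-endian file offset, `0xFFFFFFFF` stands in for NULL, and the payload is a fixed 5 bytes. None of these choices come from the thread.

```python
import struct

NULL = 0xFFFFFFFF                 # assumed sentinel for a NULL "next" pointer
PAYLOAD_SIZE = 5                  # assumed fixed payload size
RECORD = struct.Struct(f"<I{PAYLOAD_SIZE}s")   # next-offset, then payload


def walk(data: bytes, head: int):
    """Yield payloads by following the stored offsets from `head` to NULL."""
    seen = set()
    offset = head
    while offset != NULL:
        if offset in seen:
            raise ValueError(f"cycle detected at offset {offset}")
        seen.add(offset)
        next_offset, payload = RECORD.unpack_from(data, offset)
        yield payload
        offset = next_offset


def build_example():
    """Two non-contiguous records: the first at offset 20, the second at offset 3."""
    buffer = bytearray(40)
    RECORD.pack_into(buffer, 20, 3, b"blob1")     # head record, points at offset 3
    RECORD.pack_into(buffer, 3, NULL, b"blob2")   # last record
    return bytes(buffer), 20


data, head = build_example()
payloads = list(walk(data, head))
```

A description language would need to say at least what the sketch hard-codes: the size and encoding of the pointer, what it is relative to, and which value means "no next record".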

Mike Beckerle

4:40 p.m.

New subject: Enums - Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types

About enums. Here's starting thoughts:

Here's a real-world example from COBOL:

    01 AS-CPST-REC.
       06 AS-CPCOM.
          09 AS-COM-STORE-TYPE PIC X.
          09 AS-COM-STORE-NO PIC 9(05).
          09 AS-COM-TRAN-ID PIC X(04).
             88 TRAN-COUPON VALUE 'CP80'.
             88 TRAN-REVENUE VALUE 'IC40' 'RA40'.
             88 TRAN-SALES VALUE 'IC40'.
             88 TRAN-DELIVER VALUE 'IC44'.
             88 TRAN-RENTS VALUE 'RA40' 'RA42'.
             88 TRAN-RENT-RETURN VALUE 'RA41'.
          09 AS-COM-QUANTITY PIC S9(05).
          09 AS-COM-PART-NO PIC 9(06).
          .... more fields elided ....

Those "88" entries in there are enumerated constants. Note that for the TRAN-REVENUE and TRAN-RENTS constants, multiple values are associated with the same name. On reference this means that the constant matches either value. When written, this means the first value is used. COBOL doesn't strongly associate these enumerated values with the field to which they can be assigned, but usually it's obvious. In this case it is the AS-COM-TRAN-ID field, the 4-character-long string, which has the string constants associated with it.

This particular example is a common one. The record has variant structure (not shown above) depending on the tag field, which is this AS-COM-TRAN-ID field.

Working out how we want this example to work in XSD is the first step.

I like using a hidden field here. Here's a possible idea for how this is represented in XSD:

    <xs:element name="AS-COM-TRAN-ID">
      <xs:simpleType>
        <xs:restriction base="xs:NCName">
          <xs:enumeration value="TRAN-COUPON"/>
          <xs:enumeration value="TRAN-REVENUE"/>
          <xs:enumeration value="TRAN-SALES"/>
          <xs:enumeration value="TRAN-DELIVER"/>
          <xs:enumeration value="TRAN-RENTS"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
    <xs:sequence>
      <xs:annotation>
        <xs:appinfo source="http://dataformat.org/">
          <xs:layer name="rep" type="AS-COM-TRAN-ID-repType"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:sequence>

    <xs:simpleType name="AS-COM-TRAN-ID-repType">
      <xs:restriction base="xs:string">
        <xs:enumeration value="CP80"/>
        <xs:enumeration value="IC40"/>
        ....
      </xs:restriction>
    </xs:simpleType>

Now in an annotation (not shown above) on the element AS-COM-TRAN-ID there would be a dfdl:valueCalc property which would compute the value of AS-COM-TRAN-ID based on the value of the hidden field. Symmetrically, the 'rep' hidden field would have a dfdl:repCalc property which would give the inverse formula for output.

One difficulty I have with this is the notion that we're projecting into the string type. I.e., these symbolic constants aren't names for integers, but rather we're expressing operations on strings. In the above example the enumerated constants actually are strings, but in other examples they would be integers. The next tier of interpretation, i.e., where we're deciding the variant based on the value of AS-COM-TRAN-ID, would be expressed as string comparisons, which is potentially inefficient. This is part of the problem with using XSD as our type system basis. XSD doesn't have a notion of symbolic named constant.

Alternative: There is the DTD named entity stuff. Does anybody want to propose that?

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA
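A hedged Python sketch of the level-88 semantics described for the COBOL example: a name matches any of its values on reference, and the first value is used when writing. The function names `value_calc` and `rep_calc` merely echo the proposed dfdl:valueCalc and dfdl:repCalc properties; they are hypothetical helpers, not an existing DFDL API.

```python
# Level-88 condition names from the example, each with its stored values in order.
CONDITION_NAMES = {
    "TRAN-COUPON": ("CP80",),
    "TRAN-REVENUE": ("IC40", "RA40"),
    "TRAN-SALES": ("IC40",),
    "TRAN-DELIVER": ("IC44",),
    "TRAN-RENTS": ("RA40", "RA42"),
    "TRAN-RENT-RETURN": ("RA41",),
}


def matches(name: str, stored: str) -> bool:
    """On reference, a condition name matches any one of its values."""
    return stored in CONDITION_NAMES[name]


def rep_calc(name: str) -> str:
    """On output, the first value listed for the name is written."""
    return CONDITION_NAMES[name][0]


def value_calc(stored: str) -> list:
    """On input, return every condition name that the stored value satisfies."""
    return [name for name, values in CONDITION_NAMES.items() if stored in values]


names_for_ic40 = value_calc("IC40")
written_for_rents = rep_calc("TRAN-RENTS")
```

In this particular record 'IC40' satisfies both TRAN-REVENUE and TRAN-SALES, so a valueCalc that must produce a single symbolic name would need some extra rule to choose between them.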

Steve Hanson

5:21 p.m.

New subject: Enums - Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types

The MRM model COBOL importer gives the user an option to create xs:enumeration values from level 88 values, and that is all. This is so the MRM parser can validate a byte stream against the model. This is in keeping with the rule that values supplied with metadata are part of the logical model, not the physical model. What's the motivation for doing anything more than this?

Using your example that gives:

    <xs:element name="AS-COM-TRAN-ID">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="CP80"/>
          <xs:enumeration value="IC40"/>
          ...
        </xs:restriction>
      </xs:simpleType>
    </xs:element>

Regards, Steve

Steve Hanson
WebSphere Business Integration Brokers, IBM Hursley, England
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

Robert E. McGrath

5:49 p.m.

New subject: Enums - Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types

On Tuesday 06 September 2005 12:21, Steve Hanson wrote:

...

The MRM model COBOL importer gives the user an option to create xs:enumeration values from level 88 values, and that is all. This is so the MRM parser can validate a byte stream against the model. This is in keeping with the rule that values supplied with metadata are part of the logical model, not the physical model. What's the motivation for doing anything more than this?

Sorry, I don't know what "the logical model" vs. "the physical model" means. IMO, the goal is to have a way for a producer to generate data (e.g., with 88s as discussed below), which a consumer can interpret as either 88 or symbols, depending on the purpose of the consumer.

...

Using your example that gives:

<xs:element name="AS-COM-TRAN-ID"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="CP80"/> <xs:enumeration value="IC40"/> ... </xs:restriction> </xs:simpleType> </xs:element>

Regards, Steve

Steve Hanson WebSphere Business Integration Brokers, IBM Hursley, England Internet: smh@uk.ibm.com Phone (+44)/(0) 1962-815848


--- Robert E. McGrath
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Champaign, Illinois 61820
(217)-333-6549
mcgrath@ncsa.uiuc.edu

Mike Beckerle

8:21 p.m.

New subject: [dfdl-wg] Opaque/BLOB/Uninterpreted/Raw - also hexBinary - (was Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types)

re: Opaque or uninterpreted or raw fields. These are sometimes called Blobs, though database people reserve that term for the acronym "BLOB", which stands for Binary Large Object and has to do with size being too large for the smaller binary SQL type objects. I.e., there's no such thing as a small BLOB in databases. I think in our mailing list we've used blob to mean "opaque bytes" of any size at all.

I believe use of the 'hexBinary' type is also probably this same topic. I.e., how to deal with data where you don't know its proper interpretation, though you can express how big it is so that we can at least copy it from place to place.

I think there are two choices here. One is just to use "occurring" bytes. E.g., here's uninterpreted data of length 1234 bytes:

    <element name="ignoreMe" type="byte" minOccurs="1234" maxOccurs="1234" dfdl:repType="binary"/>

This is a basic binary byte array. I think this works fine as a blob/opaque type. I believe we do not need any other kind of raw/opaque type. If we had one, we'd have to have a way to express its length, and be specific about the units of that length, and the above accomplishes that with pretty much minimum baggage. You name it what you want, i.e., "unused" or "dummy" or "ignore" or whatever you want.

We might want an annotation to indicate that this data should not be accessed, to distinguish this case from an actual array of bytes that you DO want to access, but I'm not sure that's worth it. Note that the OMG CAM model does have an access control attribute. Perhaps we can use that. However, I doubt it allows distinguishing copy from access.

The alternative is to use the "hexBinary" type for this. In that case we need to express the size in the DFDL annotation:

    <element name="ignoreMe" type="hexBinary" dfdl:repLength="1234" dfdl:repType="binary"/>

I can think of one advantage of hexBinary over the occurring bytes approach: suppose you do want to use DFDL in the obvious way to convert data into XML format. Never mind that DFDL is supposed to enable avoiding this; suppose it's what you want to do. Then my above byte array for the "ignoreMe" element ends up as:

    <ignoreMe>0</ignoreMe><ignoreMe>0</ignoreMe><ignoreMe>0</ignoreMe><ignoreMe>0</ignoreMe><ignoreMe>0</ignoreMe><ignoreMe>0</ignoreMe>....<ignoreMe>0</ignoreMe>

Which is big compared to:

    <ignoreMe>000000000000...00</ignoreMe>

which is what we'd get if we allow hexBinary as a type.

Note that if we add the hexBinary type, you'll still be able to do it the other way, so the hexBinary notion is not strictly speaking necessary or minimalist.

...mikeb

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA
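A small Python sketch that puts numbers on the size difference between the two XML renderings of the 1234-byte "ignoreMe" field. The helper names are hypothetical, and the per-byte rendering assumes each byte is written as a signed decimal, as the lexical form of xs:byte would be.

```python
def as_occurring_bytes(data: bytes, name: str = "ignoreMe") -> str:
    """One element per byte, each value written as a signed decimal (xs:byte)."""
    def signed(b: int) -> int:
        return b - 256 if b > 127 else b
    return "".join(f"<{name}>{signed(b)}</{name}>" for b in data)


def as_hex_binary(data: bytes, name: str = "ignoreMe") -> str:
    """One element for the whole field, two hex digits per byte (xs:hexBinary)."""
    return f"<{name}>{data.hex().upper()}</{name}>"


raw = bytes(1234)                         # 1234 zero bytes, as in the example
per_byte_size = len(as_occurring_bytes(raw))
hex_size = len(as_hex_binary(raw))
```

For all-zero data the per-byte form comes to 27148 characters against 2489 for the hexBinary form, roughly an elevenfold difference; non-zero bytes make the per-byte form larger still.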

Mike Beckerle

8:31 p.m.

New subject: Bit fields - (was Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types)

Bit fields issue.

I have been expecting bit fields to work like this:

<element name="threeBitField" type="xs:int" dfdl:length="3" dfdl:lengthUnitKind="bits"/>

That is, bit fields are representational, but not part of the type system.

Is this idea sufficient?

Alternatively we could add a dfdl:bit type, which is a subtype of xs:byte having values only 0 and 1. Then you could do things like this:

<element name="myBits" type="dfdl:bit" maxOccurs="3" minOccurs="3" dfdl:alignment="0"/>

I put the alignment tag on to emphasize that there's no padding between the bits.
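The difference between the two options can be sketched in Python: the first reads three bits as one integer value, the second as three separate 0/1 values. The helper read_bits is hypothetical, and the sketch assumes bit 0 is the most significant bit of the first byte, which the message does not specify.

```python
def read_bits(data: bytes, bit_offset: int, bit_length: int) -> int:
    """Return bit_length bits starting at bit_offset as an unsigned int.

    Assumes bit 0 is the most significant bit of byte 0 (an assumption,
    not something the proposal states).
    """
    total_bits = len(data) * 8
    if bit_offset < 0 or bit_length < 0 or bit_offset + bit_length > total_bits:
        raise ValueError("bit range lies outside the data")
    as_int = int.from_bytes(data, "big")
    shift = total_bits - bit_offset - bit_length
    return (as_int >> shift) & ((1 << bit_length) - 1)


data = bytes([0b10100000])

# Option 1: one element of integer type, three bits long.
three_bit_field = read_bits(data, 0, 3)

# Option 2: three occurrences of a one-bit element, no padding between them.
my_bits = [read_bits(data, i, 1) for i in range(3)]

print(three_bit_field, my_bits)
```

Both readings consume the same three bits; they differ only in whether the logical model sees one number or an array of bits.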

Robert E. McGrath

9:17 p.m.

New subject: Bit fields - (was Re: split into multiple topics - Re: [dfdl-wg] Issues: additional data types)

On Tuesday 06 September 2005 15:31, Mike Beckerle wrote:

...

Bit fields issue.

I have been expecting bit fields to work like this:

<element name="threeBitField" type="xs:int" dfdl:length="3" dfdl:lengthUnitKind="bits"/>

That is, bit fields are representational, but not part of the type system.

Is this idea sufficient?

Alternatively we could add a dfdl:bit type, which is a subtype of xs:byte having values only 0 and 1. Then you could do things like this:

<element name="myBits" type="dfdl:bit" maxOccurs="3" minOccurs="3" dfdl:alignment="0"/>

I put the alignment tag on to emphasize that there's no padding between the bits.

A tricky case: a bit field that crosses a byte boundary.

      byte 0            byte 1
bit   0 1 2 3 4 5 6 7   0 1 2 3 4 5 6 7
     |a a a|b b b b|c   c c c c|d|0 0 0|
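One way to see why this case is tricky is to unpack it by hand. The Python sketch below reads the diagram as four consecutive fields of widths a=3, b=4, c=5, d=1 followed by three pad bits, with bit 0 taken as the most significant bit of byte 0. Both the widths and the bit order are assumptions drawn from the diagram, and unpack_fields and LAYOUT are hypothetical names.

```python
from typing import Dict, List, Tuple

# Field names and widths in bits, in storage order (assumed from the diagram).
LAYOUT: List[Tuple[str, int]] = [("a", 3), ("b", 4), ("c", 5), ("d", 1)]


def unpack_fields(data: bytes, layout: List[Tuple[str, int]]) -> Dict[str, int]:
    """Unpack consecutive unsigned bit fields, most significant bit first."""
    total_bits = len(data) * 8
    as_int = int.from_bytes(data, "big")
    fields: Dict[str, int] = {}
    offset = 0
    for name, width in layout:
        if offset + width > total_bits:
            raise ValueError(f"field {name!r} runs past the end of the data")
        shift = total_bits - offset - width
        fields[name] = (as_int >> shift) & ((1 << width) - 1)
        offset += width
    return fields


# Bits: 101 | 1101 | 1 0110 | 1 | 000, so field c straddles the two bytes.
packed = bytes([0b10111011, 0b01101000])
fields = unpack_fields(packed, LAYOUT)
print(fields)
```

Treating the two bytes as one 16-bit quantity sidesteps the boundary, but that only works once a bit order has been fixed, which is exactly what the representation properties would need to pin down.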

14 comments, 4 participants: Jim Myers, Mike Beckerle, Robert E. McGrath, Steve Hanson