Re: [dfdl-wg] Issues: additional data types

Hi Robert,

As I said in my other reply to you, one of the features that we decided upon with DFDL was that the model we translate into is the XML data model. So any file described by DFDL should be translatable into a well-formed XML document. The XML data model may not be ideal, but it is the standard data model.

The reason that the types you describe are not in DFDL is that they are not in XML. That is not to say they are not important or should not be addressed. My opinion is that they can be built out of the existing DFDL/XML components and that this is the correct way to handle them. The standard should provide a document that describes one or more ways in which these types can be achieved.

More inline...

Robert E. McGrath wrote:
..snip...
*1. Enum*
This type has a set of <name, value> pairs, e.g., <"Red", 0>, <"Blue", 1>, etc. The values are stored in the data, with the name-value pairs stored in metadata.
Note: one use is for localization, using different maps to give localized strings.
*Difficulty*: Low
*Priority:* Low
I'm confused about what you want to achieve. If you only store the value (an integer), what is the function of the name? Is it just for human beings reading the file, or is there some way it is used programmatically? One way to approach this type would be a choice over a series of tags with appropriate attribute types constrained to hardwired values: <enum name="Red" value="0"/> (a complex type with two attributes, one a string constrained to the single value "Red", the other an integer constrained to the value 0). You could then use the DFDL annotations to ensure that this tag gets picked when zero occurs in the file.
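To make the enum idea concrete, here is a minimal sketch in Python rather than DFDL. It assumes each stored value occupies one byte and that the name-value maps are supplied as metadata; `COLOUR_ENUM`, `COLOUR_ENUM_FR`, `enum_element`, and `parse_enums` are hypothetical names, not part of any DFDL proposal. It emits the `<enum name=... value=.../>` element suggested above, and shows how a second map over the same stored values could serve the localization use.

```python
from xml.etree import ElementTree as ET

# Hypothetical metadata: the name/value pairs live outside the data.
COLOUR_ENUM = {0: "Red", 1: "Blue"}
# A second map over the same stored values gives localized names.
COLOUR_ENUM_FR = {0: "Rouge", 1: "Bleu"}


def enum_element(stored_value, names):
    """Build the <enum name=... value=.../> element for one stored integer."""
    if stored_value not in names:
        raise ValueError(f"no enum member has value {stored_value}")
    return ET.Element("enum", name=names[stored_value], value=str(stored_value))


def parse_enums(data, names):
    """Treat each byte of `data` as one stored enum value (an assumption)."""
    return [enum_element(b, names) for b in data]


elements = parse_enums(bytes([0, 1, 0]), COLOUR_ENUM)
print([ET.tostring(e, encoding="unicode") for e in elements])
```

Only the integers are read from the data; swapping `COLOUR_ENUM` for `COLOUR_ENUM_FR` changes the names without touching the stored values.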
*2. Opaque (tagged)*
This is some kind of non-numeric bit string, with a length and some kind of tag.
This might be used, for example, for 1024-bit encryption keys. The type means "just pass through the bits".
Generally, can be used to store any kind of "blob", which can be objects that are meaningful to specific software.
This can be simulated with unsigned integers, but it may be useful to know that it is not really an integer, or whatever.
*Difficulty*: Low
*Priority:* Low
This is just a sequence of bytes... it may need to be hidden (the layering introduces a requirement for it to be possible to be explicit about what is visible at a particular layer). (I don't know what the current favourite way to do this is).
*3. "Code"*
How should "code" be marked up? It is usually stored in blobs, but it needs a tag so you know how to interpret it.
This is actually a special case of "opaque".
*Difficulty*: Low
*Priority:* Low
I don't understand what you mean.
*4. Bitfield / packed*
This type is bits packed into bytes.
*Difficulty*: Low
*Priority:* Low
Agreed. I think we can do this but it should be easier.
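As an illustration of what "bits packed into bytes" involves, the following Python sketch splits a byte string into unsigned fields of given bit widths. The most-significant-bit-first ordering and the 3/5/8-bit layout are assumptions chosen for the example; `unpack_bitfields` is a hypothetical helper, not a DFDL construct.

```python
def unpack_bitfields(data, widths):
    """Split `data` into unsigned fields of the given bit widths, MSB first."""
    total_bits = len(data) * 8
    if sum(widths) > total_bits:
        raise ValueError("field widths exceed the available bits")
    as_int = int.from_bytes(data, "big")
    fields = []
    remaining = total_bits  # bits not yet consumed, counted from the left
    for width in widths:
        remaining -= width
        fields.append((as_int >> remaining) & ((1 << width) - 1))
    return fields


# Hypothetical layout: a 3-bit flag group, a 5-bit code, an 8-bit counter.
print(unpack_bitfields(bytes([0b10100011, 0x7F]), [3, 5, 8]))  # [5, 3, 127]
```

A description language would need to state the same three things the function takes for granted: the field widths, the bit order, and the byte order.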
*5. Pointer*
Many times there will be pointers within the data, e.g., to offsets in the file, or to indexes in an array. This will be critical for storing objects such as lists or trees.
URLs and XPaths are not especially well suited for this.
This can be simulated with unsigned integers, but they need to be "swizzled" when translating, so they need to be tagged.
Note that there might be several types of addressing within the data:
- Offset from zero
- Offset relative to "foo"
The offsets might be in different increments: bits, bytes, words, elements, etc.
There could be multi-part addresses, e.g., page + offset in page.
*Difficulty*: Medium
*Priority:* High
I spent a long time thinking about pointers at one time. I was unable to come up with anything I felt covered the bases. Perhaps you have enough experience with pointer representations to help us out here. An interesting problem that comes to mind though is what is the XML representation of the pointer value. If it is a tag like: <pointer offset="20" offsetType="bytes" index="5" indexType="float32"/> then all that is needed is to define the metadata conventions that allow that to be correctly interpreted. This is a little unsatisfactory though...
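The addressing variations Robert lists (offset from zero, offset relative to a named location, different increments, page plus offset) can be sketched as follows. This is a Python illustration under stated assumptions, not a proposed DFDL representation: the `Pointer` class, the `anchors` map, the 32-bit word size, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Size of each addressing unit in bits. The 32-bit word is an assumption.
UNIT_BITS = {"bits": 1, "bytes": 8, "words": 32}


@dataclass(frozen=True)
class Pointer:
    offset: int           # the value stored in the data
    unit: str = "bytes"   # the increment the offset counts in
    base: str = "start"   # "start" means an offset from zero


def resolve(ptr, anchors):
    """Return the absolute bit position a pointer refers to.

    `anchors` maps base names to absolute bit positions, e.g. where "foo" begins.
    """
    base_bits = 0 if ptr.base == "start" else anchors[ptr.base]
    return base_bits + ptr.offset * UNIT_BITS[ptr.unit]


def resolve_paged(page, offset_in_page, page_size_bytes):
    """Multi-part address: page number plus byte offset within the page."""
    return (page * page_size_bytes + offset_in_page) * 8


anchors = {"foo": 128}  # "foo" starts at bit 128, i.e. byte 16
print(resolve(Pointer(20), anchors))                 # 160: offset from zero
print(resolve(Pointer(5, "words", "foo"), anchors))  # 288: relative to "foo"
print(resolve_paged(2, 10, 4096))                    # 65616
```

Resolving every pointer to a single unit (bits here) is one way to make "swizzling" mechanical: a translator can recompute the stored offset for the target layout from the resolved position.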
*6. Array*
This is a critical type, must be supported.
There are a lot of issues.
I am preparing a separate memo.
*Difficulty*: High
*Priority:* Very High
We have talked a lot about arrays. A big issue is that there are several ways you may want to represent an array within your XML data model. IMO the right way will depend on how you want to use the data. I think the right way to do this is to have a series of recipes for users to capture array semantics in their DFDL files.

Cheers, Martin
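To illustrate why there is no single obvious XML form for an array, here is a small Python sketch that renders the same row-major 2 x 3 data in two ways: a flat list of items with the shape in an attribute, and nested rows. The element names (`array`, `row`, `item`) and the `dims` attribute are hypothetical, chosen only for the example.

```python
from xml.etree import ElementTree as ET


def array_flat(values, shape):
    """One <item> per value, with the shape carried as an attribute."""
    root = ET.Element("array", dims=" ".join(str(n) for n in shape))
    for v in values:
        ET.SubElement(root, "item").text = str(v)
    return root


def array_nested(values, n_rows, n_cols):
    """One <row> per row, each holding its own <item> children."""
    root = ET.Element("array")
    for r in range(n_rows):
        row = ET.SubElement(root, "row")
        for v in values[r * n_cols:(r + 1) * n_cols]:
            ET.SubElement(row, "item").text = str(v)
    return root


data = [1, 2, 3, 4, 5, 6]  # a 2 x 3 array stored in row-major order
print(ET.tostring(array_flat(data, (2, 3)), encoding="unicode"))
print(ET.tostring(array_nested(data, 2, 3), encoding="unicode"))
```

The flat form keeps the XML close to the stored layout and leaves indexing to the reader; the nested form makes rows addressable directly but fixes one view of the dimensions. Which is preferable depends on how the data will be used.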

> My opinion on these is that they can be built out of the existing
> DFDL/XML components and that this is the correct way to handle them. The
> standard should provide a document that describes one or more ways in
> which these types can be achieved.

Precisely. The critical thing is that there is a standard for how to do it, so people can share.

> > *2. Opaque (tagged)*
>
> This is just a sequence of bytes... it may need to be hidden (the
> layering introduces a requirement for it to be possible to be explicit
> about what is visible at a particular layer). (I don't know what the
> current favourite way to do this is).

Well, I think this is NOT a sequence of bytes, it is a single n-byte object. I.e., you should not access individual bytes, you should not add one, etc. You definitely shouldn't convert it to integers, and so on.

This is not difficult to implement: you just need to define a complex type that has a tag plus a payload (which can be a sequence of bytes if you want). But the type is not a subclass of byteseq.

> > *3. "Code"*
> > [...]
>
> I don't understand what you mean.

Suppose the file contains data plus Java classes. The Java classes need to be marked as a blob to be interpreted by the JVM. Again, it is not an array of bytes because you can't convert it to integers, etc.

Here the important goal is to have a very standard way for a reader to know what is intended. I.e., a well-defined annotation, along with clear instructions about how to use it. Not especially difficult, but it needs to be a standard to work.

---
Robert E. McGrath
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Champaign, Illinois 61820
(217)-333-6549
mcgrath@ncsa.uiuc.edu
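The opaque type Robert describes (a tag plus a payload that is held as bytes but is not itself a byte sequence) can be sketched in Python as below. The class name `Opaque`, the tag string, and the `write_to` method are hypothetical; the point is only that the wrapper exposes a length and a pass-through write, and deliberately defines no indexing, iteration, arithmetic, or integer conversion.

```python
import io


class Opaque:
    """A tagged n-byte object that is passed through, not inspected."""

    __slots__ = ("_tag", "_payload")

    def __init__(self, tag, payload):
        self._tag = tag
        self._payload = bytes(payload)  # private copy; never handed out

    @property
    def tag(self):
        """The label that tells a reader how the payload is to be interpreted."""
        return self._tag

    def __len__(self):
        """Size in bytes, which is known even though the content is opaque."""
        return len(self._payload)

    def write_to(self, stream):
        """Copy the bits through unchanged to a binary stream."""
        stream.write(self._payload)


key = Opaque("encryption-key", bytes(128))  # 128 bytes = 1024 bits
out = io.BytesIO()
key.write_to(out)
print(key.tag, len(key))
```

Because `Opaque` wraps the bytes instead of subclassing `bytes`, expressions such as `key[0]`, `key + 1`, or `int(key)` raise `TypeError`, which is one way to make byte-level access an error while still recording the length.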

Hmm. If we take your example of a class file, then if I am the Java compiler I really do want arrays of bytes as this data, and I append to them as well. On the other hand, if I'm someone dumping the symbol table of a class file, then I don't want the byte codes to be in arrays of bytes? Why not? It would be nice to use the exact same schema as is used to create the file.

So this feels inconsistent to me. I don't draw a distinction between a sequence of N bytes and an N-byte representational-size object of undetermined type. Or rather, the difference is just in the intent of the person using it.

Do you want a type where it is simply an error to access individual bytes? I.e., so we can express the length in bytes but without expressing that you should/can access it byte by byte?

...mikeb

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA

"Robert E. McGrath" <mcgrath@ncsa.uiuc.edu>
Sent by: owner-dfdl-wg@ggf.org
09/06/2005 12:39 PM
To: Martin Westhead <martinwesthead@yahoo.co.uk>
cc: dfdl-wg@gridforum.org
Subject: Re: [dfdl-wg] Issues: additional data types

Mike Beckerle wrote:
I don't draw a distinction between a sequence of N bytes and an N-byte representational-size object of undetermined type. Or rather, the difference is just in the intent of the person using it.
I think that this discussion raises important issues of scope and extensibility. I think that it would be reasonable to have an annotation to capture the fact that a sequence of bytes is (usually) to be treated as a block. I think such an annotation would be description metadata. It should be possible to attach rich metadata to the tags to allow an application to get access to information about what it might do with such a block (or indeed any other type).

There is an important question as to how much of this metadata we try to define. My take would be as little as possible. The metadata could get pretty involved: you need to know 1) it's code, 2) it's Java, 3) it requires JVM 1.4+, 4) it needs Swing, or whatever.

Cheers, Martin