Hi Jim,
I think the others should be used implicitly, based on order and type.
Sorry, tokenizer should be split; that was an unpropagated change.
Chartostring (which I called concatenate) is to be used first, because it is the more specific match.
EOS is up for grabs. I was thinking of it as a returned value (e.g. -1), but an exception might (or might not) be easier to make sense of.
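(To illustrate the two options, here is a rough Python sketch. It is purely illustrative, not a proposal for DFDL syntax, and the helper names are made up: the read step either returns a sentinel value of the same type, or raises an exception at end of stream.)

    # Sketch only: two ways a read step could signal end of stream (EOS).
    class EndOfStream(Exception):
        pass

    def read_int_sentinel(tokens):
        # Option 1: EOS reported as a returned value of the same type (e.g. -1).
        return int(next(tokens, -1))

    def read_int_exception(tokens):
        # Option 2: EOS reported as an exception the caller must handle.
        try:
            return int(next(tokens))
        except StopIteration:
            raise EndOfStream("no more values in the stream")

The sentinel form is simpler, but it cannot distinguish a genuine -1 in the data from end of stream, which may be the argument for the exception.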
Regarding the new model: I don't think this is a problem at the level of your example. We could simply use a single sequence and a more complex "split" conversion. I imagine that the "split" conversion we would want to settle on should accept a regular expression (or at least a list of separators). In your example you just have to allow the separator to be a newline OR a comma and you are done.
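To make that concrete, here is a rough Python sketch of the idea (not DFDL syntax, and the function name is made up): a single split whose separator is a regular expression matching either a comma or a newline, so a 10x5 CSV parses straight into one flat sequence of ints.

    import re

    # Sketch only: one split conversion whose separator is a regular
    # expression (comma OR newline), feeding a single logical sequence.
    def split_conversion(text, separator=r"[,\n]"):
        return re.split(separator, text.strip())

    raw = b"1,2,3,4,5\n6,7,8,9,10\n"                       # physical bytes
    chars = raw.decode("ascii")                            # bytetochar / chartostring
    ints = [int(tok) for tok in split_conversion(chars)]   # stringtoint
    print(ints)    # [1, 2, ..., 10]; a 10x5 file would give all 50 ints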
A note here: this is intended as a rough sketch, not a finished design. I am expecting the details to need to be worked out. In particular, I think Mike/IBM have some fairly complex ideas for separator/terminator/initiator/escape that we will have to try to fit into this framework.
Thanks,
Martin
From: Jim Myers [mailto:jimmyers@ncsa.uiuc.edu]
Sent: Wednesday, March 01, 2006 3:49 AM
To: Westhead, Martin (Martin); dfdl-wg@ggf.org
Subject: Re: [dfdl-wg] CSV string worked example
Martin - two types of comments below: things I think are typos/inconsistencies, and an alternate logic.
Clarifications:
- Are the initial definitions on the top element defining an order to use subsequently, or are they just there for us to see what you've defined?
- Of the four there, you only explicitly (in a comment?) invoke one - are the others implicit because of the order?
- You use dfdl:tokenizer as a conversion later - is that supposed to be split as well?
- Is bytetochar used implicitly before the first split?
- Is chartostring used implicitly before stringtoint, which is implicitly used to get the int element?
- Is EOS a returned value (and therefore of the type being returned), or is it an exception?
Logical - what happens if the rows are not in the logical model? Physically there are 10 rows with 5 elements, but the logical model is 50 ints in a single sequence. To support this, you'd need to have both tokenization steps in one sequence annotation with two separate split separators - does the use of setLocal for the split separator work in this case? (Is this how byteorder is now used?)
Thinking about missing values - is it clear how a missing row versus a missing element is now handled? (I think so.) The split conversion using comma can define a default input to use if the stream it receives is empty (from a \n\n pair), and the stringtoint conversion can do likewise to cover a ,, pair.
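(A rough Python illustration of that default idea, not DFDL syntax and with a made-up helper name: any empty token coming out of the split, e.g. from a ,, or \n\n pair, is replaced by a per-conversion default before the integer conversion runs.)

    import re

    # Sketch only: substitute a default wherever the split yields an empty
    # token, so ",," or "\n\n" pairs produce a placeholder instead of failing.
    def split_with_default(text, separator=r"[,\n]", default="0"):
        return [tok if tok else default for tok in re.split(separator, text)]

    print([int(t) for t in split_with_default("1,,3\n\n4,5")])
    # -> [1, 0, 3, 0, 4, 5]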
Jim
At 09:25 PM 2/28/2006, Westhead, Martin (Martin) wrote:
Hi Folks,
I have tried to work through the CSV example that Mike suggested a couple of
weeks ago. It has turned up some interesting issues which I have tried to
address. These are less about making the underlying semantics work and more about providing a seamless default setup that makes the easy things work just as you would like.
I was pushed for time on this, so I apologize if it is unclear in places, but I wanted to put it out before tomorrow's meeting.
Thanks,
Martin
Associate Director, Cyberenvironments and Technologies, NCSA
1205
217-244-1934
jimmyers@ncsa.uiuc.edu