Hi Jim,
I think the others should be used implicitly, based on order and type.
Sorry, tokenizer should be split; that was an unpropagated change.
Chartostring (which I called concatenate) is to be used first, because it is the more specific match.
EOS is up for grabs. I was thinking of it as a returned value (e.g. -1), but an exception might (or might not) be easier to make sense of.
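(To illustrate the two options, here is a rough Python sketch. It is purely illustrative, not a proposal for DFDL syntax, and the helper names are made up: the read step either returns a sentinel value of the same type, or raises an exception at end of stream.)

    # Sketch only: two ways a read step could signal end of stream (EOS).
    class EndOfStream(Exception):
        pass

    def read_int_sentinel(tokens):
        # Option 1: EOS reported as a returned value of the same type (e.g. -1).
        return int(next(tokens, -1))

    def read_int_exception(tokens):
        # Option 2: EOS reported as an exception the caller must handle.
        try:
            return int(next(tokens))
        except StopIteration:
            raise EndOfStream("no more values in the stream")

The sentinel form is simpler, but it cannot distinguish a genuine -1 in the data from end of stream, which may be the argument for the exception.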
Regarding the new model: I don't think this is a problem at the level of your example. We could simply use a single sequence and a more complex "split" conversion. I imagine that the "split" conversion we would want to settle on should accept a regular expression (or at least a list of separators). In your example you just have to allow the separator to be a newline OR a comma and you are done.
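To make that concrete, here is a rough Python sketch of the idea (not DFDL syntax, and the function name is made up): a single split whose separator is a regular expression matching either a comma or a newline, so a 10x5 CSV parses straight into one flat sequence of ints.

    import re

    # Sketch only: one split conversion whose separator is a regular
    # expression (comma OR newline), feeding a single logical sequence.
    def split_conversion(text, separator=r"[,\n]"):
        return re.split(separator, text.strip())

    raw = b"1,2,3,4,5\n6,7,8,9,10\n"                       # physical bytes
    chars = raw.decode("ascii")                            # bytetochar / chartostring
    ints = [int(tok) for tok in split_conversion(chars)]   # stringtoint
    print(ints)    # [1, 2, ..., 10]; a 10x5 file would give all 50 ints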
A note here: this is intended as a rough sketch, not a finished design. I am expecting the details to need to be worked out. In particular, I think Mike/IBM have some fairly complex ideas for separator/terminator/initiator/escape that we will have to try to fit into this framework.
Thanks,
Martin
From: Jim Myers [mailto:jimmyers@ncsa.uiuc.edu]
Sent: Wednesday, March 01, 2006 3:49 AM
To: Westhead, Martin (Martin); dfdl-wg@ggf.org
Subject: Re: [dfdl-wg] CSV string worked example
Martin - two types of comments below: things I think are typos/inconsistencies, and an alternate logic.
Clarifications:
- Are the initial definitions on the top element defining an order to use subsequently, or are they just there for us to see what you've defined?
- Of the four there, you only explicitly (in a comment?) invoke one - are the others implicit because of the order?
- You use dfdl:tokenizer as a conversion later - is that supposed to be split as well?
- Is bytetochar used implicitly before the first split?
- Is chartostring used implicitly before stringtoint, which is implicitly used to get the int element?
- Is EOS a returned value (and therefore of the type being returned), or is it an exception?
Logical - what happens if the rows are not in the logical model? Physically there are 10 rows with 5 elements, but the logical model is 50 ints in a single sequence. To support this, you'd need to have both tokenization steps in one sequence annotation with two separate split separators - does the use of setLocal for the split separator work in this case? (Is this how byteorder is now used?)
Thinking about missing values - is it clear how a missing row versus a missing element is now handled? (I think so.) The split conversion using comma can define a default input to use if the stream it receives is empty (from a \n\n pair), and the stringtoint conversion can do likewise to cover a ,, pair.
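(A rough Python illustration of that default idea, not DFDL syntax and with a made-up helper name: any empty token coming out of the split, e.g. from a ,, or \n\n pair, is replaced by a per-conversion default before the integer conversion runs.)

    import re

    # Sketch only: substitute a default wherever the split yields an empty
    # token, so ",," or "\n\n" pairs produce a placeholder instead of failing.
    def split_with_default(text, separator=r"[,\n]", default="0"):
        return [tok if tok else default for tok in re.split(separator, text)]

    print([int(t) for t in split_with_default("1,,3\n\n4,5")])
    # -> [1, 0, 3, 0, 4, 5]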
Jim
At 09:25 PM 2/28/2006, Westhead, Martin (Martin) wrote:
Hi Folks,
I have tried to work through the CSV example that Mike suggested a couple of
weeks ago. It has turned up some interesting issues which I have tried to
address. These are less about making the underlying semantics work and more about providing a seamless default setup that makes the easy things work just as you would like.
I was pushed for time on this, so I apologize if it is unclear in places, but I wanted to put it out before tomorrow's meeting.
Thanks,
Martin
Associate Director, Cyberenvironments and Technologies, NCSA
1205
217-244-1934
jimmyers@ncsa.uiuc.edu