postfix separators, terminators, finalTerminatorCanBeMissing

I would like to explore the semantics of separators and terminators, and raise a question about consistency with regard to toleration of missing separators/terminators. Sorry for the barrage of questions lately - the implementation is uncovering some new angles.

Relevant snippets from v0.36 of spec:

Section 14.2 Text Markup
The terminator region contains the terminator string. When a terminator is expected it is a processing error if one of the values is not found. However, if dfdl:finalTerminatorCanBeMissing is specified then it is not an error if the terminator is not found.
...
When the finalTerminatorCanBeMissing property is true, then when an element is the last element in a sequence or array, then on input, it is not a parse error if the terminator is not found but end-of-parent or an enclosing delimiter is encountered instead.

Section 17.3 Sequence groups with delimiters
The separator region contains one of the strings specified by the dfdl:separator property. When this property has "" (empty string) as its value then the separator region is of length zero.
...
‘postfix’ means the separator occurs after each element. On parsing the separator after the last item is optional. On unparsing the final separator will always be written.

Section 17.3.1 Sequence groups and separators
re: ordered/suppressAtEnd : All separators must be found in the data except that when the sequence has trailing optional items, the separators are suppressed for any final missing items.

My interpretation of the spec:

a) If an element's parent group defines a separator, that separator might not appear after the element. Instead, the group might be terminated early by the group's own terminator, or by the separator/terminator of an enclosing element/group or by end-of-data.
b) On the other hand, if an element defines a terminator, that terminator *must* appear after the element unless FTCBM="true" (in which case the element and its parent group can be terminated early by enclosing markup or end of data).

c) separatorPosition="postfix" is not enforced rigidly. The input document can always be constructed as if separatorPosition="infix" and the parser will not complain. This allows early termination of a separated group by enclosing markup, as well as by end-of-data.

d) The FTCBM flag allows the terminator of the final group member to be missing. This allows early termination of the group by enclosing markup or by end-of-data.

I have reservations about these rules.

- It seems overly lax to unconditionally allow 'postfix' to behave like 'infix'. The equivalent flexibility for a terminator requires FTCBM to be set to "true".
- FTCBM is not as useful as it seems because it only applies to the final group member. If the final group member is optional, the user will be forced to use a postfix separator, and will then lose the control afforded by FTCBM.
- DFDL needs to allow strict validation of postfix separators/terminators. I can't see a way to achieve that with the current rules (see example below).

Example:
Lines are separated by <lf>. Lines have up to 3 fields. Fields can be empty. Fields are always terminated by a *.

    line:field1*field2*field3*<lf>
    line:field1*field2*<lf>
    line:field1**field3*<lf>

With the current rules, this form of the second line

    line:field1*field2<lf>

...would also be allowed (assuming that the * is defined as a postfix separator with separatorPolicy="suppressAtEnd").

Note that the missing * after field2 is silently tolerated because postfix separators are allowed to be omitted. To enforce the presence of the * after field2 it would have to be defined as a terminator on every field.
But that would remove the flexibility afforded by the use of separators (see third line).

A possible solution:

- Strictly enforce separatorPosition="postfix".
- Make terminators mandatory.
- Remove the FTCBM flag, and replace it with a flag which tolerates end-of-data where any separator/terminator was expected. The definition of end-of-data would include the end of a defined-length parent element, but would specifically exclude end-of-parent caused by enclosing markup (because that would re-introduce the ambiguity which I'm trying to avoid).

These rules are considerably tighter than the existing ones, but I don't think they make anything impossible. I do think they make the meaning of the various settings a lot simpler. Terminators would be less 'optional' than before, but I suspect that the real-world scenarios would be catered for.

Anyway - comments invited. (invitation unnecessary, I suspect)

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
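
Illustration of the example: the difference between the current, lenient reading of a postfix separator and the stricter rule proposed above can be sketched in a few lines of Python. This is only a toy model, not a DFDL parser. The function name parse_fields and its strict flag are hypothetical and do not correspond to any DFDL property. The sketch also assumes that "line:" in the example is a label rather than data, and that the <lf> has already been stripped from each line.

```python
def parse_fields(body, sep="*", max_fields=3, strict=False):
    """Split one line body into fields using a postfix separator.

    strict=False models the current reading: the separator after the
    last field may be missing.
    strict=True models the proposed rule: every field, including the
    last one present, must be followed by the separator.
    Both the function and the flag are hypothetical illustrations.
    """
    if body == "":
        return []  # no fields present at all
    if body.endswith(sep):
        # Drop the final postfix separator, then split on the rest.
        parts = body[:-len(sep)].split(sep)
    elif strict:
        raise ValueError(f"missing final {sep!r} in {body!r}")
    else:
        # Lenient: tolerate the missing final separator (infix-like data).
        parts = body.split(sep)
    if len(parts) > max_fields:
        raise ValueError(f"more than {max_fields} fields in {body!r}")
    return parts


if __name__ == "__main__":
    for body in ["field1*field2*field3*", "field1*field2*",
                 "field1**field3*", "field1*field2"]:
        for strict in (False, True):
            try:
                print(body, strict, parse_fields(body, strict=strict))
            except ValueError as err:
                print(body, strict, "rejected:", err)
```

In this sketch the three well-formed lines parse identically in both modes, while field1*field2 is accepted only in the lenient mode. The empty middle field of the third line is preserved in both modes, which is the flexibility that would be lost if * were defined as a terminator on every field.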