Hi Bradley
I think this would work. Presumably the controlWord element would be
minOccurs='0', maxOccurs='unbounded'? If so, all occurrences are optional,
and empty optional elements are not added to the infoset, so you won't
get unwanted empty elements.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Bradley Sexton <bradley.r.sexton@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 01/03/2012 14:48
Subject: Re: [DFDL-WG] DFDL Modeling Question
Sent by: dfdl-wg-bounces@ogf.org
After some internal discussion, I believe we are going to put RTF on the
shelf for the time being and look at some other formats. One question did
come up that I was hoping someone here might be able to help with. I was
asked whether there is a way to flat-model RTF such that it would work for
any file size and any depth of nested groups, similar to what Steve
proposed earlier:
dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\" dfdl:separatorPosition="prefix"
but suitable for any number of "}" characters before the "\" or "{\".
A possibility suggested to me was to use:
dfdl:separator="\ { }"
to treat every instance of these symbols as a separator and, in cases
such as "}}{\", to consider the values between adjacent separator
characters as empty or null. Any thoughts on this method, or on
alternatives for a general flat model, would be greatly appreciated.
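
A minimal Python sketch of this idea, outside DFDL, may help to see what
the flat result would look like. It treats every "\", "{" and "}" as a
separator and drops the empty values between adjacent separators. The
function name flat_control_words and the shortened sample string are
assumptions made for illustration, and a DFDL processor's handling of
adjacent separators may differ in detail from a plain split.

```python
import re

def flat_control_words(rtf):
    """Split on every '\\', '{' or '}' and discard the empty pieces."""
    # Each of the three characters is treated as a separator in its own right.
    pieces = re.split(r"[\\{}]", rtf)
    # Adjacent separators such as "}}{\" produce empty strings between them;
    # dropping them stands in for empty optional elements being left out.
    return [piece for piece in pieces if piece != ""]

# Shortened, hypothetical sample in the style of the RTF example in this thread.
sample = r"{\rtf1\ansi{\fonttbl{\f0\froman Times New Roman;}{\f1\fswiss Arial;}}\pard}"
print(flat_control_words(sample))
```

Because no separator in this scheme is longer than one character, the
nesting depth no longer matters, but any trailing data stays attached to
its control word (for example "froman Times New Roman;").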
Bradley
On Fri, Feb 24, 2012 at 10:31 AM, Bradley Sexton <bradley.r.sexton@gmail.com> wrote:
Steve,
The order of nested groups is somewhat fluid in RTF, and my concern is
whether modeling everything completely flat would preserve the structure
and formatting properly. If you modify the formatting in a file, for
example by inserting a comment, a new group is created, and any data
entered within the comment, or existing text highlighted by the comment,
is moved into new groups to signify the link.
Feel free to put me down for the WG call, just let me
know the time and call info.
Thanks,
Bradley Sexton
On Thu, Feb 23, 2012 at 4:31 PM, Steve Hanson <smh@uk.ibm.com> wrote:
Hi Bradley
Yes, dfdl:lengthKind="pattern" is the ideal way to model this.
I'm struggling to find a way to model this that preserves the nested groups
and separates the trailing data from the control word. However, if you were
prepared to lose the group structure and treat the trailing data as part
of the control word, then you could model a completely flat structure with
the various delimiters interpreted as a prefix separator:
dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\" dfdl:separatorPosition="prefix"
That would give you an infoset like:
<file>
<controlWord>rtf1</controlWord>
<controlWord>ansi</controlWord>
<controlWord>ansicpg1252</controlWord>
<controlWord>deff0</controlWord>
<controlWord>deflang1033</controlWord>
<controlWord>fonttbl</controlWord>
<controlWord>f0</controlWord>
<controlWord>froman</controlWord>
<controlWord>fprq2</controlWord>
<controlWord>fcharset0 Times New Roman;</controlWord>
<controlWord>f1</controlWord>
<controlWord>fswiss</controlWord>
<controlWord>fcharset0 Arial;</controlWord>
<controlWord>*</controlWord>
<controlWord>generator Msftedit 5.41.15.1515;</controlWord>
<controlWord>viewkind4</controlWord>
<controlWord>uc1</controlWord>
<controlWord>pard</controlWord>
<controlWord>f0</controlWord>
<controlWord>fs24 This is an example document of an RTF file.</controlWord>
<controlWord>f1</controlWord>
<controlWord>fs20</controlWord>
<controlWord>par</controlWord>
<controlWord>*</controlWord>
<controlWord>passwordhash 010000004c000000010000000480000050c3. . .</controlWord>
</file>
Not ideal. I'll carry on thinking about the problem.
If you like I'll add you to the invite list for the DFDL WG call next Tuesday
and we can discuss further?
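
As a rough way to experiment with the prefix-separator list outside DFDL,
the sketch below imitates it in Python with a longest-match-first regular
expression. The names SEPARATORS, PATTERN and split_on_prefix_separators
and both sample strings are assumptions for illustration only; the code
approximates the behaviour and is not how any DFDL processor is
implemented.

```python
import re

# The eight alternatives from the dfdl:separator list above.
SEPARATORS = ["\\", "}\\", "}}\\", "}}}\\", "{\\", "}{\\", "}}{\\", "}}}{\\"]

# Try longer alternatives first so that "}}{\" wins over "{\" or "\".
PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(SEPARATORS, key=len, reverse=True))
)

def split_on_prefix_separators(rtf):
    """Return the values that follow each prefix separator.

    Assumes the data starts with a separator, so the (empty) piece in
    front of the first separator is discarded.
    """
    return PATTERN.split(rtf)[1:]

# Hypothetical sample with at most two closing braces in a row.
shallow = r"{\rtf1\ansi{\fonttbl{\f0\froman;}{\f1\fswiss;}}\pard"
print(split_on_prefix_separators(shallow))

# Four closing braces in a row are not in the list, so one "}" leaks
# into the preceding value.
deep = r"{\a{\b{\c{\d{\e x}}}}\f"
print(split_on_prefix_separators(deep))
```

The second call shows the limit of an enumerated list: it only covers as
many consecutive "}" characters as have been written into the separator
property.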
Regards
Steve Hanson
From: Bradley Sexton <bradley.r.sexton@gmail.com>
To: dfdl-wg@ogf.org
Date: 23/02/2012 19:07
Subject: [DFDL-WG] DFDL Modeling Question
Sent by: dfdl-wg-bounces@ogf.org
Hello,
I've been looking at modeling Rich Text Format (RTF) files using the IBM
Message Broker DFDL implementation, and ran into an issue. For some background,
here's a small example of an RTF file:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0
Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs24
This is an example document of an RTF file.\f1\fs20\par{\*\passwordhash
010000004c000000010000000480000050c300001400000010000000f89c360d0c9d360d000000008bc29e2f78a2144122ed68a1701e2ea50bbbbeaf7333c40dfe048ccf55f709b8cc7e8b49}}
'\' and '\*\' mark the beginning of control words, and curly braces mark
the beginning and end of control groups, which contain control words and
data. My issue is that control words and data do not have suitable
terminators for parsing. A control word is ended by a space when trailing
data is present, but more typically by a '\' signalling the beginning of a
new word, or by a curly brace signalling the end of the current control
group or the beginning of a new one. Similarly, data is typically ended by
the '}' of the parent control group.
With the exception of a small header, the value and placement of control
words, groups, and data vary by file.
My problem with modeling this is that I was going to use dfdl:lengthKind="pattern"
in lieu of suitable delimiters, but this feature is not implemented by
IBM. I'm looking for an alternative way to model the data, and was hoping
someone on the mailing list might have suggestions. My goal is to model
control words and groups in as general a manner as possible given
IBM's implementation restrictions, since RTF has over 1800 defined
control words and gives you the ability to create your own.
Ideal output for the above sample would be something along these lines:
<file>
<controlWord>rtf1</controlWord>
<controlWord>ansi</controlWord>
<controlWord>ansicpg1252</controlWord>
<controlWord>deff0</controlWord>
<controlWord>deflang1033</controlWord>
<controlGroup>
<name>fonttbl</name>
<controlGroup>
<name>f0</name>
<controlWord>froman</controlWord>
<controlWord>fprq2</controlWord>
<controlWord>fcharset0</controlWord>
<data>Times New Roman;</data>
</controlGroup>
<controlGroup>
<name>f1</name>
<controlWord>fswiss</controlWord>
<controlWord>fcharset0</controlWord>
<data>Arial;</data>
</controlGroup>
</controlGroup>
<controlGroup>
<name>generator</name>
<data>Msftedit 5.41.15.1515;</data>
</controlGroup>
<controlWord>viewkind4</controlWord>
<controlWord>uc1</controlWord>
<controlWord>pard</controlWord>
<controlWord>f0</controlWord>
<controlWord>fs24</controlWord>
<text>This is an example document of an RTF file.</text>
<controlWord>f1</controlWord>
<controlWord>fs20</controlWord>
<controlWord>par</controlWord>
<controlGroup>
<name>passwordhash</name>
<data>010000004c000000010000000480000050c3. . .</data>
</controlGroup>
</file>
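
To make the target structure concrete, here is a small Python sketch,
independent of DFDL, that applies the termination rules described above
and produces nested groups similar to the ideal output. The names TOKEN
and parse_rtf, the tuple labels and the sample string are assumptions for
illustration. The sketch ignores RTF escapes such as "\\", "\{" and "\}",
folds a leading "\*\" into the control word that follows it, and uses
plain nested lists instead of separate name elements.

```python
import re

# One token per match: "{", "}", a control word, or a run of data.
TOKEN = re.compile(
    r"(?P<open>\{)"
    r"|(?P<close>\})"
    # "\" (optionally "\*\"), then the word, then at most one delimiting space.
    r"|\\(?:\*\\)?(?P<word>[^\\{}\s]+) ?"
    # Anything else up to the next "\", "{" or "}" is data.
    r"|(?P<text>[^\\{}]+)"
)

def parse_rtf(rtf):
    """Return nested lists: each control group becomes a list of items."""
    root = []
    stack = [root]  # stack[-1] is the group currently being filled
    for match in TOKEN.finditer(rtf):
        kind = match.lastgroup
        if kind == "open":
            group = []
            stack[-1].append(group)
            stack.append(group)
        elif kind == "close":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()
        elif kind == "word":
            stack[-1].append(("controlWord", match.group("word")))
        else:
            stack[-1].append(("data", match.group("text")))
    if len(stack) != 1:
        raise ValueError("unbalanced '{'")
    return root

# Shortened, hypothetical sample in the style of the file shown earlier.
sample = (r"{\rtf1\ansi{\fonttbl{\f0\froman Times New Roman;}}"
          r"{\*\generator Msftedit;}\pard\fs24 Hello.\par}")
print(parse_rtf(sample))
```

The explicit stack is what a flat separator model cannot express: each
"{" pushes a new group and each "}" pops one, so the nesting depth is
unbounded.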
IBM Unsupported Features:
http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/index.jsp?topic=%2Fcom.ibm.dfdl.editor.messagebroker.doc%2Fdf00150_.html
I know that's a lot of info out of left field, but I wanted to try and
explain it as thoroughly as possible to avoid any confusion. Thanks in
advance for any advice you might have and let me know if I've been unclear
in any areas.
Bradley Sexton
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU