Hi Bradley
Yes, dfdl:lengthKind="pattern" is the ideal way to model this.
Without it, I'm struggling to find a way to model this that preserves
the nested groups and separates the trailing data from the control word.
However, if you were prepared to lose the group structure and treat the
trailing data as part of the control word, then you could model a
completely flat structure with the various delimiters interpreted as
prefix separators.
dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\" dfdl:separatorPosition="prefix"
That would give you an infoset like:
<file>
<controlWord>rtf1</controlWord>
<controlWord>ansi</controlWord>
<controlWord>ansicpg1252</controlWord>
<controlWord>deff0</controlWord>
<controlWord>deflang1033</controlWord>
<controlWord>fonttbl</controlWord>
<controlWord>f0</controlWord>
<controlWord>froman</controlWord>
<controlWord>fprq2</controlWord>
<controlWord>fcharset0 Times New Roman;</controlWord>
<controlWord>f1</controlWord>
<controlWord>fswiss</controlWord>
<controlWord>fcharset0 Arial;</controlWord>
<controlWord>*</controlWord>
<controlWord>generator Msftedit 5.41.15.1515;</controlWord>
<controlWord>viewkind4</controlWord>
<controlWord>uc1</controlWord>
<controlWord>pard</controlWord>
<controlWord>f0</controlWord>
<controlWord>fs24 This is an example document of an RTF file.</controlWord>
<controlWord>f1</controlWord>
<controlWord>fs20</controlWord>
<controlWord>par</controlWord>
<controlWord>*</controlWord>
<controlWord>passwordhash 010000004c000000010000000480000050c3. . .</controlWord>
</file>
Not ideal. I'll carry on thinking about
the problem.
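For illustration only, the flat prefix-separator idea can be mimicked outside DFDL with a few lines of Python. This is a rough sketch, not how a DFDL processor works internally: the regular expression is just one way of writing the eight separators listed above (up to three '}', an optional '{', then '\'), the function name flat_control_words is hypothetical, and treating the closing braces at the very end of the file as a terminator is an assumption that the separator list does not cover.

```python
import re

# The eight separators (\  }\  }}\  }}}\  {\  }{\  }}{\  }}}{\) written as
# one pattern: zero to three '}', an optional '{', then a backslash.
SEPARATOR = re.compile(r"\}{0,3}\{?\\")


def flat_control_words(rtf):
    """Split RTF text into a flat list of 'control word + trailing data' items."""
    # Assumption: the closing braces at the end of the file act as a terminator.
    body = rtf.rstrip().rstrip("}")
    pieces = SEPARATOR.split(body)
    # With prefix separators, nothing precedes the first separator, so the
    # first piece is empty and is dropped.
    return pieces[1:]


SAMPLE = (r"{\rtf1\ansi{\fonttbl{\f0\froman Times New Roman;}"
          r"{\f1\fswiss Arial;}}{\*\generator Msftedit;}\pard Hello.\par}")

for item in flat_control_words(SAMPLE):
    print("<controlWord>%s</controlWord>" % item)
```

As in the infoset above, trailing data stays attached to its control word and the group nesting is lost. Escaped characters inside data (such as a literal backslash or brace) are not handled by this sketch.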
If you like, I'll add you to the invite list for the DFDL WG call next
Tuesday so that we can discuss this further.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Bradley Sexton <bradley.r.sexton@gmail.com>
To: dfdl-wg@ogf.org
Date: 23/02/2012 19:07
Subject: [DFDL-WG] DFDL Modeling Question
Sent by: dfdl-wg-bounces@ogf.org
Hello,
I've been looking at modeling Rich Text Format (RTF) files
using the IBM Message Broker DFDL implementation, and ran into an issue.
For some background, here's a small example of an RTF file:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0
Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs24
This is an example document of an RTF file.\f1\fs20\par{\*\passwordhash
010000004c000000010000000480000050c300001400000010000000f89c360d0c9d360d000000008bc29e2f78a2144122ed68a1701e2ea50bbbbeaf7333c40dfe048ccf55f709b8cc7e8b49}}
'\' and '\*\' mark the beginning of control words, and the curly braces
mark the beginning and end of control groups that contain control words
and data. My issue is that control words and data do not have suitable
terminators for parsing. The end of a control word is signified by a space
when trailing data is present, but typically it is ended by a '\' signalling
the beginning of a new word, or by a curly brace signalling the end of the
current control group or the beginning of a new one. Similarly, data is
typically ended by the '}' of the parent control group.
With the exception of a small header, the value and placement of control
words, groups, and data vary by file.
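The termination rules just described can be written down as a small tokenizer. The following Python sketch is only an illustration of those rules, not a DFDL model: the token names (open, close, word, data) and the function name tokenize are made up for this example, and escaped characters inside data are ignored.

```python
import re

# One alternative per kind of item described above.
TOKEN = re.compile(r"""
    (?P<open>\{)                  # start of a control group
  | (?P<close>\})                 # end of a control group
  | \\(?P<word>[^\\{}\s]+)[ ]?    # control word: ends at '\', '{', '}' or one space
  | (?P<data>[^\\{}]+)            # data: runs until the next '\', '{' or '}'
""", re.VERBOSE)


def tokenize(rtf):
    """Return (kind, text) pairs for the braces, control words and data."""
    tokens = []
    for match in TOKEN.finditer(rtf):
        kind = match.lastgroup
        text = match.group(kind).strip()
        if kind == "data" and not text:
            continue  # whitespace-only data, e.g. a line break, is dropped
        tokens.append((kind, text))
    return tokens


for token in tokenize(r"{\f1\fswiss\fcharset0 Arial;}"):
    print(token)
```

Note that no token carries its own terminator: each one ends only where the next item begins, which is why initiator/terminator style delimiters are hard to apply here.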
My issue with modeling this is that I was going to use
dfdl:lengthKind="pattern" in lieu of suitable delimiters, but
this feature is not implemented by IBM. I'm looking for an alternative
way to model the data, and was hoping someone on the mailing list might
have suggestions. My goal is to model control words and groups in as general
a manner as possible given IBM's implementation restrictions, since
RTF has over 1800 defined control words and gives you the ability to create
your own.
Ideal output for the above sample would be something along
these lines:
<file>
<controlWord>rtf1</controlWord>
<controlWord>ansi</controlWord>
<controlWord>ansicpg1252</controlWord>
<controlWord>deff0</controlWord>
<controlWord>deflang1033</controlWord>
<controlGroup>
<name>fonttbl</name>
<controlGroup>
<name>f0</name>
<controlWord>froman</controlWord>
<controlWord>fprq2</controlWord>
<controlWord>fcharset0</controlWord>
<data>Times New Roman;</data>
</controlGroup>
<controlGroup>
<name>f1</name>
<controlWord>fswiss</controlWord>
<controlWord>fcharset0</controlWord>
<data>Arial;</data>
</controlGroup>
</controlGroup>
<controlGroup>
<name>generator</name>
<data>Msftedit 5.41.15.1515;</data>
</controlGroup>
<controlWord>viewkind4</controlWord>
<controlWord>uc1</controlWord>
<controlWord>pard</controlWord>
<controlWord>f0</controlWord>
<controlWord>fs24</controlWord>
<text>This is an example document of an RTF file.</text>
<controlWord>f1</controlWord>
<controlWord>fs20</controlWord>
<controlWord>par</controlWord>
<controlGroup>
<name>passwordhash</name>
<data>010000004c000000010000000480000050c3. . .</data>
</controlGroup>
</file>
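To make the target structure concrete, here is a hedged Python sketch that builds a tree of this shape from RTF text. It is a plain recursive-structure builder, not DFDL, and it rests on several assumptions read off the sample output above: the outermost group becomes <file>, the first control word of any nested group becomes its <name>, the '\*' marker is dropped, and data is called <text> at file level and <data> inside a group. The function name parse_rtf is hypothetical, and the input is assumed to have balanced braces.

```python
import re
import xml.etree.ElementTree as ET

# A brace, a control word (with one optional delimiting space), or a run of data.
TOKEN = re.compile(r"[{}]|\\[^\\{}\s]+ ?|[^\\{}]+")


def parse_rtf(rtf):
    """Build a nested <file>/<controlGroup> tree from balanced RTF text."""
    root = ET.Element("file")
    stack = []  # currently open groups; the outermost one is <file> itself
    for tok in TOKEN.findall(rtf):
        if tok == "{":
            if stack:
                stack.append(ET.SubElement(stack[-1], "controlGroup"))
            else:
                stack.append(root)
        elif tok == "}":
            stack.pop()
        elif tok.startswith("\\"):
            word = tok[1:].strip()
            if word == "*":
                continue  # assumption: the '\*' marker is not kept in the tree
            group = stack[-1]
            # Assumption: the first item of a nested group is its name.
            is_name = group is not root and len(group) == 0
            ET.SubElement(group, "name" if is_name else "controlWord").text = word
        elif tok.strip():
            group = stack[-1]
            tag = "text" if group is root else "data"
            ET.SubElement(group, tag).text = tok.strip()
    return root


tree = parse_rtf(r"{\rtf1{\fonttbl{\f0\froman Times New Roman;}}\pard Hi.\par}")
print(ET.tostring(tree, encoding="unicode"))
```

The stack is what preserves the group nesting that a flat, delimiter-only model cannot express.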
IBM Unsupported Features:
http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/index.jsp?topic=%2Fcom.ibm.dfdl.editor.messagebroker.doc%2Fdf00150_.html
I know that's a lot of info out of left field, but I wanted
to try and explain it as thoroughly as possible to avoid any confusion.
Thanks in advance for any advice you might have and let me know if I've
been unclear in any areas.
Bradley Sexton
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg