Round-Tripping Specifications
Bob
Stayton
Sagehill Enterprises
Steve
Ball
Explain
1.8
2008-05-22
SRB
Updated for current implementation.
1.7
2008-02-22
SRB
Added edition.
1.6
2007-10-19
SRB
Added keyword.
1.5
2007-01-05
SRB
Reduce emphasis on WordML, add support for OpenOffice.
1.4
2005-11-11
SRB
Added bibliography.
1.3
2005-10-31
SRB
Added mediaobjectco, imageobjectco, programlistingco, areaspec, area, calloutlist.
1.2
2005-10-13
SRB
Version prior to using revhistory.
This document specifies how DocBook elements are mapped to paragraph and character styles in a word processor. The specifications are used to write conversions between DocBook XML and word processor XML formats, such as Microsoft's WordProcessingML (WordML), OpenOffice's OpenDocument and Apple's Pages.
Introduction
Microsoft Word 2003 introduced WordProcessingML (WordML), an XML vocabulary for Word documents. Since then, other popular word processors have become available that use XML as their data representation, namely Apple's Pages and OpenOffice. Once a Word (or OpenOffice or Pages) document is available as XML, it becomes possible to convert it to DocBook and vice versa using XSL transformations. Such conversions then enable the following.
DocBook content creators write in their familiar word processing application, rather than learning a new XML editing application.
DocBook XML documents can be styled for output using the typesetting features of the word processor.
Word processors have a simple, flat data model; documents consist of paragraphs (and tables) and paragraphs contain text and character spans. All word processors allow styles to be associated with paragraphs and spans.
This specification describes how DocBook elements map to a set of paragraph and character styles. It defines a specific set of style names for which a Word style template can be created. The style names are also used in XSLT template match patterns for conversion. Although originally targeted at MS Word, the system has subsequently been extended to use other word processors, notably Apple's Pages and OpenOffice.
Project goals
The goal of this project is to enable a word processor, such as, but not limited to, Microsoft Word, to be used with DocBook files. The specific goals include:
Enable authoring of basic DocBook documents in the word processor.
Enable importing of basic DocBook XML documents into the word processor.
To meet these goals, the project provides a toolkit that can be immediately put to use. The kit includes:
Templates for Microsoft Word, Apple Pages and OpenOffice with formatting styles attached to the style names.
XSLT stylesheets that convert a word processing document that is authored with the corresponding template into a DocBook XML file.
XSLT stylesheets that convert a DocBook document into a word processing document that can be opened in a word processor.
Why basic DocBook?
This project will never be able to support all DocBook elements and structure. Take, for example, the address element. This element can be used as a block element for metadata. It can also be used as a phrase level element in a block parent, such as the affiliation element. To make matters worse, it can itself contain phrase level markup, such as personname. No word processor allows character styles to be nested.
The project will initially focus on a basic set of commonly used DocBook elements in order to create a useful editing environment that utilises a word processor with DocBook.
One problem facing this conversion project is the sheer number of DocBook elements, over 400 in DocBook 5.0. To support DocBook structural models, several of the elements require more than one paragraph or character style. This would lead to a very long and unwieldy list of styles in the word processor interface. That would make authoring less efficient and discourage users.
Accordingly, this project assumes that authors who need the full set of DocBook elements and structures will use an XML authoring tool that better supports them. This project is focused on authors who wish to write basic DocBook documents using a word processor. Because Microsoft Word is so widespread, it is hoped that this project will help a lot of new DocBook users get started with familiar tools. They can then graduate to more advanced tools as their needs develop.
Project Non-Goals
The following goals are not in the scope of this project:
Support of versions of Word that do not feature reading/writing WordML (XML). That is, all versions prior to Word 11 (Office 2003).
Support of arbitrarily defined styles. This system may expect certain styles to be defined in a particular fashion (in particular, those defining the title of components and divisions).
Mapping elements to styles
Although WordML, OpenDocument and DocBook are all XML, there are several challenges when trying to convert between them.
The basic problem in mapping paragraph/character styles to DocBook elements is that word processor documents support far less structure than DocBook. DocBook permits nesting of elements within other elements, providing multiple levels of context for each element.
Word's only structural feature is the outlining mode. In Word outlining, certain paragraph styles are assigned outline levels. When a user applies those styles, they effectively create logical structure in the Word document. Unfortunately, Word itself attempts to automatically determine which paragraphs are headings, rendering this method unreliable.
Instead of relying on Word's built-in outlining mode, this system uses only the names of paragraph styles to determine document structure. Certain heuristics are applied to build the DocBook element structure from the (relatively flat) word processing structure. Titles and other features are used to mark the beginning of a structure and all paragraphs following that are included in that structure until the beginning of the next structure is found. That is, the beginning of one structure marks the end of the previous structure.
Problems may arise when a structure should end, but there is no word processor feature that marks the endpoint. To mark the end of a feature an empty paragraph is used.
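The heuristic described above, where a title paragraph opens a structure and thereby closes any open structure at the same or a deeper level, can be sketched as follows. This is only an illustration in Python, not the project's XSLT: the `DEPTH` table, the `build_tree` function and the `(style, text)` input format are assumptions made for the example, and every non-title paragraph is simply treated as a para.

```python
import re
import xml.etree.ElementTree as ET

# Assumed nesting depth for each structural element (illustrative subset).
DEPTH = {"article": 0, "sect1": 1, "sect2": 2, "sect3": 3, "sect4": 4, "sect5": 5}
TITLE_STYLE = re.compile(r"(\w+)-title")

# Hypothetical input: (paragraph style, text) pairs in document order.
PARAGRAPHS = [
    ("article-title", "Sample"),
    ("sect1-title", "First"),
    ("para", "Body text."),
    ("sect2-title", "Nested"),
    ("para", "More text."),
    ("sect1-title", "Second"),
    ("para", "Closing text."),
]


def build_tree(paragraphs):
    """Build a nested element tree from a flat sequence of styled paragraphs."""
    root = None
    open_structures = []  # (depth, element) pairs, innermost last
    for style, text in paragraphs:
        match = TITLE_STYLE.fullmatch(style)
        if match and match.group(1) in DEPTH:
            name = match.group(1)
            depth = DEPTH[name]
            # The start of one structure marks the end of every open
            # structure at the same or a deeper level.
            while open_structures and open_structures[-1][0] >= depth:
                open_structures.pop()
            if open_structures:
                element = ET.SubElement(open_structures[-1][1], name)
            elif root is None:
                element = root = ET.Element(name)
            else:
                raise ValueError(f"{style!r} would start a second root element")
            info = ET.SubElement(element, "info")
            ET.SubElement(info, "title").text = text
            open_structures.append((depth, element))
        elif not open_structures:
            raise ValueError(f"{style!r} paragraph occurs before any title")
        else:
            # Simplification: every other paragraph becomes a para.
            ET.SubElement(open_structures[-1][1], "para").text = text
    return root


article = build_tree(PARAGRAPHS)
print(ET.tostring(article, encoding="unicode"))
```

The sketch does not check that levels are not skipped (for example a sect2-title directly inside an article); a real conversion would need to decide how to report that.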
Nesting of block elements is another commonly used feature of DocBook. It is not possible to use Word's outline mode for blocks if it is being used for components and sections. So in this specification, nesting of block elements is indicated by adding a number suffix to a style. So a paragraph with style orderedlist2 is considered to be contained within a preceding paragraph with style orderedlist1 or itemizedlist1. Where appropriate in the word processor, paragraph indent levels are used to visually indicate nesting of blocks.
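As a rough illustration of the number-suffix convention, the following Python sketch splits a list style name into its list type and nesting level, and flags paragraphs whose level is not preceded by a paragraph one level up. The function names and the rule that any non-list paragraph other than para-continue closes all open lists are assumptions made for this example.

```python
import re

# List styles defined by this specification: a list type plus a level 1-4.
LIST_STYLE = re.compile(r"(itemizedlist|orderedlist)([1-4])")


def parse_list_style(style):
    """Return (list type, nesting level) for a list style, else None."""
    match = LIST_STYLE.fullmatch(style)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def find_nesting_errors(styles):
    """Return the indexes of list paragraphs that skip a nesting level."""
    errors = []
    open_level = 0  # deepest list level currently open
    for index, style in enumerate(styles):
        parsed = parse_list_style(style)
        if parsed is None:
            # Assumption: para-continue stays inside the current list item,
            # any other non-list paragraph closes all open lists.
            if style != "para-continue":
                open_level = 0
            continue
        level = parsed[1]
        if level > open_level + 1:
            errors.append(index)
        else:
            open_level = level
    return errors


styles = ["para", "itemizedlist1", "orderedlist2", "orderedlist3", "para", "orderedlist2"]
print(parse_list_style("orderedlist2"))
print(find_nesting_errors(styles))
```

In this sample the final orderedlist2 paragraph is flagged because the preceding para closed the lists, so no level 1 list is open to contain it.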
Nesting of inline DocBook elements is particularly difficult to support because word processors do not nest character styles. That means a nested inline would require a separate character style to indicate the parent-child relationship. Given the large number of combinations possible, a prohibitively large number of character styles would have to be created. In this project, nesting of character styles is not supported. Nested inlines being imported from DocBook will be converted to a sequence of single-name character styles, where possible, or rejected.
In many cases, DocBook structure can be derived from the flat sequence of paragraphs based on sibling relationships. For example, when a paragraph styled as para is followed by a paragraph styled as itemizedlist1, the conversion to DocBook will output a para element and then start an itemizedlist element, with the second paragraph as its first listitem. All itemizedlist1 paragraphs that follow without interruption are inserted into the same itemizedlist element.
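The sibling-based derivation in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the project's XSLT: the `convert` function, the `section` container element and the `(style, text)` input are assumptions, and only the para, Normal and itemizedlist1 styles are handled.

```python
import xml.etree.ElementTree as ET


def convert(paragraphs):
    """Convert (style, text) pairs, coalescing runs of itemizedlist1."""
    parent = ET.Element("section")  # hypothetical container for the output
    current_list = None
    for style, text in paragraphs:
        if style == "itemizedlist1":
            # The first itemizedlist1 paragraph starts the list; the ones
            # that follow without interruption join the same list.
            if current_list is None:
                current_list = ET.SubElement(parent, "itemizedlist")
            item = ET.SubElement(current_list, "listitem")
            ET.SubElement(item, "para").text = text
        elif style in ("para", "Normal"):
            current_list = None  # a para closes any preceding list
            ET.SubElement(parent, "para").text = text
        else:
            raise ValueError(f"style {style!r} is not handled by this sketch")
    return parent


result = convert([
    ("para", "Consider the following:"),
    ("itemizedlist1", "First item"),
    ("itemizedlist1", "Second item"),
    ("para", "Text after the list."),
])
print(ET.tostring(result, encoding="unicode"))
```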
Some combinations of elements cannot be supported (at least not with the techniques as described in this document). An example is informalexample and its permitted content: there is no title to mark the beginning of the element and no marker for the end of the element. There are also too many parent-child combinations to reasonably define style names.
The design principles used in this project for selecting paragraph/character style names are as follows:
Where Word (or OpenOffice or Pages), by default, has a style or feature that corresponds directly to a DocBook element then that style or feature will be used (and documented in this document). For example, the Normal paragraph style maps to a DocBook para element, and a Word table (w:tbl) maps to a DocBook table. In some cases Word may possess a feature that does not function in an acceptable manner; lists are one example. In these cases the feature is to be avoided, and a workaround provided.
Paragraph and character style names will match DocBook element names as much as possible. This will enable authors to learn DocBook element names and help debug problems with conversion.
A style may indicate a parent-child relationship, but the paragraph for such an element may only occur after a paragraph that denotes the beginning of the parent structure. In this case the element name is used as the style name. For example, a personblurb paragraph may only occur after an author, editor or othercontrib paragraph. If a paragraph occurs without the appropriate preceding paragraph, then an error is signalled.
Some styles may also indicate a parent-child relationship, but either the parent structure is ambiguous or the paragraph starts the parent structure. For example, chapter-title indicates that the paragraph is a title element whose DocBook parent is a chapter element.
Some style names are simplified to make them easier to use in the word processor. For example, a paragraph in an orderedlist requires three elements in DocBook: orderedlist, listitem, and para. The paragraph style name in Word is shortened from orderedlist-listitem-para to just orderedlist1 (for a first level list). In the case of lists (see below), the list level is appended, which is why this example becomes orderedlist1.
Style names with a number suffix indicate a nesting level, as described above.
Style names with continue indicate that the paragraph is part of the preceding element. For example, a para paragraph is used for a single paragraph para element. This causes any preceding list to be closed. If a list item in the preceding list is to contain more than one paragraph, then the subsequent paragraphs in the word processor document must use the para-continue style.
Character styles map to elements that are children of the element for the paragraph, hence there is no need to encode parent-child relationships. For example, a surname character style in an author paragraph becomes a surname child element of the author element.
Empty paragraph and character styles are ignored. This can be useful to end structures.
The first paragraph style in the word processor document is used to define the root element of the DocBook document. For example, if the document starts with book-title, then the DocBook document will have a book element as its root element. All the rest of the document content will be contained in that root element.
Sequential structures are coalesced into a single parent element. For example, a sequence of itemizedlist1 paragraphs becomes a single itemizedlist element with several listitem element children.
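The character style principle in the list above (a character style becomes a child element of the paragraph's element, and empty character styles are ignored) can be illustrated with a short Python sketch. The `paragraph_to_element` function and the representation of a paragraph as a list of `(character style, text)` runs, with `None` for unstyled text, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def paragraph_to_element(paragraph_style, runs):
    """Turn one styled paragraph into an element with child elements.

    runs is a list of (character style or None, text) pairs.
    """
    element = ET.Element(paragraph_style)
    last_child = None
    for character_style, text in runs:
        if character_style is None:
            # Unstyled text goes before the first child, or after the
            # most recent child as its tail.
            if last_child is None:
                element.text = (element.text or "") + text
            else:
                last_child.tail = (last_child.tail or "") + text
        elif text:
            # A styled run becomes a child named after the character style.
            last_child = ET.SubElement(element, character_style)
            last_child.text = text
        # Styled runs with no text are ignored.
    return element


author = paragraph_to_element(
    "author",
    [("firstname", "Bob"), (None, " "), ("surname", "Stayton")],
)
print(ET.tostring(author, encoding="unicode"))
# prints <author><firstname>Bob</firstname> <surname>Stayton</surname></author>
```

Because the parent is always the paragraph's own element, no parent-child relationship needs to be encoded in the character style name.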
DocBook to Paragraph/Character Styles
DocBook element
Style(s)
Comments
Components and sections
book/info/title
book-title
book/info/subtitle
book-subtitle
book/info/titleabbrev
book-titleabbrev
chapter/info/title
chapter-title
Assigned Word outline level 1.
chapter/info/subtitle
chapter-subtitle
chapter/info/titleabbrev
chapter-titleabbrev
appendix/info/title
appendix-title
Assigned Word outline level 1.
preface/info/title
preface-title
Assigned Word outline level 1.
article/info/title
article-title
Assigned Word outline level 1.
article/info/subtitle
article-subtitle
article/info/titleabbrev
article-titleabbrev
bibliography/info/title
bibliography-title
Assigned Word outline level 1.
bibliography/bibliodiv/info/title
bibliodiv-title
biblioentry/title
biblioentry-title
Metadata elements after the biblioentry-title paragraph become part of the biblioentry.
glossary/info/title
glossary-title
Assigned Word outline level 1.
index/info/title
index-title
Assigned Word outline level 1.
part/info/title
part-title
section
Unnumbered section elements are translated into their equivalent numbered paragraph style. Sections 6 levels and deeper are reported as an error.
sect1/info/title
sect1-title
Assigned Word outline level 2.
sect1/info/subtitle
sect1-subtitle
sect2/info/title
sect2-title
Assigned Word outline level 3.
sect2/info/subtitle
sect2-subtitle
sect3/info/title
sect3-title
Assigned Word outline level 4.
sect3/info/subtitle
sect3-subtitle
sect4/info/title
sect4-title
Assigned Word outline level 5.
sect4/info/subtitle
sect4-subtitle
sect5/info/title
sect5-title
Assigned Word outline level 6.
sect5/info/subtitle
sect5-subtitle
simplesect/info/title
simplesect-title
simplesect/info/subtitle
simplesect-subtitle
bridgehead
bridgehead
Metadata elements
abstract/title
abstract-title
abstract/para
abstract
affiliation
affiliation
address
address
author
author
date
date
edition
edition
legalnotice
legalnotice
pubdate
pubdate
publisher/publishername
publisher
publisher/address
publisher-address
revhistory/revision
revision
Block-level elements
para
para, Normal
Any Word paragraph with style Normal will also be converted to a para element.
formalpara/title
formalpara-title
formalpara/para
formalpara
simpara
simpara
note/title
note-title
note/para
note
Consecutive paragraphs with style note after the first note are to be treated as part of the same note element. That is, consecutive notes are coalesced. The note may or may not have a title.
caution/title
caution-title
caution/para
caution
Consecutive cautions are coalesced.
warning/title
warning-title
warning/para
warning
Consecutive warnings are coalesced.
important/title
important-title
important/para
important
Consecutive importants are coalesced.
tip/title
tip-title
tip/para
tip
Consecutive tips are coalesced.
itemizedlist/listitem/para
itemizedlist1
itemizedlist2
itemizedlist3
itemizedlist4
A number suffix indicates a nesting level within other lists.
orderedlist/listitem/para
orderedlist1
orderedlist2
orderedlist3
orderedlist4
listitem/para[position() != 1]
para-continue
This paragraph is included in the immediately preceding listitem.
example/title
example-title
All content following the title is included in the example element. The end of the example content is marked by a caption paragraph or an empty paragraph if there is no caption.
figure/title
figure-title
All content following the title is included in the figure element. Metadata must immediately follow the title. The end of the figure content is marked by a caption paragraph or an empty paragraph if there is no caption.
informalfigure/mediaobject/imageobject/imagedata/@fileref
informalfigure-imagedata, caption
The content of the informalfigure-imagedata paragraph is taken as the URI for the image. Metadata may immediately follow the paragraph.
mediaobject/imageobject/imagedata/@fileref
imageobject-imagedata, caption
The content of the imageobject-imagedata paragraph is taken as the URI for the image. May be followed by a caption style paragraph. Metadata may immediately follow the paragraph, before the caption, if any.
table
Word table, caption
table/title
table-title, caption
Metadata may immediately follow the paragraph.
informaltable
Word table
A table with no title immediately preceding it.
caption
caption
literallayout
literallayout
Inside a literallayout paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter).
programlisting
programlisting
Inside a programlisting paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter). Tabs are not supported.
blockquote/title
blockquote-title
Must immediately precede a blockquote paragraph in Word.
blockquote/para
blockquote
blockquote/attribution
blockquote-attribution
Must immediately follow a blockquote paragraph in Word.
bibliomisc
bibliomisc
Non-DocBook elements
xi:include
xinclude
The content of the paragraph becomes the value of the href attribute.
Inline elements
emphasis
emphasis
emphasis/@role="bold"
emphasis-bold
emphasis/@role="underline"
emphasis-underline
footnote
Word footnote
link
link
In Word, hyperlink properties identify the DocBook linkend.
releaseinfo
releaseinfo
surname
surname
Character style. Must occur in an appropriate parent paragraph, such as author or editor.
firstname
firstname
Character style. Must occur in an appropriate parent paragraph, such as author or editor.
orgname
orgname
keywordset/keyword
keyword
Paragraph style. Consecutive keyword elements are merged into a single keywordset parent element. Words (phrases) within a paragraph separated by commas become individual keyword elements.
citetitle
citetitle
city
city
contrib
contrib
country
country
email
email
fax
fax
honorific
honorific
jobtitle
jobtitle
lineage
lineage
orgdiv
orgdiv
otheraddr
otheraddr
othername
othername
phone
phone
pob
pob
postcode
postcode
shortaffil
shortaffil
state
state
Proposed Additions - not yet implemented
DocBook element
Style(s)
Comments
variablelist/varlistentry/term
variablelist1-term
variablelist2-term
variablelist3-term
variablelist4-term
A variablelist in Word should be a sequence of alternating paragraphs styled as variablelistN-term and variablelistN.
variablelist/varlistentry/listitem/para
variablelist1
variablelist2
variablelist3
variablelist4
Consecutive paragraphs are coalesced.
Attributes
Attributes are a feature of DocBook XML that have no direct counterpart in Word.
XML attributes are encoded in Word comments (annotations). Some dummy text (just a space, using a character style that includes the hidden property) anchors the comment. Within the comment text, character types are used to indicate attribute names and values (these must be paired). This approach keeps the attributes separate from the main body and allows multiple attributes to be encoded.
A disadvantage to this approach is that a paragraph may be related to more than one element, but the attributes are associated with only one element (by default the parent). For example, a section may have an attribute as well as the title child element, but only a single paragraph (with paragraph style sect1-title) represents both elements. Any attribute defined in a comment would be associated with the sect1 element.
Pages does not have annotations, so the character styles attribute-name and attribute-value are used.
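The requirement that attribute names and values be paired can be sketched as follows. This Python fragment is only an illustration: it assumes the attribute text has already been extracted as a list of `(character style, text)` runs using the attribute-name and attribute-value styles named above, and the `pair_attributes` function is hypothetical.

```python
def pair_attributes(runs):
    """Collect attribute-name / attribute-value runs into a dictionary."""
    attributes = {}
    pending_name = None
    for style, text in runs:
        if style == "attribute-name":
            if pending_name is not None:
                raise ValueError(f"attribute {pending_name!r} has no value")
            pending_name = text.strip()
        elif style == "attribute-value":
            if pending_name is None:
                raise ValueError(f"value {text!r} has no attribute name")
            attributes[pending_name] = text
            pending_name = None
        # Runs in any other style are not part of an attribute.
    if pending_name is not None:
        raise ValueError(f"attribute {pending_name!r} has no value")
    return attributes


runs = [
    ("attribute-name", "id"),
    ("attribute-value", "intro"),
    ("attribute-name", "role"),
    ("attribute-value", "draft"),
]
print(pair_attributes(runs))
```

The resulting dictionary would then be applied to a single element, by default the parent (for example sect1 rather than its title).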