Round-Tripping Specifications
    
      Bob
      Stayton
      
        Sagehill Enterprises
      
    
    
      Steve
      Ball
      
        Explain
      
    
    
      
        1.8
        2008-05-22
        SRB
        Updated for current implementation.
      
      
        1.7
        2008-02-22
        SRB
        Added edition.
      
      
        1.6
        2007-10-19
        SRB
        Added keyword.
      
      
        1.5
        2007-01-05
        SRB
        Reduce emphasis on WordML, add support for OpenOffice.
      
      
        1.4
        2005-11-11
        SRB
        Added bibliography.
      
      
        1.3
        2005-10-31
        SRB
        Added mediaobjectco, imageobjectco, programlistingco, areaspec, area, calloutlist.
      
      
        1.2
        2005-10-13
        SRB
        Version prior to using revhistory.
      
    
  
  
    This document specifies how DocBook elements are mapped to paragraph and character styles in a word processor.  The specifications are used to write conversions between DocBook XML and word processor XML formats, such as Microsoft's WordProcessingML (WordML), OpenOffice's OpenDocument and Apple's Pages.
  
  
    Introduction
    Microsoft Word 2003 introduced WordProcessingML (WordML), an XML vocabulary for Word documents.  Since then, other popular word processors have become available that use XML as their data representation, namely Apple's Pages and OpenOffice.  By converting Word (or OpenOffice or Pages) to XML, it becomes possible to convert a word processing document to DocBook and vice versa using XSL transformations. Such conversions then enable the following.
    
      
        DocBook content creators can write in their familiar word processing application, rather than learning a new XML editing application.
      
      
        DocBook XML documents can be styled for output using the typesetting features of the word processor.
      
    
    Word processors have a simple, flat data model; documents consist of paragraphs (and tables) and paragraphs contain text and character spans.  All word processors allow styles to be associated with paragraphs and spans.
    This specification describes how DocBook elements map to a set of paragraph and character styles. It defines a specific set of style names for which a Word style template can be created. The style names are also used in XSLT template match patterns for conversion.  Although originally targeted at MS Word, the system has subsequently been extended to other word processors, notably Apple's Pages and OpenOffice.
  
  
    Project goals
    The goal of this project is to enable a word processor, such as, but not limited to, Microsoft Word, to be used with DocBook files.  The specific goals include:
    
      
        Enable authoring of basic DocBook documents in the word processor.
      
      
        Enable importing of basic DocBook XML documents into the word processor.
      
    
    To meet these goals, the project provides a toolkit that can be immediately put to use.  The kit includes:
    
      
        Templates for Microsoft Word, Apple Pages and OpenOffice with formatting styles attached to the style names.
      
      
        XSLT stylesheets that convert a word processing document that is authored with the corresponding template into a DocBook XML file.
      
      
        XSLT stylesheets that convert a DocBook document into a word processing document that can be opened in a word processor.
      
    
    
      Why basic DocBook?
      This project will never be able to support all DocBook elements and structure. Take, for example, the address element. This element can be used both as a block element for metadata and as a phrase level element in a block parent, such as the affiliation element. To make matters worse, it can itself contain phrase level markup, such as personname. No word processor allows character styles to be nested.
      The project will initially focus on a basic set of commonly used DocBook elements in order to create a useful editing environment that utilises a word processor with DocBook. 
      One problem facing this conversion project is the sheer number of DocBook elements, over 400 in DocBook 5.0. To support DocBook structural models, several of the elements require more than one paragraph or character style. This would lead to a very long and unwieldy list of styles in the word processor interface, which would make authoring less efficient and discourage users.
      Accordingly, this project assumes that authors who need the full set of DocBook elements and structures will use an XML authoring tool that better supports them. This project is focused on authors who wish to write basic DocBook documents using a word processor. Because Microsoft Word is so widespread, it is hoped that this project will help a lot of new DocBook users get started with familiar tools.  They can then graduate to more advanced tools as their needs develop.
    
  
  
    Project Non-Goals
    The following goals are not in the scope of this project:
    
      
        Support of versions of Word that do not feature reading/writing WordML (XML).  That is, all versions prior to Word 11 (Office 2003).
      
      
        Support of arbitrarily defined styles.  This system may expect certain styles to be defined in a particular fashion (in particular, those defining the title of components and divisions).
      
    
  
  
    Mapping elements to styles
    Although WordML, OpenDocument and DocBook are all XML, there are several challenges when trying to convert between them.
    The basic problem in mapping paragraph/character styles to DocBook elements is that word processor documents support far less structure than DocBook.  DocBook permits nesting of elements within other elements, providing multiple levels of context for each element.
    Word's only structural feature is the outlining mode. In Word outlining, certain paragraph styles are assigned outline levels.  When a user applies those styles, they effectively create logical structure in the Word document.  Unfortunately, Word itself attempts to automatically determine which paragraphs are headings, rendering this method unreliable.
    Instead of relying on Word's built-in outlining mode, this system uses only the names of paragraph styles to determine document structure.  Certain heuristics are applied to build the DocBook element structure from the (relatively flat) word processing structure.  Titles and other features are used to mark the beginning of a structure and all paragraphs following that are included in that structure until the beginning of the next structure is found. That is, the beginning of one structure marks the end of the previous structure.
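    The heuristic described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's XSLT implementation: the input is assumed to be a flat list of (style, text) pairs, the TITLE_LEVELS table is a hypothetical subset of the title styles, the book wrapper is assumed, and the title is placed directly under its parent rather than inside an info element to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of title styles and the depth each one opens.
TITLE_LEVELS = {"chapter-title": 1, "sect1-title": 2, "sect2-title": 3}


def build_structure(paragraphs):
    """Nest a flat list of (style, text) pairs under an assumed book root.

    A title style opens a new element and implicitly closes every open
    element at the same or a deeper level.
    """
    root = ET.Element("book")
    # Stack of (level, element); the root sits at level 0.
    stack = [(0, root)]
    for style, text in paragraphs:
        level = TITLE_LEVELS.get(style)
        if level is not None:
            # The beginning of this structure ends the previous ones.
            while stack[-1][0] >= level:
                stack.pop()
            name = style[: -len("-title")]
            element = ET.SubElement(stack[-1][1], name)
            ET.SubElement(element, "title").text = text
            stack.append((level, element))
        else:
            # Any other paragraph joins the innermost open structure.
            ET.SubElement(stack[-1][1], "para").text = text
    return root


flat = [
    ("chapter-title", "Intro"),
    ("para", "First."),
    ("sect1-title", "Details"),
    ("para", "Second."),
    ("chapter-title", "Next"),
]
doc = build_structure(flat)
```

    In this sketch the second chapter-title paragraph closes both the open sect1 and the first chapter, so no explicit end marker is needed.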
    Problems may arise when a structure should end, but there is no word processor feature that marks the endpoint. In that case, an empty paragraph is used to mark the end of the structure.
    Nesting of block elements is another commonly used feature of DocBook.  It is not possible to use Word's outline mode for blocks if it is being used for components and sections.  So in this specification, nesting of block elements is indicated by adding a number suffix to a style. So a paragraph with style orderedlist2 is considered to be contained within a preceding paragraph with style orderedlist1 or itemizedlist1. Where appropriate in the word processor, paragraph indent levels are used to visually indicate nesting of blocks.
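    As a minimal sketch of the number-suffix convention, the following splits a list style name into its element name and nesting level. The function names are hypothetical, and the validity rule (a level N paragraph needs a preceding list paragraph at level N-1 or deeper) is an assumption drawn from the orderedlist2 example above, not a rule stated by this specification.

```python
import re

# List styles carry a nesting level suffix, e.g. "orderedlist2".
# Levels 1 to 4 are the ones listed in the style table.
_LIST_STYLE = re.compile(r"^(itemizedlist|orderedlist)([1-4])$")


def split_list_style(style):
    """Return (element_name, level) for a list style, or None otherwise."""
    match = _LIST_STYLE.match(style)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def nesting_is_valid(previous_style, style):
    """Check that a nested list paragraph has a list paragraph to nest in.

    Assumption: level N (N > 1) requires the preceding paragraph to be a
    list paragraph at level N-1 or deeper.
    """
    current = split_list_style(style)
    if current is None or current[1] == 1:
        return True
    previous = split_list_style(previous_style)
    return previous is not None and previous[1] >= current[1] - 1
```

    For example, nesting_is_valid("itemizedlist1", "orderedlist2") holds, while an orderedlist2 paragraph directly after a para paragraph does not pass the check.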
    Nesting of inline DocBook elements is particularly difficult to support because word processors do not nest character styles. That means a nested inline would require a separate character style to indicate the parent-child relationship. Given the large number of combinations possible, a prohibitively large number of character styles would have to be created. In this project, nesting of character styles is not supported. Nested inlines being imported from DocBook will be converted to a sequence of single-name character styles, where possible, or rejected.
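    One possible way to flatten nested inlines on import is sketched below. This specification does not say which style wins when inlines are nested; the sketch assumes that the innermost element name becomes the single character style and the outer name is dropped. The function name and the sample paragraph are illustrative only.

```python
import xml.etree.ElementTree as ET


def flatten_inlines(para):
    """Return a flat list of (character_style, text) runs for a para.

    Assumption: text inside nested inlines takes the innermost element's
    name as its character style. Plain text has the style None.
    """
    runs = []

    def walk(element, style):
        if element.text:
            runs.append((style, element.text))
        for child in element:
            walk(child, child.tag)
            # Tail text belongs to the enclosing element, not the child.
            if child.tail:
                runs.append((style, child.tail))

    walk(para, None)
    return runs


para = ET.fromstring(
    "<para>See <emphasis>the <citetitle>Guide</citetitle> now</emphasis>.</para>"
)
runs = flatten_inlines(para)
```

    Here the citetitle text loses its emphasis parent, which is the kind of information loss that makes full support for nested inlines impractical.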
    In many cases, DocBook structure can be derived from the flat sequence of paragraphs based on sibling relationships. For example, when a paragraph styled as para is followed by a paragraph styled as itemizedlist1, the conversion to DocBook will output a para element and then start an itemizedlist element, with the second paragraph as its first listitem. All itemizedlist1 paragraphs that follow without interruption are inserted into the same itemizedlist element.
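    The sibling-based derivation above can be illustrated with a short sketch. It assumes a flat list of (style, text) pairs, handles only the para, itemizedlist1 and para-continue styles, and wraps the result in an assumed section element. Treating a stray para-continue as a plain para is also an assumption.

```python
import xml.etree.ElementTree as ET


def convert_blocks(paragraphs):
    """Convert (style, text) pairs into para and itemizedlist elements."""
    parent = ET.Element("section")
    current_list = None  # the itemizedlist being filled, if any
    for style, text in paragraphs:
        if style == "itemizedlist1":
            # Start a list on the first item; reuse it for later items.
            if current_list is None:
                current_list = ET.SubElement(parent, "itemizedlist")
            item = ET.SubElement(current_list, "listitem")
            ET.SubElement(item, "para").text = text
        elif style == "para-continue" and current_list is not None:
            # Extra paragraph inside the immediately preceding listitem.
            ET.SubElement(current_list[-1], "para").text = text
        else:
            # Any other paragraph interrupts and closes the list.
            current_list = None
            ET.SubElement(parent, "para").text = text
    return parent


result = convert_blocks([
    ("para", "Intro"),
    ("itemizedlist1", "One"),
    ("para-continue", "More"),
    ("itemizedlist1", "Two"),
    ("para", "After"),
])
```

    The two itemizedlist1 paragraphs end up in one itemizedlist element, and the final para paragraph closes that list.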
    Some combinations of elements cannot be supported (at least not with the techniques described in this document).  An example is informalexample and its permitted content: there is no title to mark the beginning of the element and no marker for the end of the element, and there are too many parent-child combinations to reasonably define style names.
    The design principles used in this project for selecting paragraph/character style names are as follows:
    
      
        Where Word (or OpenOffice or Pages), by default, has a style or feature that corresponds directly to a DocBook element then that style or feature will be used (and documented in this document).  For example, the Normal paragraph style maps to a DocBook para element, and a Word table (w:tbl) maps to a DocBook table.  In some cases Word may possess a feature that does not function in an acceptable manner; lists are one example.  In these cases the feature is avoided and a workaround is provided.
      
      
        Paragraph and character style names will match DocBook element names as much as possible. This will enable authors to learn DocBook element names and help debug problems with conversion.
      
      
        A style may indicate a parent-child relationship, but the paragraph for such an element may only occur after a paragraph that denotes the beginning of the parent structure.  In this case the element name is used as the style name.  For example, a personblurb paragraph may only occur after an author, editor or othercontrib paragraph.  If a paragraph occurs without the appropriate preceding paragraph, then an error is signalled.
      
      
        Some styles may also indicate a parent-child relationship, but either the parent structure is ambiguous or the paragraph starts the parent structure.  For example, chapter-title indicates that the paragraph is a title element whose DocBook parent is a chapter element.
      
      
        Some style names are simplified to make them easier to use in the word processor. For example, a paragraph in an orderedlist requires three elements in DocBook: orderedlist, listitem, and para. The paragraph style name in Word is shortened from orderedlist-listitem-para to just orderedlist1 (for a first level list). In the case of lists (see below), the list level is appended, which is why this example becomes orderedlist1.
      
      
        Style names with a number suffix indicate a nesting level, as described above.
      
      
        Style names with continue indicate that the paragraph is part of the preceding element. For example, a para paragraph is used for a single paragraph para element. This causes any preceding list to be closed. If a list item in the preceding list is to contain more than one paragraph, then the subsequent paragraphs in the word processor document must use the para-continue style.
      
      
        Character styles map to elements that are children of the element for the paragraph, hence there is no need to encode parent-child relationships.  For example, a surname character style in an author paragraph becomes a surname child element of the author element.
      
      
        Empty paragraph and character styles are ignored. This can be useful to end structures.
      
      
        The first paragraph style in the word processor document is used to define the root element of the DocBook document. For example, if the document starts with book-title, then the DocBook document will have a book element as its root element.  All the rest of the document content will be contained in that root element.
      
    
    Sequential structures are coalesced into a single parent element.  For example, a sequence of itemizedlist1 paragraphs becomes a single itemizedlist element with several listitem element children.
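    The principle that the first paragraph style defines the root element can be sketched in a few lines. The function name is hypothetical, and raising an error when the first style is not a title style is an assumption; this specification does not say how that case is handled.

```python
def root_element_name(paragraphs):
    """Derive the DocBook root element from the first paragraph's style.

    `paragraphs` is a list of (style, text) pairs in document order.
    """
    if not paragraphs:
        raise ValueError("empty document")
    first_style = paragraphs[0][0]
    suffix = "-title"
    if not first_style.endswith(suffix):
        # Assumed behaviour: refuse styles that do not name a parent.
        raise ValueError(
            f"cannot derive a root element from style {first_style!r}"
        )
    # "book-title" names a title whose parent, and root, is book.
    return first_style[: -len(suffix)]
```

    With this sketch, a document starting with a book-title paragraph yields book, and one starting with article-title yields article.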
    
      DocBook to Paragraph/Character Styles
      
        
        
        
        
          
            
              DocBook element
            
            
              Style(s)
            
            
              Comments
            
          
        
        
          
            
              
                Components and sections
              
            
          
          
            
              book/info/title
            
            
              book-title
            
            
              
            
          
          
            
              book/info/subtitle
            
            
              book-subtitle
            
            
              
            
          
          
            
              book/info/titleabbrev
            
            
              book-titleabbrev
            
            
              
            
          
          
            
              chapter/info/title
            
            
              chapter-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              chapter/info/subtitle
            
            
              chapter-subtitle
            
            
              
            
          
          
            
              chapter/info/titleabbrev
            
            
              chapter-titleabbrev
            
            
              
            
          
          
            
              appendix/info/title
            
            
              appendix-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              preface/info/title
            
            
              preface-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              article/info/title
            
            
              article-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              article/info/subtitle
            
            
              article-subtitle
            
            
              
            
          
          
            
              article/info/titleabbrev
            
            
              article-titleabbrev
            
            
              
            
          
          
            
              bibliography/info/title
            
            
              bibliography-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              bibliography/bibliodiv/info/title
            
            
              bibliodiv-title
            
            
              
            
          
          
            
              biblioentry/title
            
            
              biblioentry-title
            
            
              Metadata elements after the biblioentry-title paragraph become part of the biblioentry.
            
          
          
            
              glossary/info/title
            
            
              glossary-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              index/info/title
            
            
              index-title
            
            
              Assigned Word outline level 1.
            
          
          
            
              part/info/title
            
            
              part-title
            
            
              
            
          
          
            
              section
            
            
              
            
            
              Unnumbered section elements are translated into their equivalent numbered paragraph style. Section elements nested six or more levels deep are reported as an error.
            
          
          
            
              sect1/info/title
            
            
              sect1-title
            
            
              Assigned Word outline level 2.
            
          
          
            
              sect1/info/subtitle
            
            
              sect1-subtitle
            
            
              
            
          
          
            
              sect2/info/title
            
            
              sect2-title
            
            
              Assigned Word outline level 3.
            
          
          
            
              sect2/info/subtitle
            
            
              sect2-subtitle
            
            
              
            
          
          
            
              sect3/info/title
            
            
              sect3-title
            
            
              Assigned Word outline level 4.
            
          
          
            
              sect3/info/subtitle
            
            
              sect3-subtitle
            
            
              
            
          
          
            
              sect4/info/title
            
            
              sect4-title
            
            
              Assigned Word outline level 5.
            
          
          
            
              sect4/info/subtitle
            
            
              sect4-subtitle
            
            
              
            
          
          
            
              sect5/info/title
            
            
              sect5-title
            
            
              Assigned Word outline level 6.
            
          
          
            
              sect5/info/subtitle
            
            
              sect5-subtitle
            
            
              
            
          
          
            
              simplesect/info/title
            
            
              simplesect-title
            
            
              
            
          
          
            
              simplesect/info/subtitle
            
            
              simplesect-subtitle
            
            
              
            
          
          
            
              bridgehead
            
            
              bridgehead
            
            
              
            
          
          
            
              
                Metadata elements
              
            
          
          
            
              abstract/title
            
            
              abstract-title
            
          
          
            
              abstract/para
            
            
              abstract
            
            
              
            
          
          
            
              affiliation
            
            
              affiliation
            
            
              
            
          
          
            
              address
            
            
              address
            
            
              
            
          
          
            
              author
            
            
              author
            
            
              
            
          
          
            
              date
            
            
              date
            
            
              
            
          
          
            
              edition
            
            
              edition
            
            
              
            
          
          
            
              legalnotice
            
            
              legalnotice
            
            
              
            
          
          
            
              pubdate
            
            
              pubdate
            
            
              
            
          
          
            
              publisher/publishername
            
            
              publisher
            
            
              
            
          
          
            
              publisher/address
            
            
              publisher-address
            
            
              
            
          
          
            
              revhistory/revision
            
            
              revision
            
            
              
            
          
          
            
              
                Block-level elements
              
            
          
          
            
              para
            
            
              para, Normal
            
            
              Any Word paragraph with style Normal will also be converted to a para element.
            
          
          
            
              formalpara/title
            
            
              formalpara-title
            
            
              
            
          
          
            
              formalpara/para
            
            
              formalpara
            
            
              
            
          
          
            
              simpara
            
            
              simpara
            
            
              
            
          
          
            
              note/title
            
            
              note-title
            
            
              
            
          
          
            
              note/para
            
            
              note
            
            
              Consecutive paragraphs with style note after the first note are to be treated as part of the same note element.  That is, consecutive notes are coalesced. The note may or may not have a title.
            
          
          
            
              caution/title
            
            
              caution-title
            
            
              
            
          
          
            
              caution/para
            
            
              caution
            
            
              Consecutive cautions are coalesced.
            
          
          
            
              warning/title
            
            
              warning-title
            
            
              
            
          
          
            
              warning/para
            
            
              warning
            
            
              Consecutive warnings are coalesced.
            
          
          
            
              important/title
            
            
              important-title
            
            
              
            
          
          
            
              important/para
            
            
              important
            
            
              Consecutive importants are coalesced.
            
          
          
            
              tip/title
            
            
              tip-title
            
            
              
            
          
          
            
              tip/para
            
            
              tip
            
            
              Consecutive tips are coalesced.
            
          
          
            
              itemizedlist/listitem/para
            
            
              
                itemizedlist1
itemizedlist2
itemizedlist3
itemizedlist4
              
            
            
              A number suffix indicates a nesting level within other lists.
            
          
          
            
              orderedlist/listitem/para
            
            
              
                orderedlist1
orderedlist2
orderedlist3
orderedlist4
              
            
            
              
            
          
          
            
              listitem/para[position() != 1]
            
            
              para-continue
            
            
              This paragraph is included in the immediately preceding listitem.
            
          
          
            
              example/title
            
            
              example-title
            
            
              All content following the title is included in the example element. The end of the example content is marked by a caption paragraph or an empty paragraph if there is no caption.
            
          
          
            
              figure/title
            
            
              figure-title
            
            
              All content following the title is included in the figure element. Metadata must immediately follow the title. The end of the figure content is marked by a caption paragraph or an empty paragraph if there is no caption.
            
          
          
            
              informalfigure/mediaobject/imageobject/imagedata/@fileref
            
            
              informalfigure-imagedata, caption
            
            
              The content of the informalfigure-imagedata paragraph is taken as the URI for the image. Metadata may immediately follow the paragraph.
            
          
          
            
              mediaobject/imageobject/imagedata/@fileref
            
            
              imageobject-imagedata, caption
            
            
              The content of the imageobject-imagedata paragraph is taken as the URI for the image.  May be followed by a caption style paragraph. Metadata may immediately follow the paragraph, before the caption, if any.
            
          
          
            
              table
            
            
              Word table, caption
            
            
              
            
          
          
            
              table/title
            
            
              table-title, caption
            
            
              Metadata may immediately follow the paragraph.
            
          
          
            
              informaltable
            
            
              Word table
            
            
              A table with no title immediately preceding it.
            
          
          
            
              caption
            
            
              caption
            
            
              
            
          
          
            
              literallayout
            
            
              literallayout
            
            
              Inside a literallayout paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter).
            
          
          
            
              programlisting
            
            
              programlisting
            
            
              Inside a programlisting paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter). Tabs are not supported.
            
          
          
            
              blockquote/title
            
            
              blockquote-title
            
            
              Must immediately precede a blockquote paragraph in Word.
            
          
          
            
              blockquote/para
            
            
              blockquote
            
            
              
            
          
          
            
              blockquote/attribution
            
            
              blockquote-attribution
            
            
              Must immediately follow a blockquote paragraph in Word.
            
          
          
            
              bibliomisc
            
            
              bibliomisc
            
            
              
            
          
          
            
              
                Non-DocBook elements
              
            
          
          
            
              xi:include
            
            
              xinclude
            
            
              The content of the paragraph becomes the value of the href attribute.
            
          
          
            
              
                Inline elements
              
            
          
          
            
              emphasis
            
            
              emphasis
            
            
              
            
          
          
            
              emphasis/@role="bold"
            
            
              emphasis-bold
            
            
              
            
          
          
            
              emphasis/@role="underline"
            
            
              emphasis-underline
            
            
              
            
          
          
            
              footnote
            
            
              Word footnote
            
            
              
            
          
          
            
              link
            
            
              link
            
            
              In Word, hyperlink properties identify the DocBook linkend.
            
          
          
            
              releaseinfo
            
            
              releaseinfo
            
            
              
            
          
          
            
              surname
            
            
              surname
            
            
              Character style.  Must occur in an appropriate parent paragraph, such as author or editor.
            
          
          
            
              firstname
            
            
              firstname
            
            
              Character style.  Must occur in an appropriate parent paragraph, such as author or editor.
            
          
          
            
              orgname
            
            
              orgname
            
            
              
            
          
          
            
              keywordset/keyword
            
            
              keyword
            
            
              Paragraph style.  Consecutive keyword elements are merged into a single keywordset parent element.  Words (phrases) within a paragraph separated by commas become individual keyword elements.
            
          
          
            
              citetitle
            
            
              citetitle
            
            
              
            
          
          
            
              city
            
            
              city
            
            
              
            
          
          
            
              contrib
            
            
              contrib
            
            
              
            
          
          
            
              country
            
            
              country
            
            
              
            
          
          
            
              email
            
            
              email
            
            
              
            
          
          
            
              fax
            
            
              fax
            
            
              
            
          
          
            
              honorific
            
            
              honorific
            
            
              
            
          
          
            
              jobtitle
            
            
              jobtitle
            
            
              
            
          
          
            
              lineage
            
            
              lineage
            
            
              
            
          
          
            
              orgdiv
            
            
              orgdiv
            
            
              
            
          
          
            
              otheraddr
            
            
              otheraddr
            
            
              
            
          
          
            
              othername
            
            
              othername
            
            
              
            
          
          
            
              phone
            
            
              phone
            
            
              
            
          
          
            
              pob
            
            
              pob
            
            
              
            
          
          
            
              postcode
            
            
              postcode
            
            
              
            
          
          
            
              shortaffil
            
            
              shortaffil
            
            
              
            
          
          
            
              state
            
            
              state
            
            
              
            
          
        
      
    
    
      Proposed Additions - not yet implemented
      
        
        
        
        
          
            
              DocBook element
            
            
              Style(s)
            
            
              Comments
            
          
        
        
          
            
              variablelist/varlistentry/term
            
            
              
                variablelist1-term
variablelist2-term
variablelist3-term
variablelist4-term
              
            
            
              A variablelist in Word should be a sequence of alternating paragraphs styled as variablelistN-term and variablelistN.
            
          
          
            
              variablelist/varlistentry/listitem/para
            
            
              
                variablelist1
variablelist2
variablelist3
variablelist4
              
            
            
              Consecutive paragraphs are coalesced.
            
          
        
      
    
    
      Attributes
      Attributes are a feature of DocBook XML that have no direct counterpart in Word.
      XML attributes are encoded in Word comments (annotations).  Some dummy text (just a space, using a character style that includes the hidden property) anchors the comment.  Within the comment text, character styles are used to indicate attribute names and values (these must be paired).  This approach keeps the attributes separate from the main body and allows multiple attributes to be encoded.
      A disadvantage to this approach is that a paragraph may be related to more than one element, but the attributes are associated with only one element (by default the parent).  For example, a section may have an attribute as well as the title child element, but only a single paragraph (with paragraph style sect1-title) represents both elements.  Any attribute defined in a comment would be associated with the sect1 element.
      Pages does not have annotations, so the character styles attribute-name and attribute-value are used.
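      The pairing of names and values can be sketched as follows. The input is assumed to be a list of (character style, text) runs taken from a comment or, in Pages, from the text itself. Using the attribute-name and attribute-value style names for every word processor is an assumption, as are the function name and the error handling for unpaired runs.

```python
def pair_attributes(runs):
    """Pair attribute-name and attribute-value runs into a dict.

    `runs` is a list of (character_style, text) tuples in document order.
    """
    attributes = {}
    pending_name = None
    for style, text in runs:
        if style == "attribute-name":
            if pending_name is not None:
                raise ValueError(f"attribute {pending_name!r} has no value")
            pending_name = text.strip()
        elif style == "attribute-value":
            if pending_name is None:
                raise ValueError(f"value {text!r} has no attribute name")
            attributes[pending_name] = text
            pending_name = None
        # Runs in any other style (separators, spaces) are ignored.
    if pending_name is not None:
        raise ValueError(f"attribute {pending_name!r} has no value")
    return attributes


attrs = pair_attributes([
    ("attribute-name", "role"),
    (None, " "),
    ("attribute-value", "draft"),
    ("attribute-name", "condition"),
    ("attribute-value", "print"),
])
```

      Because names and values must be paired, the sketch rejects a name without a value and a value without a name instead of guessing.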