Rethinking OOXML Validation, Part 1

Brussels ODF Plugfest venue

At the recent ODF Plugfest in Brussels, I was very interested to hear Jos van den Oever of KOffice present on how ODF’s alternative “flat” document format could be used to drive browser-based rendering of ODF documents. ODF defines two methods of serializing documents: one uses multiple files in a “Zip” archive; the other, the aforementioned “flat” format, combines everything into a single XML file. Seeing this approach in action gelled with some thoughts I’d been having on how better to validate OOXML documents using standards-based XML tools …

Unlike ODF, OOXML has no “flat” file format: its files are OPC packages built on top of Zip archives. However, some interesting work has already been done in this area by Microsoft’s Eric White in blog posts such as The Flat OPC Format, which points out that Microsoft Word™ (alone among the Office™ suite members [UPDATE: Word and PowerPoint can do this]) can already save in an unofficial flat format which can be processed with standards-based XML tools like XSLT processors.

Rather than having to rely on Word, or stick only to word processing documents, I thought it would be interesting to explore ways in which any OOXML document could be flattened and processed using standards-based processors. Ideally one would then also write a tool that did the opposite, so that working with OOXML content would take three steps: first flatten it, then do the processing, and finally re-structify the result into an OPC package.

Back to XProc

I have already written a number of blog posts on office document validation, and have used a variety of technical approaches to get the validation done. Most of my recent effort has been on developing the Office-o-tron, a hand-crafted Java application which functions primarily by unpacking archives to the file system before operating on their individual components. Earlier efforts using XProc have foundered on the difficulty of working with files inside a Zip archive, in particular because I was using the non-standard JAR URI scheme which, it turns out, is not capable of addressing items with certain names (e.g. “Object 1”) that one typically finds inside ODF documents.

However, the knowledge gained from developing Office-o-tron, together with another look at the Zip-handling extension steps of the Calabash XProc processor, made me think there was a way XProc could be used to get the job done. Here’s how …

Inspecting an OPC package

OOXML documents are built using the Open Packaging Conventions (OPC, or ISO/IEC 29500-2), a generic means of building file formats within Zip archives which also happens to underpin the XPS format. OPC’s chief virtue (that it is very generic) is offset by much (probably too much) complexity in pursuit of this goal. Before we can know what we’ve got in an OPC package, and how to process it, some work needs to be done.

Fortunately, the essence of what we need consists of two pieces of information: a file inside the Zip guaranteed to be called “[Content_Types].xml”, and a manifest of the content of the package. XProc can get both of these pieces of information for us:

<?xml version="1.0"?>
<p:pipeline name="consolidate-officedoc"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

  <p:import href="extensions.xpl"/>

  <!-- specifies the document to be processed -->
  <p:option name="package-sysid" required="true"/>


  <!--
  
  Given the system identifier $package-sysid of an OOXML document,
  this pipeline returns a document whose root element is archive-info
  which contains two children: the [Content_Types].xml resource
  contained in the root of the archive, and a zipfile element
  created per the unzip step at:
  
  http://xmlcalabash.com/extension/steps/library-1.0.xpl
  
  -->
  <p:pipeline type="xo:archive-info">

    <p:option name="package-sysid" required="true"/>

    <cx:unzip name="content-types" file="[Content_Types].xml">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <cx:unzip name="archive-content">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <p:sink/>

    <p:wrap-sequence wrapper="archive-info">
      <p:input port="source">
        <p:pipe step="content-types" port="result"/>
        <p:pipe step="archive-content" port="result"/>
      </p:input>
    </p:wrap-sequence>

  </p:pipeline>

  <!-- get the type information and content of the package -->
  <xo:archive-info>
    <p:with-option name="package-sysid" select="$package-sysid"/>
  </xo:archive-info>

  <!-- etc -->

Executing this pipeline on a typical “HelloWorld.docx” file gives us an XML document which consists of a composite of our two vital pieces of information, as follows:

<archive-info>
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/word/comments.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    <Override PartName="/word/styles.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
    <Override PartName="/docProps/app.xml"
      ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
    <Override PartName="/word/settings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
    <Override PartName="/word/theme/theme1.xml"
      ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
    <Override PartName="/word/fontTable.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
    <Override PartName="/word/webSettings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
    <Override PartName="/docProps/core.xml"
      ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
  </Types>
  <c:zipfile href="file:/C:/work/officecert/hello.docx">
    <c:file compressed-size="368" size="712" name="docProps/app.xml" date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="375" size="747" name="docProps/core.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="459" size="1004" name="word/comments.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="539" size="1218" name="word/document.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="407" size="1296" name="word/fontTable.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="651" size="1443" name="word/settings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="1783" size="16891" name="word/styles.xml"
      date="2009-05-25T14:15:08.000+01:00"/>
    <c:file compressed-size="1686" size="6992" name="word/theme/theme1.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="187" size="260" name="word/webSettings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="265" size="948" name="word/_rels/document.xml.rels"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="372" size="1443" name="[Content_Types].xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="243" size="590" name="_rels/.rels" date="1980-01-01T00:00:00.000Z"/>
  </c:zipfile>
</archive-info>
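
For readers without an XProc processor to hand, the same two pieces of information can be gathered with a few lines of Python. This is only an illustrative sketch, not part of the pipeline above: the function names archive_info and make_sample_package are invented for the example, and the sample package is a toy, not a valid .docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def archive_info(package):
    """Return (content-types root element, list of entry names) for a package.

    `package` may be a path or a file-like object, as zipfile.ZipFile accepts.
    """
    with zipfile.ZipFile(package) as zf:
        # The content-types part always has this fixed name in the Zip root.
        types = ET.fromstring(zf.read("[Content_Types].xml"))
        # The manifest: one name per Zip entry, in archive order.
        names = [info.filename for info in zf.infolist()]
    return types, names


def make_sample_package():
    """Build a tiny in-memory package for demonstration (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "[Content_Types].xml",
            f'<Types xmlns="{CT_NS}">'
            '<Default Extension="xml" ContentType="application/xml"/>'
            "</Types>",
        )
        zf.writestr("word/document.xml", "<document/>")
    buf.seek(0)
    return buf


types, names = archive_info(make_sample_package())
print(types.tag)
for name in names:
    print(name)
```

Reading entries through the Zip library, rather than through a URI scheme, sidesteps the problem of entry names that cannot be expressed in a JAR URI.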

The purpose of the information in the Types element is to tell us the MIME types of the contents of the package, either specifically (in Override elements), or indirectly by associating a MIME type with file extensions (in Default elements). What we are now going to do is add another step to our pipeline that resolves all this information so that we label each of the items in the Zip file with the MIME type that applies to it.

 <p:xslt>
    <p:input port="stylesheet">
      <p:inline>
        <xsl:stylesheet version="2.0"
          xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types">

          <xsl:variable name="ooxml-mappings" select="document('ooxml-map.xml')"/>

          <xsl:template match="/">
            <c:zipfile>
              <xsl:copy-of select="/archive-info/c:zipfile/@*"/>
              <xsl:apply-templates/>
            </c:zipfile>
          </xsl:template>

          <xsl:template match="c:file">
            <xsl:variable name="entry-name" select="@name"/>
            <xsl:variable name="toks" select="tokenize($entry-name,'\.')"/>
            <xsl:variable name="ext" select="$toks[count($toks)]"/>
            <c:file>
              <xsl:copy-of select="@name"/>
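              <!-- part names carry a leading "/" that Zip entry names lack,
                   so compare whole names rather than suffixes -->
              <xsl:variable name="overriden-type"
                select="//opc:Override[@PartName = concat('/', $entry-name)]/@ContentType"/>
              <!-- a Default applies only when the extension matches exactly -->
              <xsl:variable name="default-type"
                select="//opc:Default[lower-case(@Extension) = lower-case($ext)]/@ContentType"/>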
              <xsl:variable name="resolved-type"
                select="if(string-length($overriden-type)) then $overriden-type else $default-type"/>
              <xsl:attribute name="resolved-type" select="$resolved-type"/>
              <xsl:attribute name="schema"
                select="$ooxml-mappings//mapping[mime-type=$resolved-type]/schema-name"/>
              <expand name="{@name}"/>
            </c:file>
          </xsl:template>

        </xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:xslt>

You’ll notice I am also using an XML document called “ooxml-map.xml” as part of this enrichment process. This is a file which contains the (hard won) information about which documents of which MIME types are governed by which schemas published as part of the OOXML standard. That document is available online here.
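
As an aside, the resolution rule itself (an Override for the exact part wins, otherwise the Default for the file extension applies) can be sketched in Python. This is a simplified illustration under stated assumptions: the function name resolve_types and the SAMPLE_TYPES data are invented for the example, names are compared case-insensitively, and the schema lookup via ooxml-map.xml is left out.

```python
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# A cut-down content-types document, modelled on the listing earlier in the post.
SAMPLE_TYPES = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def resolve_types(types_xml, entry_names):
    """Map each Zip entry name to its content type (None if nothing applies)."""
    root = ET.fromstring(types_xml)
    overrides = {o.get("PartName").lower(): o.get("ContentType")
                 for o in root.findall(CT_NS + "Override")}
    defaults = {d.get("Extension").lower(): d.get("ContentType")
                for d in root.findall(CT_NS + "Default")}
    resolved = {}
    for name in entry_names:
        # Part names start with "/"; Zip entry names do not.
        part_name = "/" + name.lower()
        # Take the extension from the last path segment only.
        base = name.rsplit("/", 1)[-1]
        extension = base.rsplit(".", 1)[-1].lower() if "." in base else ""
        # An Override for the exact part wins; otherwise use the Default.
        resolved[name] = overrides.get(part_name, defaults.get(extension))
    return resolved


entries = ["word/document.xml", "_rels/.rels", "[Content_Types].xml"]
for name, content_type in resolve_types(SAMPLE_TYPES, entries).items():
    print(name, "->", content_type)
```

An entry with neither an Override nor a matching Default comes back as None, which a validator would presumably want to report rather than ignore.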

The result of running this additional step is to give us an enriched manifest of the OPC package content:

<c:zipfile xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types"
  href="file:/C:/work/officecert/hello.docx">
  <c:file name="docProps/app.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.extended-properties+xml"
    schema="shared-documentPropertiesExtended.xsd">
    <expand name="docProps/app.xml"/>
  </c:file>
  <c:file name="docProps/core.xml"
    resolved-type="application/vnd.openxmlformats-package.core-properties+xml"
    schema="opc-coreProperties.xsd">
    <expand name="docProps/core.xml"/>
  </c:file>
  <c:file name="word/comments.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"
    schema="wml.xsd">
    <expand name="word/comments.xml"/>
  </c:file>
  <c:file name="word/document.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    schema="wml.xsd">
    <expand name="word/document.xml"/>
  </c:file>
  <c:file name="word/fontTable.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"
    schema="wml.xsd">
    <expand name="word/fontTable.xml"/>
  </c:file>
  <c:file name="word/settings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"
    schema="wml.xsd">
    <expand name="word/settings.xml"/>
  </c:file>
  <c:file name="word/styles.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"
    schema="wml.xsd">
    <expand name="word/styles.xml"/>
  </c:file>
  <c:file name="word/theme/theme1.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.theme+xml"
    schema="dml-main.xsd">
    <expand name="word/theme/theme1.xml"/>
  </c:file>
  <c:file name="word/webSettings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"
    schema="wml.xsd">
    <expand name="word/webSettings.xml"/>
  </c:file>
  <c:file name="word/_rels/document.xml.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="word/_rels/document.xml.rels"/>
  </c:file>
  <c:file name="[Content_Types].xml" resolved-type="application/xml" schema="">
    <expand name="[Content_Types].xml"/>
  </c:file>
  <c:file name="_rels/.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="_rels/.rels"/>
  </c:file>
</c:zipfile>

Also notice that each of the items has been given a child element called expand – this is a placeholder for the documents which we are going to expand in situ to create our flat representation of the OPC package content. The pipeline step to achieve that expansion is quite straightforward:

  <p:viewport name="archive-content" match="c:file[contains(@resolved-type,'xml')]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
    </cx:unzip>
  </p:viewport>

At this point, we're only expanding the content that looks like it is XML; a fuller implementation would also expand non-XML content and Base64-encode it (perfectly doable with XProc).

The result of applying this process is a rather large document, with all the expand elements referring to XML documents replaced by that XML document content … in other words, a flat OPC file. With the additional metadata we have placed on the containing c:file elements, we have enough information to start performing schema validation. I will look at validation in more depth in the next part of this post …
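
To make the idea of the flat file concrete, here is a rough Python sketch of the expansion, including the Base64 treatment of non-XML content mentioned above. It is an illustration only: the function names flatten and make_sample_package, the un-namespaced zipfile and file element names, and the encoding attribute are all assumptions made for the example, not what the pipeline emits.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET


def flatten(package, resolved_types):
    """Return a <zipfile> element holding every entry of the package inline.

    `resolved_types` maps entry names to content types (hypothetical input,
    for example the outcome of an earlier type-resolution step).
    """
    root = ET.Element("zipfile")
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            content_type = resolved_types.get(name) or ""
            entry = ET.SubElement(
                root, "file", {"name": name, "resolved-type": content_type})
            data = zf.read(name)
            if "xml" in content_type:
                # XML parts are parsed and embedded as child elements.
                entry.append(ET.fromstring(data))
            else:
                # Anything else is carried as Base64 text.
                entry.set("encoding", "base64")
                entry.text = base64.b64encode(data).decode("ascii")
    return root


def make_sample_package():
    """Build a toy in-memory package: one XML part and one binary part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document><p>Hello</p></document>")
        zf.writestr("word/media/image1.png", b"\x89PNG")
    buf.seek(0)
    return buf


sample_types = {
    "word/document.xml": "application/xml",
    "word/media/image1.png": "image/png",
}
flat = flatten(make_sample_package(), sample_types)
print(ET.tostring(flat, encoding="unicode"))
```

Like the viewport step, the sketch decides what counts as XML by looking for "xml" in the resolved type, so it inherits the same limitation: a part whose declared type is wrong will be mishandled.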

Comments (8) -

  • Chris Rae

    11/5/2010 1:43:20 AM |

    I like this idea - the only thing I wonder about on the down side is that it may take a bit longer to declare invalid files as invalid, because of the stripdown/reassembly step.

  • Alex Hudson

    11/5/2010 4:05:00 AM |

    I've never really understood the ODF "flat file" format. Last time I looked at the spec., there didn't seem to be an associated MIME type, file extension, or any discussion on how to use it. I've tried to feed it to OpenOffice.org a few times and failed.

    Is there really anything to it? I mean, you could just do <files><file name="content.xml">...</file><file name="styles.xml">..</file></files> or something and get all the data in one file; which is partly useful - but without applications actually opening that kind of data consistently, it's a bit useless....

  • Alex

    11/5/2010 5:48:26 PM |

    @Chris

    Yes, a little bit longer; though the more significant performance hit will come when we need to build an in-memory representation of our flat documents. Fortunately the state-of-the-art in XML processing is progressing hand-in-hand with modern machinery which is up to this kind of task ...

    @Alex H

    Yes, I get the impression the ODF flat format is a bit of a Cinderella technology, and I haven't used it in anger. The big advantage of a flat format (as we will see) for validation is that cross-document consistency checking becomes a lot simpler using off-the-shelf validation technologies ... I'm not sure it has a big particular use in day-to-day document handling.

    - Alex.

  • Wu MingShi

    11/5/2010 8:12:10 PM |

    Dear Alex,

    I am sure you have considered what a naive person like me would have done, i.e. simply put all XML files inside a big list container, say, filelist, e.g.,

    <filelist>
    <file name='fileA'> <!-- content of fileA --></file>
    <file name='fileB'> <!-- content of fileB --></file>
    </filelist>

    Why have you discounted it?

  • Alex

    11/5/2010 8:40:59 PM |

    @Wu MingShi

    Yes, that's more or less what I am doing! The challenge is (1) how to get from a ZIP file to this document and (2) how to work out the type each file claims to be, so we can later test it for conformance ...

    - Alex.

    • Wu MingShi

      11/18/2010 9:27:55 PM |

      Alex,

      Thanks for the reply. Unfortunately, as English is not my first language, I might have given the impression that I am not as naive as I actually am. ;-)

      I can see where you are heading with 'c:file' tags. I thought we could get away with type information. The reason is that in the OPC zip package we already store all the information we need somewhere (as you mentioned in your post); surely merely flattening the files into one big XML document would have captured that information as well, and any validation procedure could pick it up as need be inside c:zipfile?

      Of course, I am not saying my naive simple zip file content dump into one big XML is easy to validate since critical information to validate one file might be spread across different tags. It also means one might have to read child tag just to validate the parent tag and other messy or nasty stuff. Is this what you are trying to avoid?

      Thanks
      Wu MingShi

  • Eric White

    11/6/2010 1:55:50 AM |

    Hi Alex,

    Minor point - PowerPoint also has the same ability to save in Flat OPC.

    I wonder why not, if going to the trouble to flatten it, you don't just use the Flat OPC schema, which contains everything necessary to reconstitute into an OPC.

    -Eric

    • Alex

      11/6/2010 4:34:44 AM |

      @Eric hi

      Thanks for the correction - I have updated the text to reflect this.

      I didn't know there was a flat schema - where can I find it? I'm not that bothered by the names of the wrapper elements, though, and it may be that for validation purposes it’s going to prove useful to store some additional information around the content than a non-specialised schema would allow for ...
