Beedle dee, dee dee dee, two formats [updated]

Punch & Judy

Two pieces of recent news offer an interesting commentary on the continuing evolution of office document formats and applications.

In Germany, following last year’s pained presentation on its attempts to adopt ODF, the city of Freiburg, along with a number of other European public administrations, has been funding an open source project to improve OpenOffice’s support for the OOXML format. The goal is to advance the case that users can switch from the MS Office suite to an open source alternative, confident in their ability to interoperate between the two. The project is bearing fruit, with its first results claiming to have fixed “three of five biggest OOXML support issues”.

The news is timely with – it seems – a vote imminent in Freiburg’s city council to decide whether to “end its floundering migration of OpenOffice and to stop using the Open Document Format” … it will be interesting to see how this plays out.

[Update 2012-11-21: It appears Freiburg has voted to abandon its plan to migrate to OpenOffice (and for them it seems this means ODF too)].

Meanwhile, in Portugal …

But ODF advocates need not despair. While in Freiburg it might seem that ODF is the boat anchor holding FOSS office suites back, a thousand miles away in Portugal the Portuguese Government has published a list that mandates a number of open standards to be adopted in the Portuguese public administration. ODF is there; OOXML is not. The only wrinkle is that the version of ODF specified is 1.1 — which, interestingly, is the very version of ODF that MS Office currently happens to support.

Double the fun?

This raises the intriguing possibility that in one part of Europe officials may be using FOSS office suites to work with OOXML documents, while elsewhere MS Office will be used to work with ODF documents. If nothing else, this kind of thing is likely to accelerate demand from users for developers and standardizers to address the areas where format interoperability remains less than clear-cut. In this scenario, though, I’d also have to say I’d feel sorry for the users – especially those working on more complex documents – for whom those “less than clear-cut” areas are likely to be all too apparent …

Australia and OOXML

Somewhere too early


There have been some poor decisions of late in Australia. Not playing Hauritz and persisting too long with the out-of-form Clarke and Ponting probably cost Australia the Ashes and has led to terrible self-flagellation. While it’s generally not done to take pleasure in the discomfort of others, I do think an exception can be made in the case of the Australian cricket team.

From various recent blogs and tweets I’ve noticed a fuss surrounding the decision by the Australian Government Information Management Office (AGIMO) to recommend the use of OOXML as a document format, and from the tenor of the comments it would seem this is being treated as a similar calamity for Australia. However, there appears to be some misunderstanding and misinformation flying around which is worth a comment …

Leaving aside the merits of the decision itself, one particular theme in the commentary is that AGIMO have somehow picked a “non-ISO” version of OOXML. I can’t find any evidence of this. By specifying Ecma 376 without an edition number the convention is that the latest version of that standard is intended; and though I do think there is a danger of over-reading this particular citation, the current version of Ecma 376 is the second edition, which is the version of OOXML that was approved by ISO and IEC members in April 2008. The Ecma and ISO/IEC versions are in lock-step, with the Ecma text only ever mirroring the ISO/IEC text. And although (as now) there are inevitably some bureaucratic and administrative delays in the Ecma version rolling in all changes made in JTC 1 prior to publication, to cite one is, effectively, equivalent to citing the other.

[UPDATE: John Sheridan from AGIMO comments below that Ecma 376 1st Edition was intended, and I respond]

Rethinking OOXML Validation, Part 1

Brussels ODF Plugfest venue

At the recent ODF Plugfest in Brussels, I was very interested to hear Jos van den Oever of KOffice present on how ODF’s alternative “flat” document format could be used to drive browser-based rendering of ODF documents. ODF defines two methods of serializing documents: one uses multiple files in a “Zip” archive; the other, the aforementioned “flat” format, combines everything into a single XML file. Seeing this approach in action gelled with some thoughts I’d been having on how better to validate OOXML documents using standards-based XML tools …

Unlike ODF, OOXML has no “flat” file format – its files are OPC packages built on top of Zip archives. However, some interesting work has already been done in this area by Microsoft’s Eric White in blog posts such as The Flat OPC Format, which points out that Microsoft Word™ (alone among the Office™ suite members [UPDATE: Word and PowerPoint can do this]) can already save in an unofficial flat format which can be processed with standards-based XML tools like XSLT processors.

Rather than having to rely on Word, or stick only to word processing documents, I thought it would be interesting to explore ways in which any OOXML document could be flattened and processed using standards-based processors. Ideally one would then also write a tool that did the opposite so that to work with OOXML content the steps would be first to flatten it, then to do the processing, and then to re-structify it into an OPC package.

Back to XProc

I have already written a number of blog posts on office document validation, and have used a variety of technical approaches to get the validation done. Most of my recent effort has been on developing the Office-o-tron, a hand-crafted Java application which functions primarily by unpacking archives to the file system before operating on their individual components. Earlier efforts using XProc foundered on the difficulty of working with files inside a Zip archive, in particular because I was using the non-standard JAR URI scheme which, it turns out, is not capable of addressing items with certain names (e.g. “Object 1”) that one typically finds inside ODF documents.

However, the knowledge gained from developing Office-o-tron, together with another look at the Zip handling extension steps of the Calabash XProc processor, made me think there was a way XProc could be used to get the job done. Here’s how …

Inspecting an OPC package

OOXML documents are built using the Open Packaging Conventions (OPC, or ISO/IEC 29500-2), a generic means of building file formats within Zip archives which also happens to underpin the XPS format. OPC’s chief virtue – that it is very generic – is offset by much (probably too much) complexity in pursuit of this goal. Before we can know what we’ve got in an OPC package, and how to process it, some work needs to be done.

Fortunately, the essence of what we need consists of two pieces of information: a file inside the Zip guaranteed to be called “[Content_Types].xml”, and a manifest of the content of the package. XProc can get both of these pieces of information for us:

<?xml version="1.0"?>
<p:pipeline name="consolidate-officedoc"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="…"
  version="1.0">

  <p:import href="extensions.xpl"/>

  <!-- specifies the document to be processed -->
  <p:option name="package-sysid" required="true"/>

  <!--
    Given the system identifier $package-sysid of an OOXML document,
    this pipeline returns a document whose root element is archive-info
    which contains two children: the [Content_Types].xml resource
    contained in the root of the archive, and a zipfile element
    created per the unzip step at: …
  -->
  <p:pipeline type="xo:archive-info">

    <p:option name="package-sysid" required="true"/>

    <cx:unzip name="content-types" file="[Content_Types].xml">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <cx:unzip name="archive-content">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <p:wrap-sequence wrapper="archive-info">
      <p:input port="source">
        <p:pipe step="content-types" port="result"/>
        <p:pipe step="archive-content" port="result"/>
      </p:input>
    </p:wrap-sequence>

  </p:pipeline>

  <!-- get the type information and content of the package -->
  <xo:archive-info>
    <p:with-option name="package-sysid" select="$package-sysid"/>
  </xo:archive-info>

  <!-- etc -->

</p:pipeline>

Executing this pipeline on a typical “HelloWorld.docx” file gives us an XML document which consists of a composite of our two vital pieces of information, as follows:

<archive-info>
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/word/comments.xml" …/>
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml" …/>
    <Override PartName="/word/styles.xml" …/>
    <Override PartName="/docProps/app.xml" …/>
    <Override PartName="/word/settings.xml" …/>
    <Override PartName="/word/theme/theme1.xml" …/>
    <Override PartName="/word/fontTable.xml" …/>
    <Override PartName="/word/webSettings.xml" …/>
    <Override PartName="/docProps/core.xml" …/>
  </Types>
  <c:zipfile href="file:/C:/work/officecert/hello.docx">
    <c:file compressed-size="368" size="712" name="docProps/app.xml" date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="375" size="747" name="docProps/core.xml" …/>
    <c:file compressed-size="459" size="1004" name="word/comments.xml" …/>
    <c:file compressed-size="539" size="1218" name="word/document.xml" …/>
    <c:file compressed-size="407" size="1296" name="word/fontTable.xml" …/>
    <c:file compressed-size="651" size="1443" name="word/settings.xml" …/>
    <c:file compressed-size="1783" size="16891" name="word/styles.xml" …/>
    <c:file compressed-size="1686" size="6992" name="word/theme/theme1.xml" …/>
    <c:file compressed-size="187" size="260" name="word/webSettings.xml" …/>
    <c:file compressed-size="265" size="948" name="word/_rels/document.xml.rels" …/>
    <c:file compressed-size="372" size="1443" name="[Content_Types].xml" …/>
    <c:file compressed-size="243" size="590" name="_rels/.rels" date="1980-01-01T00:00:00.000Z"/>
  </c:zipfile>
</archive-info>
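
For readers without an XProc processor to hand, roughly the same two pieces of information can be gathered with Python's standard library. This is only a sketch of the idea, not the pipeline's actual implementation: the function name archive_info is borrowed from the step above for convenience, and the tuple layout of the manifest is my own assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

# The one part name every OPC package is expected to contain at its root.
CONTENT_TYPES = "[Content_Types].xml"


def archive_info(package):
    """Return (content_types_root, entries) for an OPC package.

    `package` may be a file path or a binary file-like object.
    `entries` is a list of (name, size, compressed_size) tuples, one per
    Zip entry, loosely mirroring the c:file elements shown above.
    (Hypothetical helper: the name and return shape are illustrative.)
    """
    with zipfile.ZipFile(package) as zf:
        # Parse the content-types part straight out of the archive.
        content_types = ET.fromstring(zf.read(CONTENT_TYPES))
        # Build the manifest from the Zip central directory.
        entries = [
            (info.filename, info.file_size, info.compress_size)
            for info in zf.infolist()
        ]
    return content_types, entries
```

Calling archive_info("hello.docx") would give back the parsed Types element and a manifest list, which is all the later steps need.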

The purpose of the information in the Types element is to tell us the MIME types of the contents of the package, either specifically (in Override elements), or indirectly by associating a MIME type with file extensions (in Default elements). What we are now going to do is add another step to our pipeline that resolves all this information so that we label each of the items in the Zip file with the MIME type that applies to it.

  <p:xslt>
    <p:input port="stylesheet">
      <p:inline>
        <xsl:stylesheet version="2.0" …>

          <xsl:variable name="ooxml-mappings" select="document('ooxml-map.xml')"/>

          <xsl:template match="/">
            <c:zipfile>
              <xsl:copy-of select="/archive-info/c:zipfile/@*"/>
              <xsl:apply-templates select="/archive-info/c:zipfile/c:file"/>
            </c:zipfile>
          </xsl:template>

          <xsl:template match="c:file">
            <xsl:variable name="entry-name" select="@name"/>
            <xsl:variable name="toks" select="tokenize($entry-name,'\.')"/>
            <xsl:variable name="ext" select="$toks[count($toks)]"/>
            <c:file>
              <xsl:copy-of select="@name"/>
              <xsl:variable name="overriden-type" …/>
              <xsl:variable name="default-type" …/>
              <xsl:variable name="resolved-type"
                select="if(string-length($overriden-type)) then $overriden-type else $default-type"/>
              <xsl:attribute name="resolved-type" select="$resolved-type"/>
              <xsl:attribute name="schema" …/>
              <expand name="{@name}"/>
            </c:file>
          </xsl:template>

        </xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:xslt>


You’ll notice I am also using an XML document called “ooxml-map.xml” as part of this enrichment process. This is a file which contains the (hard won) information about which documents of which MIME types are governed by which schemas, as published as part of the OOXML standard. That document is available online here.

The result of running this additional step is to give us an enriched manifest of the OPC package content:

<c:zipfile xmlns:c="http://www.w3.org/ns/xproc-step" …>
  <c:file name="docProps/app.xml" …>
    <expand name="docProps/app.xml"/>
  </c:file>
  <c:file name="docProps/core.xml" …>
    <expand name="docProps/core.xml"/>
  </c:file>
  <c:file name="word/comments.xml" …>
    <expand name="word/comments.xml"/>
  </c:file>
  <c:file name="word/document.xml" …>
    <expand name="word/document.xml"/>
  </c:file>
  <c:file name="word/fontTable.xml" …>
    <expand name="word/fontTable.xml"/>
  </c:file>
  <c:file name="word/settings.xml" …>
    <expand name="word/settings.xml"/>
  </c:file>
  <c:file name="word/styles.xml" …>
    <expand name="word/styles.xml"/>
  </c:file>
  <c:file name="word/theme/theme1.xml" …>
    <expand name="word/theme/theme1.xml"/>
  </c:file>
  <c:file name="word/webSettings.xml" …>
    <expand name="word/webSettings.xml"/>
  </c:file>
  <c:file name="word/_rels/document.xml.rels" …>
    <expand name="word/_rels/document.xml.rels"/>
  </c:file>
  <c:file name="[Content_Types].xml" resolved-type="application/xml" schema="">
    <expand name="[Content_Types].xml"/>
  </c:file>
  <c:file name="_rels/.rels" …>
    <expand name="_rels/.rels"/>
  </c:file>
</c:zipfile>
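
The type-resolution rule behind the resolved-type attribute (use an Override for the exact part name if there is one, otherwise fall back to the Default for the file extension) can be sketched outside XSLT as well. The Python below is an illustration under my own assumptions rather than a transcription of the stylesheet: resolve_types is a hypothetical helper, it compares names case-insensitively as OPC specifies for part names, and it leaves out the schema lookup because that depends on the layout of ooxml-map.xml.

```python
import xml.etree.ElementTree as ET

# Namespace of the elements in [Content_Types].xml, in ElementTree notation.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def resolve_types(content_types_xml, entry_names):
    """Map each Zip entry name to the MIME type that applies to it.

    Hypothetical helper: an Override on the exact part name wins,
    otherwise the Default registered for the file extension is used,
    otherwise the result is an empty string.
    """
    root = ET.fromstring(content_types_xml)
    overrides = {
        o.get("PartName").lower(): o.get("ContentType")
        for o in root.findall(CT_NS + "Override")
    }
    defaults = {
        d.get("Extension").lower(): d.get("ContentType")
        for d in root.findall(CT_NS + "Default")
    }

    resolved = {}
    for name in entry_names:
        # Extension is whatever follows the last dot of the final path segment.
        basename = name.rsplit("/", 1)[-1]
        ext = basename.rsplit(".", 1)[1].lower() if "." in basename else ""
        # Part names in [Content_Types].xml carry a leading slash.
        part_name = "/" + name.lower()
        resolved[name] = overrides.get(part_name) or defaults.get(ext, "")
    return resolved
```

With the HelloWorld manifest, an entry such as _rels/.rels picks up its type from the Default for the rels extension, while [Content_Types].xml itself falls through to the plain application/xml default, as in the listing above.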

Also notice that each of the items has been given a child element called expand – this is a placeholder for the documents which we are going to expand in situ to create our flat representation of the OPC package content. The pipeline step to achieve that expansion is quite straightforward:

  <p:viewport name="archive-content" match="c:file[contains(@resolved-type,'xml')]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
    </cx:unzip>
  </p:viewport>

At this point, we're only expanding the content that looks like it is XML – a fuller implementation would expand non-XML content and BASE64 encode it (perfectly doable with XProc).
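
To make the "fuller implementation" concrete, here is a small Python sketch of the flattening step that expands XML parts in place and Base64-encodes everything else. It is an assumption-laden illustration, not the XProc pipeline itself: flatten is a hypothetical function, the un-namespaced zipfile and file element names and the encoding attribute are my own choices, and the resolved types are passed in as a plain dictionary.

```python
import base64
import zipfile
import xml.etree.ElementTree as ET


def flatten(package, resolved_types):
    """Build one XML tree holding the whole content of an OPC package.

    `package` is a path or binary file-like object; `resolved_types`
    maps entry names to MIME types. Entries whose type mentions 'xml'
    are parsed and embedded as child elements; anything else is stored
    as Base64 text. (Element and attribute names are illustrative.)
    """
    flat = ET.Element("zipfile")
    with zipfile.ZipFile(package) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            mime = resolved_types.get(info.filename, "")
            entry = ET.SubElement(
                flat, "file", {"name": info.filename, "resolved-type": mime}
            )
            data = zf.read(info.filename)
            if "xml" in mime:
                # Same test as the viewport match: the type looks like XML.
                entry.append(ET.fromstring(data))
            else:
                entry.set("encoding", "base64")
                entry.text = base64.b64encode(data).decode("ascii")
    return flat
```

Serializing the returned tree with ET.tostring gives a single flat document, and the Base64 branch keeps images and other binary parts available for a later round trip back into a package.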

The result of applying this process is a rather large document, with all the expand elements referring to XML documents replaced by that XML document content … in other words, a flat OPC file. With the additional metadata we have placed on the containing c:file elements, we have enough information to start performing schema validation. I will look at validation in more depth in the next part of this post …

OpenOffice.org becomes LibreOffice

Or, as The Register characteristically puts it, “OpenOffice files Oracle divorce papers”.

This is a very interesting development, and the new LibreOffice project looks much more like a normal community-based open-source project than OpenOffice.org ever did, with its weird requirement that contributors surrender their copyright to Sun (then Oracle). The purpose of that requirement always seemed to me to be to enable Sun/Oracle, as the copyright holder, to skip around the viral nature of the GPL and strike deals with other corporations over the code base (so you won't see all the source code for IBM Lotus Symphony freely available, for example). Another consequence was that some useful work done by the Go-OOo project never found its way back into OpenOffice.org. Now, though, we learn “that the enhancements produced by the Go-OOo team will be merged into LibreOffice, effective immediately”. In particular I hope this will see better support for OOXML in the future – surely a necessity if LibreOffice is ever to succeed in the substitution game.

One wrinkle is the “cease fire” agreed between Microsoft and Sun (and inherited by Oracle) in which OpenOffice appeared to be granted safety from patent action by Microsoft. Presumably this will not apply to the new LibreOffice project …

While this development seems like it might be very good news for open source office suites, it is very unfortunate that the brand has been fragmented with yet another new name for would-be users to get their heads round.