Beedle dee, dee dee dee, two formats [updated]

Punch & Judy

Two pieces of recent news offer an interesting commentary on the continuing evolution of office document formats and applications.

In Germany, following last year’s pained presentation on its attempts to adopt ODF, the city of Freiburg, along with a number of other European public administrations, has been funding an open source project to improve OpenOffice’s support for the OOXML format. The goal is to advance the case that users can switch from the MS Office suite to an open source alternative, confident in their ability to interoperate between the two. The project is bearing fruit, with its first results claiming to have fixed “three of five biggest OOXML support issues”.

The news is timely with – it seems – a vote imminent in Freiburg’s city council to decide whether to “end its floundering migration of OpenOffice and to stop using the Open Document Format” … it will be interesting to see how this plays out.

[Update 2012-11-21: It appears Freiburg has voted to abandon its plan to migrate to OpenOffice (and for them it seems this means ODF too)].

Meanwhile, in Portugal …

But ODF advocates need not despair. While in Freiburg it might seem that ODF is the boat anchor holding FOSS office suites back, a thousand miles away in Portugal the Portuguese Government has published a list that mandates a number of open standards to be adopted in the Portuguese public administration. ODF is there; OOXML is not. The only wrinkle is that the version of ODF specified is 1.1 — which, interestingly, is the very version of ODF that MS Office currently happens to support.

Double the fun?

This raises the intriguing possibility that in one part of Europe officials may be using FOSS office suites to work with OOXML documents, while elsewhere MS Office will be used to work with ODF documents. If nothing else, this kind of thing is likely to accelerate the demand from users for developers and standardizers to address the remaining areas where format interoperability remains less than clear-cut. In this scenario, though, I’d also have to say I’d feel sorry for the users – especially those working on more complex documents – for whom those “less than clear-cut” areas are likely to be all too apparent …

Rethinking OOXML Validation, Part 1

Brussels ODF Plugfest venue

At the recent ODF Plugfest in Brussels, I was very interested to hear Jos van den Oever of KOffice present on how ODF’s alternative “flat” document format could be used to drive browser-based rendering of ODF documents. ODF defines two methods of serializing documents: one uses multiple files in a “Zip” archive; the other, the aforementioned “flat” format, combines everything into a single XML file. Seeing this approach in action gelled with some thoughts I’d been having on how better to validate OOXML documents using standards-based XML tools …

Unlike ODF, OOXML has no “flat” file format – its files are OPC packages built on top of Zip archives. However, some interesting work has already been done in this area by Microsoft’s Eric White in blog posts such as The Flat OPC Format, which points out that Microsoft Word™ (alone among the Office™ suite members [UPDATE: Word and PowerPoint can do this]) can already save in an unofficial flat format which can be processed with standards-based XML tools like XSLT processors.

Rather than having to rely on Word, or stick only to word processing documents, I thought it would be interesting to explore ways in which any OOXML document could be flattened and processed using standards-based processors. Ideally one would then also write a tool that did the opposite so that to work with OOXML content the steps would be first to flatten it, then to do the processing, and then to re-structify it into an OPC package.

Back to XProc

I have already written a number of blog posts on office document validation, and have used a variety of technical approaches to get the validation done. Most of my recent effort has been on developing the Office-o-tron, a hand-crafted Java application which functions primarily by unpacking archives to the file system before operating on their individual components. Earlier efforts using XProc have foundered on the difficulty of working with files inside a Zip archive – in particular because I was using the non-standard JAR URI scheme which, it turns out, is not capable of addressing items with certain names (e.g. “Object 1”) that one typically finds inside ODF documents.

However, the knowledge gained from developing Office-o-tron, together with another look at the Zip handling extension functions of the Calabash XProc processor, made me think there was a way XProc could be used to get the job done. Here’s how …

Inspecting an OPC package

OOXML documents are built using the Open Packaging Conventions (OPC, or ISO/IEC 29500-2), a generic means of building file formats within Zip archives which also happens to underpin the XPS format. OPC’s chief virtue – that it is very generic – is offset by much (probably too much) complexity in pursuit of this goal. Before we can know what we’ve got in an OPC package, and how to process it, some work needs to be done.

Fortunately, the essence of what we need consists of two pieces of information: a file inside the Zip guaranteed to be called “[Content_Types].xml”, and a manifest of the content of the package. XProc can get both of these pieces of information for us:

<?xml version="1.0"?>
<p:pipeline name="consolidate-officedoc"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="urn:example:xo">
  <!-- NB: the xo namespace URI above is a placeholder; the original value is not shown here -->

  <p:import href="extensions.xpl"/>

  <!-- specifies the document to be processed -->
  <p:option name="package-sysid" required="true"/>

  <!--
  Given the system identifier $package-sysid of an OOXML document,
  this pipeline returns a document whose root element is archive-info
  which contains two children: the [Content_Types].xml resource
  contained in the root of the archive, and a zipfile element
  created per the unzip step at:
  -->
  <p:pipeline type="xo:archive-info">

    <p:option name="package-sysid" required="true"/>

    <cx:unzip name="content-types" file="[Content_Types].xml">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <cx:unzip name="archive-content">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <p:wrap-sequence wrapper="archive-info">
      <p:input port="source">
        <p:pipe step="content-types" port="result"/>
        <p:pipe step="archive-content" port="result"/>
      </p:input>
    </p:wrap-sequence>

  </p:pipeline>

  <!-- get the type information and content of the package -->
  <xo:archive-info>
    <p:with-option name="package-sysid" select="$package-sysid"/>
  </xo:archive-info>

  <!-- etc -->

</p:pipeline>
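
For readers without an XProc processor to hand, the same two pieces of information can be gathered with a few lines of Python. This is only a rough sketch using the standard library, not the pipeline above: the function name archive_info simply mirrors the step name, and make_demo_package builds a tiny made-up package in memory so that the example is self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml in an OPC package.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# A minimal, made-up content types document for the demo package.
DEMO_TYPES = (
    '<Types xmlns="' + CT_NS + '">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)


def make_demo_package():
    """Build a tiny stand-in for an OPC package in memory (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", DEMO_TYPES)
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr("word/document.xml", "<document/>")
    buf.seek(0)
    return buf


def archive_info(package):
    """Return the parsed [Content_Types].xml root and a manifest of Zip entries.

    `package` may be a file path or a file-like object. Each manifest item is
    a (name, size, compressed_size) tuple.
    """
    with zipfile.ZipFile(package) as zf:
        types_root = ET.fromstring(zf.read("[Content_Types].xml"))
        manifest = [(i.filename, i.file_size, i.compress_size) for i in zf.infolist()]
    return types_root, manifest


if __name__ == "__main__":
    types_root, manifest = archive_info(make_demo_package())
    print(types_root.tag)
    for name, size, compressed in manifest:
        print(name, size, compressed)
```

Unlike the JAR URI approach mentioned earlier, zipfile addresses entries by their literal names, so the square brackets in “[Content_Types].xml” need no escaping.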

Executing this pipeline on a typical “HelloWorld.docx” file gives us an XML document which consists of a composite of our two vital pieces of information, as follows:

  <Types xmlns="">
    <Override PartName="/word/comments.xml"
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml"
    <Override PartName="/word/styles.xml"
    <Override PartName="/docProps/app.xml"
    <Override PartName="/word/settings.xml"
    <Override PartName="/word/theme/theme1.xml"
    <Override PartName="/word/fontTable.xml"
    <Override PartName="/word/webSettings.xml"
    <Override PartName="/docProps/core.xml"
  <c:zipfile href="file:/C:/work/officecert/hello.docx">
    <c:file compressed-size="368" size="712" name="docProps/app.xml" date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="375" size="747" name="docProps/core.xml"
    <c:file compressed-size="459" size="1004" name="word/comments.xml"
    <c:file compressed-size="539" size="1218" name="word/document.xml"
    <c:file compressed-size="407" size="1296" name="word/fontTable.xml"
    <c:file compressed-size="651" size="1443" name="word/settings.xml"
    <c:file compressed-size="1783" size="16891" name="word/styles.xml"
    <c:file compressed-size="1686" size="6992" name="word/theme/theme1.xml"
    <c:file compressed-size="187" size="260" name="word/webSettings.xml"
    <c:file compressed-size="265" size="948" name="word/_rels/document.xml.rels"
    <c:file compressed-size="372" size="1443" name="[Content_Types].xml"
    <c:file compressed-size="243" size="590" name="_rels/.rels" date="1980-01-01T00:00:00.000Z"/>

The purpose of the information in the Types element is to tell us the MIME types of the contents of the package, either specifically (in Override elements), or indirectly by associating a MIME type with file extensions (in Default elements). What we are now going to do is add another step to our pipeline that resolves all this information so that we label each of the items in the Zip file with the MIME type that applies to it.

    <p:input port="stylesheet">
        <xsl:stylesheet version="2.0"

          <xsl:variable name="ooxml-mappings" select="document('ooxml-map.xml')"/>

          <xsl:template match="/">
              <xsl:copy-of select="/archive-info/c:zipfile/@*"/>

          <xsl:template match="c:file">
            <xsl:variable name="entry-name" select="@name"/>
            <xsl:variable name="toks" select="tokenize($entry-name,'\.')"/>
            <xsl:variable name="ext" select="$toks[count($toks)]"/>
              <xsl:copy-of select="@name"/>
              <xsl:variable name="overriden-type"
              <xsl:variable name="default-type"
              <xsl:variable name="resolved-type"
                select="if(string-length($overriden-type)) then $overriden-type else $default-type"/>
              <xsl:attribute name="resolved-type" select="$resolved-type"/>
              <xsl:attribute name="schema"
              <expand name="{@name}"/>


You’ll notice I am also using an XML document called “ooxml-map.xml” as part of this enrichment process. This file contains the (hard won) information about which documents of which MIME types are governed by which schemas, as published as part of the OOXML standard. That document is available online here.

The result of running this additional step is to give us an enriched manifest of the OPC package content:

<c:zipfile xmlns:c=""
  <c:file name="docProps/app.xml"
    <expand name="docProps/app.xml"/>
  <c:file name="docProps/core.xml"
    <expand name="docProps/core.xml"/>
  <c:file name="word/comments.xml"
    <expand name="word/comments.xml"/>
  <c:file name="word/document.xml"
    <expand name="word/document.xml"/>
  <c:file name="word/fontTable.xml"
    <expand name="word/fontTable.xml"/>
  <c:file name="word/settings.xml"
    <expand name="word/settings.xml"/>
  <c:file name="word/styles.xml"
    <expand name="word/styles.xml"/>
  <c:file name="word/theme/theme1.xml"
    <expand name="word/theme/theme1.xml"/>
  <c:file name="word/webSettings.xml"
    <expand name="word/webSettings.xml"/>
  <c:file name="word/_rels/document.xml.rels"
    <expand name="word/_rels/document.xml.rels"/>
  <c:file name="[Content_Types].xml" resolved-type="application/xml" schema="">
    <expand name="[Content_Types].xml"/>
  <c:file name="_rels/.rels"
    <expand name="_rels/.rels"/>
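
The resolution rule itself (an Override for the exact part name wins, otherwise the Default for the file extension applies) is small enough to sketch in Python. The following is an illustration under simplifying assumptions rather than a port of the stylesheet: the function name resolve_types and the sample data are made up, part names are compared exactly rather than case-insensitively, and the schema lookup via “ooxml-map.xml” is left out.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for elements in [Content_Types].xml.
CT = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Made-up sample data in the shape of a real [Content_Types].xml.
SAMPLE_TYPES = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
            ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def resolve_types(types_root, entry_names):
    """Map each Zip entry name to its MIME type ('' if none applies)."""
    overrides = {
        el.get("PartName"): el.get("ContentType")
        for el in types_root.findall(CT + "Override")
    }
    defaults = {
        el.get("Extension").lower(): el.get("ContentType")
        for el in types_root.findall(CT + "Default")
    }
    resolved = {}
    for name in entry_names:
        # The extension is whatever follows the last dot of the last path segment.
        last_segment = name.rsplit("/", 1)[-1]
        ext = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
        # Part names in Override start with "/", Zip entry names do not.
        resolved[name] = overrides.get("/" + name) or defaults.get(ext, "")
    return resolved


if __name__ == "__main__":
    names = ["word/document.xml", "_rels/.rels", "[Content_Types].xml", "media/image1.png"]
    for name, mime in resolve_types(ET.fromstring(SAMPLE_TYPES), names).items():
        print(name, "->", mime or "(unresolved)")
```

As in the enriched manifest, “[Content_Types].xml” itself falls through to the Default for the “xml” extension.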

Notice also that each item in the enriched manifest has been given a child element called expand – this is a placeholder for the documents which we are going to expand in situ to create our flat representation of the OPC package content. The pipeline step to achieve that expansion is quite straightforward:

  <p:viewport name="archive-content" match="c:file[contains(@resolved-type,'xml')]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
    </cx:unzip>
  </p:viewport>

At this point, we're only expanding the content that looks like it is XML – a fuller implementation would expand non-XML content and BASE64 encode it (perfectly doable with XProc).
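
To make the idea of the fuller implementation concrete, here is a hedged Python sketch of the flattening step that embeds XML parts in place and BASE64-encodes everything else. It is not the XProc step: the function name flatten, the un-namespaced zipfile and file element names, and the encoding attribute are all illustrative assumptions, and the mapping of entry names to MIME types is taken as given.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET


def flatten(package, resolved_types):
    """Return one XML element holding every entry of the Zip package.

    `resolved_types` maps entry names to MIME types. Entries whose type
    mentions 'xml' are parsed and embedded as child elements; all other
    entries are stored as BASE64 text.
    """
    root = ET.Element("zipfile")
    with zipfile.ZipFile(package) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            mime = resolved_types.get(info.filename, "")
            entry = ET.SubElement(
                root, "file", {"name": info.filename, "resolved-type": mime}
            )
            data = zf.read(info.filename)
            if "xml" in mime:
                entry.append(ET.fromstring(data))
            else:
                entry.set("encoding", "base64")
                entry.text = base64.b64encode(data).decode("ascii")
    return root


if __name__ == "__main__":
    # Build a tiny made-up package in memory to exercise the function.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document><p>Hello</p></document>")
        zf.writestr("media/blob.bin", b"\x00\x01\x02")
    flat = flatten(buf, {"word/document.xml": "application/xml"})
    print(ET.tostring(flat, encoding="unicode"))
```

Going the other way (the “re-structify” step) would walk the file elements, serialize or decode each one, and write it back into a Zip archive under its name attribute.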

The result of applying this process is a rather large document, with all the expand elements referring to XML documents replaced by that XML document content … in other words, a flat OPC file. With the additional metadata we have placed on the containing c:file elements, we have enough information to start performing schema validation. I will look at validation in more depth in the next part of this post …

OpenOffice.org becomes LibreOffice

Or, as The Register characteristically puts it, “OpenOffice files Oracle divorce papers”.

This is a very interesting development, and the new LibreOffice project looks much more like a normal community-based open-source project than OpenOffice.org ever did, with its weird requirement that contributors surrender their copyright to Sun (then Oracle). The purpose of that always seemed to me to be that it enabled Sun/Oracle, as the copyright holder, to skip around the viral nature of the GPL and strike deals with other corporations over the code base (so you won't see all the source code for IBM Lotus Symphony freely available, for example). Another consequence was that some useful work done by the Go-OOo project never found its way back into OpenOffice.org – now though we learn “that the enhancements produced by the Go-OOo team will be merged into LibreOffice, effective immediately”. In particular I hope this will see better support for OOXML in the future – surely a necessity if LibreOffice is ever to succeed in the substitution game.

One wrinkle is the “cease fire” agreed between Microsoft and Sun (and inherited by Oracle) in which OpenOffice appeared to be granted safety from patent action by Microsoft. Presumably this will not apply to the new LibreOffice project …

While this development seems like it might be very good news for open source office suites, it is very unfortunate that the brand has been fragmented with yet another new name for would-be users to get their heads round.

SC 34 Meetings in Tokyo

Tokyo Tower

Last week I attended a busy week of meetings of ISO/IEC JTC 1/SC 34 in Tokyo. Here’s an update on what is going on in the International Standard document format scene …

(For reference, the meeting resolutions are here).


The DSDL project is drawing to a conclusion, and one of its final parts, Part 11 (“Schema Association”) is now ready to progress to DIS (Draft International Standard) stage, following its passage (without either dissent or comment) through the CD (Committee Draft) stage: congratulations to Project Editor Jirka Kosek!

One of the interesting things about this project is procedural: it is a standard being developed in parallel with the W3C (you can see their version of it here). I encourage everybody to take a look and report any comments to our mailing list.


E-book markets are taking off worldwide, and EPUB, standardized by the International Digital Publishing Forum (IDPF) – a consortium – is emerging as the dominant format. Although there is an International Standard in this space, IEC 62448, it is fair to say, I think, that in the market EPUB has clearly won. IEC 62448 is up for revision, but technically it appears to lack some key features National Bodies expect to see in International Standards – notably comprehensive support for BIDI (bi-directional text) as used by many Middle Eastern writing systems. I am sure National Bodies (including the UK’s) will be monitoring this space very closely: in 2010 it is surely not acceptable to produce International (International!) Standards which ignore large portions of this planet’s population.

Following some behind-the-scenes discussion between IDPF and SC 34 people, it has been agreed that it could be worthwhile exploring whether and how the EPUB format could undergo International Standardization in SC 34. EPUB builds on some SC 34 technologies (notably DSDL), makes use of ZIP, and can find a suitable group of experts in SC 34 with a broad range of documentation and publishing experience. To progress matters, a resolution was agreed to ballot whether an exploratory study should be initiated:

SC 34 establishes an ad hoc group on EPUB of IDPF with the following terms of reference:
  • to discuss with IDPF whether their EPUB work should be standardized at SC 34 and to present a plan for such standardization and any other recommendations to the next SC 34 Plenary in 2011 March.
  • to determine the major stakeholders concerned with EPUB standardization and propose additional liaisons that would enable these stakeholders to be represented in any standardization process.
Membership is open to ISO/IEC JTC 1/SC 34 P and O members, liaison organizations, and subgroup representatives.

SC 34 appoints Dr. Makoto MURATA (Japan) and Dr. Yong-Sang CHO (Korea) as the Co-Conveners of this ad hoc group.

Personally, I think that since IDPF has done such a good job on EPUB already it would be a shame to do anything that would risk that ongoing goodness, and that some kind of parallel standardization activity with SC 34 would be appropriate – perhaps in the same kind of way that SC 2 and the Unicode Consortium keep ISO/IEC 10646 and Unicode in parallel. We shall see …

ISO “Zip”

Back in March I blogged about how SC 34 had opened a ballot on standardizing aspects of the “Zip” format. The proposal failed, with 10 countries voting against starting work (11 countries were in favour on this question and 10 abstained – so the necessary super-majority was not obtained).

Nevertheless in discussion in Tokyo it emerged that National Bodies were broadly in favour of an eventual “Zip” standard (or at least a standards-compatible specification) of some kind, and that it was more the nature of the proposal, rather than its aim, that was in question. Since I was the person who drafted the proposal, this is in large part mea culpa!

One of the chief reasons why the ballot failed was U.S. corporate influence. With IBM spearheading the effort, and the likes of Oracle and Microsoft generally going along with it, committee members in several countries were – I hear – energized to oppose the proposal. One particular concern was that PKWare, Inc. had had no opportunity to have input in the process. This concern is certainly reasonable, and there is no doubt that PKWare is quite a major stakeholder. But “Zip” is bigger even than that since it sits at the heart of so many specifications (e.g. ODF, OOXML and EPUB) and is in the foundations of many other technologies (e.g. Java, Linux … and Windows). So, to gather the widest possible stakeholder input going forward to a possible second try at standardizing “Zip”, SC 34 resolved to embark upon a study period:

SC 34 accepts the WG 1 recommendation contained in SC 34 N 1494 to initiate a study period with aim of establishing a firmer rationale for standardization of aspects of the “Zip” format.

SC 34 asks WG 1 that a report be submitted in time for consideration at the SC 34 meetings in Prague in 2011-03 and that time be allocated to this activity during the WG 1 meeting in Beijing in 2010-12.

SC 34 instructs its Secretariat to issue a call for participation in this Study Period to SC 34 national and liaison member bodies.

SC 34 requests the SC 34 chairman bring this activity to the attention of members of JTC 1, and other SC chairs, at the upcoming JTC 1 Plenary in Belfast.

One delegate told me the “Zip” vote had engendered some very heated arguments, more intense than those even of the OOXML wars – and many experts in SC 34 take the high politics already in evidence as an indication that there may be “unknown unknowns” surrounding this format. In my view, “Zip” is simply too important for it to have continuing IPR/technical uncertainties and I would like to see those put to rest by the standardization process.


From Monday to Wednesday in WG 4 work continued on the OOXML Standard: primarily fixing defects (and in particular fixing problems with date handling – a fix important enough to warrant its own amendments). One positive development is that some more large-scale thinking about textual reform has started, with drafting of a possible revised version of Part 3 (“Markup Compatibility and Extensibility” aka MCE) underway. And for the gigantic Part 1 (5,572 pages) some experimentation removing redundant tables of elements has shown we can achieve a 10% decrease in size, and we can lose several hundred pages by moving the tutorial material of the “Primer” out of the text. Ultimately it is this kind of thoroughgoing re-organisation and re-writing (currently still just an experiment) that will redeem the text.

Can there be redemption for Microsoft, whose Office 2010 product has now hit the shelves using the deprecated transitional variant of OOXML and a load of Microsoft extensions? Well, in time, maybe …

There has been much discussion in WG 4 about how to standardize Microsoft’s extensions which – although they use the extension mechanisms described by IS 29500 – are not themselves described in any standard. They’re currently documented on MSDN. How should they be standardized? In a multi-part Standard? In a registry? Or what? Ultimately WG 4 concluded we should do nothing – we are not hearing any market demand for standardizing Microsoft’s extensions and so we will wait. Of course this means that as Microsoft adds more and more extensions to subsequent versions of Office the proportion of it described by the text of IS 29500 will diminish. We shall have to wait and see what the market thinks about that. Personally, I feel it is critical that procurers of OOXML-based suites pay careful attention to this aspect of MS Office and (I have written this before) know that MS Office 2007 – not 2010 – is the only version which (modulo bugs/defects) conforms to OOXML unextended. It is my guess that future large-scale procurers of MS Office may want to specify which extensions they want (maybe none), and I would like to see the conformance language of OOXML beefed up to make such procurement specifications easier.

Following the announcement of Microsoft’s plans that future versions of MS Office will use OOXML Strict, it was interesting to speculate how this leap was going to be achieved. One particularly horrid suggestion was that MS Office should effectively continue to use OOXML Transitional (but within a Strict wrapper) by relying on some new “extensions” that contained deprecated Transitional markup. On the other hand, using such extensions purely to provide graceful degradation for legacy applications seemed like an excellent idea. One of the key benefits of a move to OOXML Strict is that developers targeting future MS Office versions will be able (if it supports Strict properly) to ignore the 1,500 pages of nastiness that is the Transitional format. That would be a definite win.

For fun, during the week, I used SoftMaker’s office suite, which claims support for OOXML. Rather than lengthen even further an already over-long posting, I will save a report on this software for a later posting …


WG 6 (concerned with ODF) met for a teleconference at 7am on Thursday and the convenor reported on this, and other ODF-related matters, later that day.

The chief activity in WG 6 at the moment is the creation of an amendment to ISO/IEC 26300 (the International Standard version of ODF) so that it aligns properly with ODF 1.1. The result of this activity will be that the latest version of ODF will be in sync between JTC 1 and OASIS. The drafting work for this is nearly done, with some final tests and tweaks being made to ensure everything has been squared between the variants.

An interesting issue has arisen with the submission to WG 6 by the Dutch of a proposed amendment to ODF which would improve its change tracking to the point where they judged it could be acceptable for government use. Now, the flaky/incomplete nature of change tracking in ODF and its office suites has been something of an elephant in the room ever since Microsoft’s Doug Mahugh blogged about it, and I hear that Microsoft have used some nifty demonstrations of OpenOffice’s change tracking to quell the enthusiasm of large procuring bodies who were considering stepping out of line and switching away from MS Office. Indeed, I believe OASIS’s own Open Document Format Interoperability and Conformance (OIC) TC has a test-case which demonstrates interop problems with change tracking.

So one would have thought that the ODF TC – if they had not done it already – would have leapt at the Dutch proposal as a way to close an important competitive gap … But no, from the minutes of a recent meeting we learn that “the change tracking proposals is a topic for ODF-Next rather than ODF 1.2”; and so it has been deferred to ODF-Next (i.e. two versions of ODF into the future). As I understand it, the ODF TC wishes to meet a certain deadline for release of ODF 1.2, which has already been a fair while in drafting. In a little Twitter exchange I had with a co-chair of the ODF TC (Rob Weir) on this topic he tweeted that ODF perfection was not required. I leave it to readers of this blog to judge whether a desire for solid comprehensive change-tracking is really the same as an unrealistic demand for “perfection” (which, I agree, would be unreasonable).

I can certainly imagine ISO and IEC members being very reluctant to pass a version of ODF (should 1.2 ever come to JTC 1) that does not have a convincing story to tell about change-tracking.


The week of meetings finished with a 1½ day BRM (ballot resolution meeting) for DIS 14297, also known as “UOML (Unstructured Operation Markup Language) Part 1 Version 1.0” (get it here). Since the process used was an accelerated procedure (called PAS, which is practically the same as a Fast Track) and since the DIS ballot had failed, the meeting had many potential parallels with the OOXML BRM of 2008, and so I was particularly interested to attend – not this time as convenor, but out of the firing-line as a member of the UK delegation.

The problem with UOML 1.0 is that it is, in nearly every aspect, of very poor quality – it is almost gibberish, and obviously so to anybody who cares even to glance at it. It makes OOXML look like something written by Bertrand Russell. How the OASIS members could have approved it as a standard boggles the mind; how the OASIS Board could have okayed it for PAS submission boggles the mind; and how it nearly passed its JTC 1 ballot boggles the mind. Fortunately, just enough National Bodies voted against the DIS to make it fail its ballot, and so on Friday afternoon a group of us found ourselves in a room with 139 comments to resolve. After triage we found that left us with 10 minutes per comment – compared to the OOXML BRM this was luxury!

Many aspects of the UOML BRM were familiar: voting on comments in batches, NBs feeling grumpy about having compromised (rather than really good) dispositions; other NBs feeling grumpy about lack of time. Credit must go to the new consulting project editor Joel Marcey, who had valiantly deployed his skills to make very substantial improvements to the text prior to the meeting. And credit must go to Paul Cotton (“The convenor’s convenor”) whose in-the-trenches experiences of chairmanship (SQL, HTML 5, ...) were in evidence to ensure a productive and good-natured meeting.

But in the end it just was not enough. All the dispositions were resolved one way or the other, but a number of NBs had rejected proposals and there was a general perception (by my reckoning) that the standard remained unimplementable in a conformant manner.

So yet again we have an instance of a poorly-drafted standard coming into a process that finds it very difficult to take the strain when there are a reasonably large number of NB comments. I can imagine accelerated standardization working well when a standard is small and perfectly formed but when a standard is large and/or buggy (and when NBs bother to read it), trouble is almost bound to ensue. Although the JTC 1 Directives have been recently revised they do very little to address this particular problem of the BRM being potentially a crazy-time in which major changes are carried out in a very compressed time period. Reform in this area is still badly needed, I believe.

And away from the committees …

Tokyo was quite an experience: extremely humid and hot most of the time, but with one day of torrential rain as we experienced the outer bands of a passing typhoon. For once I was glad to be able to spend most of the time in air-conditioned meeting rooms, even if this meant there were disappointingly few opportunities for photography.

Food was a highlight, especially since Murata-san guided us to a number of truly outstanding eating experiences including a sashimi banquet, a more rustic meal starring yakitori chicken, and a meal which built to a climax of tuna shabu-shabu. All this was washed down with a variety of rice-based, sugar cane-based and potato-based wine and spirits. Yumsk! (and, hic!)

Hello sake!

[Update: my pictures from the week are here.]