This is one in a series of popular blog articles I am re-publishing from the old Griffin Brown blog which is now closed down. This article is from April 2008. It is the same content as the original (except for some hyperlink freshening).
At the time of posting this entry caused quite a furore, even though its results were – to me anyway – as expected. Looking back I think what I wrote was largely correct, except I probably underestimated the difficulty of converting Microsoft Office to use the Strict variant of OOXML — this would require more than surgery just to the de-serialisation code!
I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.
Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...
The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.
So we have a document and a RELAX NG schema. All that's necessary now it to use jing (or similar) and we can validate ...
Validating against the STRICT model
The STRICT conformance model is quite a bit different from Ecma 376, essentially because most of that format's most notorious features (non ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.
Sure enough, jing emitted 17MB (around 122,000) of invalidity messages when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values: I would expect a document which exercised a wider range of features to generate a more diverse set of error message.
Validating against the TRANSITIONAL model
The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibilty with Ecma 376 and to preserve XML structures at which legacy Office features could be targetted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.
Sure enough (again) the result is as expected: relatively few messages (84) are emitted and they are all of the same type complaining e.g. of the element:
since the allowed attribute values for val are now "true", "false", etc. — this was one of the many tidying-up exercices performed at the BRM.
Such a test is only indicative, of course, but a few tentative conclusions can be drawn:
- Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
- Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
- Making them conform to the TRANSITIONAL will require less of the same sort of surgery (since they're quite close to conformant as-is)
Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.
To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?