Notes on Document Conformance and Portability #1

Richard Gillam’s handy book, Unicode Demystified: A Practical Programmers Guide to the Encoding Standard, contains an example of right-to-left text appearing in a prevailing left-to-right writing direction:

Avram said “מזל טוב.‏” and smiled.

Whether you see here what you are meant to see here will depend on your browser's Unicode support, and whether you have Hebrew fonts installed. Properly rendered, it will look something like this:

In reading order, the first character after “said” is the “מ” character to the left of the closing quotation mark. The text then runs from right to left until the full-stop, and then resumes with “and smiled”. In Unicode, this text is not represented in rendering order, but reading order – it is up to the renderer to make space and reverse direction at the correct points. Here is the text represented as XML in a paragraph in an ODF document (get the document here):

<text:p>Avram said “&#x5de;&#x5d6;&#x5dc; &#x5d8;&#x5d5;&#x5d1;.&#x200f;” and smiled.</text:p>

One of the great things about XML is its solid basis in Unicode and therefore its use of the Universal Character Set (ISO/IEC 10646). XML defines a number of encodings for this character set, and in the XML above the numeric character reference mechanism is used for the Hebrew characters. Notice, just to the left of the full stop the use of U+200F 'RIGHT-TO-LEFT MARK' which specifies that the full stop is part of the right-to-left character sequence.

Viewing this document in three ODF applications (OpenOffice 3, Google Docs with FireFox, and the new MS Office 2007 SP2) give the correct result every time. That is good news.

And if, for an ODF application, the character sequence did not appear correctly (if, say, the full stop was out-of-place) we would be able to say unequivocally that it was faulty; and we would be able to point to the Unicode specification where the correct behaviour was described. We (the user) would be able to bang the table and demand the bug was fixed.

This kind of process is one one of the pillars of conformance testing: application conformance testing, to be exact. Where we have a solid spec and observable behaviour we can compare the two and make a judgement.

Where we don't have a solid spec, things get trickier. For the standardiser's viewpoint, and if its not too highfalutin (and anyway, I claim Cambridge resident's special rights), we might want to quote Wittgenstein on such occasions: "Whereof one cannot speak, thereof one must be silent".

SC 34 Meetings, Prague, Days 2, 3 & 4

What a Difference a Day Makes
MURATA Makoto, WG 4 convenor,
seems pleased with progress on the maintenance of OOXML

I had been intending to write a day-by-day account of these meetings but as it turned out there simply was not time for blogging (tweeting, on the other hand …). Another activity which suffered was photography: I had wanted to take a load of pictures of über-photogenic Prague – but somehow the work seemed to expand into all my notionally free time too. What I did manage to snap is here. And it is also well worth checking out Doug Mahugh’s photos for more.

WG 1 met again on Tuesday to finish its business (you can read the meeting report here) and then my attention mostly turned mostly to the activities in WG 4 (for OOXML maintenance) and WG 5 (which is for OOXML / ODF interop).

OOXML One Year On

Overall, WG 4 is making excellent progress. To date, 169 defects have been collected for OOXML (check out the defect log) and the majority of these have either been closed, or a resolution agreed. Amendments were started for 3 of the 4 parts of OOXML to allow a bunch of small corrections to be put in place, and the even-more-minor problems will be fixed by publishing technical corrigenda. Overall, I think stakeholders in OOXML can feel pretty confident that the Standard is being sensibly and efficiently maintained.

I was personally very pleased to see National Bodies well-represented (the minutes are here) – to the extent that I’d now ideally like to see some more big vendors coming to the table so their views can be heard. Microsoft (of course) was; but where (for example) are Apple, Oracle and the other vendors who participated in Ecma TC 45 while OOXML was being drafted? To them – and to anybody who wants to get involved in this important work – I say: participate!

Over at Rick Jelliffe’s blog Rick has been carrying out something of an exposé of the unfortunate imbalance in the stakeholders represented in the maintenance of ODF at OASIS (something which will become even more acute if Sun is, in the end, snapped-up by IBM). Personally I think Rick is right that it is vitally important to have a good mix of voices at the standardisation table: big vendors, small vendors, altruistic experts, users, government representatives, etc. WG 4 is getting there, but it too has some way to go.

Tougher Issues

Some more controversial topics were however not resolved during the meeting, and I think it may be worth exploring these in more detail:

The Namespace Problem

One issue in particular has proved thorny: whether to changes Namespaces in the OOXML schemas. This topic took a good slice of Tuesday, and then segued into a bar session afterwards; this then carried on over supper and by the time we broke it was midnight. And still no conclusion has been reached. WG 4 has issued a document outlining the state of discussions to date …

I have already expressed my own views on this; but during these Prague meetings some important new considerations were brought to bear. Go over to Jesper Lund Stocholm’s blog to get the thorny details

Personally, I think the “strict” format is a new format, and that changing the Namespace is only part of the solution. I would like to see the media type changes and for OOXML to recommend a new file extension (.dociso anyone?) to reduce the chances that users suffer the (to me) unacceptable fate of silent data loss that Jesper highlights.

Whither Transitional?

Another hot topic of discussion is what the “transitional” version of OOXML is really for. One interesting and slightly surprising fact that emerged after the BRM is that the strict schemas are a true subset of the transitional schemas. Should this nice link be preserved? Microsoft are keen for new features introduced in the “strict” version of OOXML to be mirrored in the “transitional” version – presumably, in part, because Office 14 will use transitional features.

Openness

When the maintenance of OOXML was being planned, one of the principles agreed on by the National Bodies was that the process should be as open as possible, consistent with JTC 1 rules. One aspect of this is the question of whether WG 4’s mailing list archive should be open to the public. Some NBs were a little nervous of this for the reason that their committee members might be less free to post candid comments if they were open to public scrutiny, and possible repercussions with their boss and/or the tinfoil brigade in the blogosphere. There was also the troubling precedent of the mailing list of the U.S. INCITS V1 committee, which opened its archive to public view during the DIS 29500 balloting period, only to see it die completely as contributors refused to post in public view.

The issue will now be put to public ballot, and I am hopeful that mechanisms have been put in place which will allow NBs to support opening of the archive. With public standards, public meeting reports, public discussion documents and a public mailing list archive I think WG 4 will demonstrate that an excellent degree of openness is indeed possible even within the constraints of the current JTC 1 Directives.

Overturning BRM decisions

The UK proposed an interesting new defect during the Prague meetings, which centred on one of the decisions made at the BRM.

Nature of the Defect:

As a result of changes made at the BRM, a number of existing Ecma-376 documents were unintentionally made invalid against the IS29500 transitional schema. It was strongly expressed as an opinion at the BRM by many countries that the transitional schema should accurately reflect the existing Ecma-376 documents.

However, at the BRM, the ST_OnOff type was changed from supporting 0, 1, On, Off, True, False to supporting only 0, 1, True, False (i.e. the xs:boolean type). Although this fits with the detail of the amendments made at the BRM, it is against the spirit of the desired changes for many countries, and we believe that due to time limitations at the BRM, this change was made without sufficient examination of the consequences, was made in error by the BRM (in which error the UK played a part), and should be fixed.

Solution Proposed by the Submitter

Change the ST_OnOff type to support 0, 1, On, Off, True and False in the Transitional schemas only.

The result of the BRM decision being addressed here was apparent in a blog entry I wrote last year, which attracted rather a lot of attention.

Simply put, the UK is now suggesting the BRM made a mistake here, and things should be rectified so that existing MS Office documents “snap back” into being in conformance with 29500 transitional.

This proposal caused some angst. Who were we (some asked) to overturn decisions made at the BRM? My own view is less cautious: this was an obvious blunder, the BRM got it wrong (as it did many things, I think). So let’s fix it.

Whither WG 5?

In parallel with WG 4, WG 5 (the group responsible for ODF/OOXML interoperability) also met. One of the substantive things it achieved was to water down the title of the ongoing report being prepared on this topic, changing it from:

OpenDocument Format (ISO/IEC 26300) / Office Open XML (ISO/IEC 29500) Translation

to:

OpenDocument Format (ISO/IEC 26300) / Office Open XML (ISO/IEC 29500) Translation Guidelines

Adding the word “guidelines” to the title should make it clear to anybody noticing this project that it is not an “answer” to ODF/OOXML interoperability, merely a discursive document. For myself, I have doubts about the ultimate usefulness of such a document.

It is disappointing to see the poor rate of progress on meaningful interoperability and harmonisation work. Of course these things are motherhood and apple pie in discussion – but when the time comes to find volunteers to actually help, few hands go up. In my view, the only hope of achieving any meaningful harmonisation work is to get Another Big Vendor interested in backing it, and I know some behind-the-scenes work will be taking place to beat the undergrowth and see if just such a vendor can be found.

SC 34 Meetings, Okinawa – Day 5 and Summary

Okinawan Entertainment
Singer Azusa Miyagi from the Okinawan pop duo Tink Tink

Day 5 found us all attending a joint session of the working groups to sort out more administrative details and share the recommendations made by WG 5. Anybody interested in seeing the state of play in WG 4’s work can consult the WG 4 web site, where progress on defect correction can be tracked.

Overall this has been I think a successful meeting: the two new working groups are up and running and their work is well underway. There was perhaps the occasional trace of residual angst overhanging from last year: NBs are keen to assert their sovereignty in the decision making process, and the Ecma delegates are keen to be assured the JTC 1 processes can deliver the mechanisms and timeliness needed to keep IS29500 in shape. In general however, there has been a decided “unclenching” as delegates warmed to the (let’s face it) sometime drudgery of maintaining XML document formats. This was all helped by the exceptional hospitality shown by JISC and ITSCJ in hosting the meetings, and in particular by the efforts of WG 4 convenor Murata Makoto. Whenever one wanted to know where to eat, what to drink, or where the prettiest singers could be found, Murata-san was your man!

It was great also to work with Jesper Lund Stocholm, who has also been covering these meetings on his blog. It would be better still if more countries followed Denmark, and more companies followed the shining example of CIBER, in supplying experts for assisting in this important work.

It was something of a shock coming from the 23°C sunshine of Okinawa to freezing snow-bound Britain. And also a shock to review the amount of standards work piling up to be done before the next SC 34 meetings in Prague: defects to be filed, maintenance agreements to be hammered out, agendas to write, ballots to vote on and proposals to draft. I am expecting the Prague meeting to be particularly vibrant, not least since it is preceded by XML Prague 2009. I have not been to an XML Prague before, but have heard only good things about it. Certainly, the programme looks fascinating (though I make no claims for my own presentation). It certainly seems that Prague is going to be the centre of the world for XML-heads everywhere in late March …

SC 34 Meetings, Okinawa - Days 3 & 4

Hotel Walkway at Sunset
Hotel Walkway at Sunset

Two days of hard grind. A lot of administrivia to sort (meeting dates, etc.); many paragraphs of the Directives to read; many defect reports on OOXML to address; and some vigorous discussion to be had about interoperability.

Some concrete progress was made, notably:

  • The first defect reports on ISO/IEC 29500 (aka OOXML) were addressed, and fixes agreed
  • Some principles were established how updates (as opposed to fixes) for OOXML might be processed
  • Some useful discussions in WG 5 clarified the scope of the ongoing work drafting a technical report giving guidance on how 29500 (OOXML) and 26300 (ODF) can interoperate

From my perspective, the most exciting discussion during these meetings centred on a presentation from the ODF editor, Patrick Durusau, on what he called “true” interoperability. Patrick (betraying his Topic Maps background) set out a suggestion that a PSI might be created to identify the document constructs described by the two document format Standards, and that each PSI might be in turn associated with metadata and documentation related to that construct. Essentially, this approach views the “problem” of interoperability between ODF and OOXML as a problem of documentation — though Patrick also pointed out that the interoperability problem had already been solved by corporations (maybe he meant Microsoft, for example) and that these corporations were, perhaps churlishly, keeping the information to themselves.

I see the establishment of such rich descriptive material as being a first important step on a road which leads to the dissolving of what we currently see as meaningful differences between the document formats. Perhaps in time the rest of the world will come to realise too that when we talk of a preference for ODF and OOXML we are, in the main, expressing a preference for syntax, and that the juvenile “OOXML vs ODF” arguments – however much they are loaded with corporate agendas masquerading as moral superiority – will achieve precisely nothing for those who matter: the end users.

SC 34 Meetings, Okinawa - Day 0

Okinawan Bloom
It is nice to get away from the freezing drizzle of the UK,
to the milder climes and bright sunshine of Okinawa.

I am in Okinawa for a week of ISO/IEC JTC 1 SC 34 meetings. To be precise, these are not meetings of SC 34 itself (there will be no plenary), rather the week will be taken up with two activities by parts of SC 34:

  • On Monday and Tuesday, a team picked by our Chairman will meet to discuss the maintenance procedures for ODF among themselves, and with OASIS representatives.
  • On Wednesday, Thursday and Friday SC 34’s two new working groups, WG 4 and WG 5, will meet.

These in turn will generate plenty of input for SC 34’s full Prague meeting in March.

ODF Maintenance

I have already written about the background to this activity, both the issues caused by the current lack of agreement on how maintenance should proceed, and JTC 1’s instruction to SC 34 from Nara that SC 34 and OASIS should develop a document specifying “detailed operation of joint maintenance procedures”.

At this stage the negotiations are completely informal, and expected simply to offer an opportunity for all parties to have an open discussion aimed at increasing the level of mutual understanding to a point where they are ready to start working together in earnest on drafting the agreement text. For SC 34, this text will need to be presented to members in time for consideration in Prague, at which meeting it will seek SC 34’s blessing to be passed up to JTC 1 for further consideration.

WG 4 & WG 5

Okinawa will see the first two meetings of our two new working groups, WG 4 (dedicated to maintenance of ISO/IEC 29500, aka OOXML), and WG 5 (dedicated to document file format interoperability). Both groups are expected to meet face-to-face more frequently than the rest of SC 34, and to make heavy use of the newfangled teleconferencing technology that JTC 1 has recently embraced.

WG4’s business in the short term will be largely taken up with correcting defects in the 29500 text (in JTC 1 parlance, producing corrigenda) in response to reported defects. A number of these have been submitted already, by Japan, the UK and Ecma themselves. The UK has a large number on additional ones brewing and is likely to submit a second batch in February.

WG 5’s short-term work is to concentrate on the Technical Report (a more informal document that an International Standard) being drafted which sets out some of the considerations when mapping between ISO/IEC 26300 (ODF 1.0) and ISO/IEC 29500 (OOXML). I’m wondering too whether there will be any moves in this WG to garner support for new work in this area. Now that the dust has settled over document formats themselves, even non XML experts are beginning to grok that by themselves these standards don’t actually give us that much, but are a useful foundation on which to work. “Interoperability” in particular requires so much more than simply having standardised document formats. I await developments in this space with interested anticipation …