Where is there an end of it? | All posts tagged 'ml'

Notes on Document Conformance and Portability #3

Now that the furore about Microsoft’s implementation of ODF spreadsheet formulas in Office SP2 has died down a little, it is perhaps worth taking some time to have a calm look at some of the issues involved.

Clearly, this is an area where strong commercial interests are in play, not to mention an element of sometimes irrational zeal from those who consider themselves pro or anti (mostly anti) Microsoft.

One question is whether Microsoft did “The Right Thing” by users in choosing to implement formulas the way they did. This is certainly a fair question and one over which we can expect there to be some argument.

The fact is that Microsoft’s implementation decision means that, on the face of it, they have produced an implementation of ODF which does not interoperate with other available implementations. Thus IBM blogger Rob Weir can produce a simple (possibly simplistic) spreadsheet, “Maya’s Wedding Planner”, and use it to illustrate, with helpful red boxes for the slow-witted, that Microsoft’s implementation is a “FAIL” attributable to “malice or incompetence”. For good measure he also takes a side-swipe at Sun for their non-interoperable implementation. In this view, interoperability aligning with IBM’s Symphony implementation is – unsurprisingly – presented as optimal (in fact, you can hear the sales pitch from IBM now: “well, Mr government procurement officer, looks like Sun and MS are not interoperable, you won’t want these other small-fry implementations, and Google’s web-based approach isn’t suitable – so looks like Symphony is the only choice …”)

Microsoft have argued back, of course, most strikingly in Doug Mahugh’s 1 + 2 = 1? blog posting, which appears to present some real problems with basic spreadsheet interoperability among ODF products using undocumented extensions. The MS argument is that practical ODF interoperability is a myth anyway, and so supporting it meaningfully is not possible (in fact, you can hear the sales pitch from MS now: “well, Mr government procurement officer, looks like ODF is dangerously non-interoperable: here, let me show you how IBM and Sun can’t even agree on basic features; but look, we’ve implemented ISO standard formulas, so we alone avoid that – and you can assess whether we’re doing what we claim – looks like MS Office is the only choice …”)

Personally, I think MS have been disappointingly petty in abandoning the “convention” that the other suites more or less use. I accept that these ODF implementations have limited interoperability and are unsafe for any mission-critical data, but for the benefit of the “Maya’s Wedding Planner” type of scenario, where ODF implementations can actually cut it, I think MS should have included this legacy support as an option, even if they did have to qualify that support with warning dialogs about data loss and interoperability issues.

But - vendors are vendors; it is their very purpose to compete in order to maximise their long-term profits. Users don’t always benefit from this. We really shouldn’t be surprised that we have IBM, Sun and Microsoft in disagreement at this point.

What we should be surprised about is how this interoperability fiasco has been allowed to happen within the context of a standard. To borrow Rick Jelliffe’s colourfully reported words, the whole purpose of shoving an international standard up a vendor’s backside is to get them to behave better in the interests of the users. What has gone wrong here is in the nature of the standard itself. ODF offers an extremely weak promise of interoperability, and the omission of a spreadsheet formula specification in ODF 1.1 is merely one of the more glaring facets of this problem. As XML guru James Clark wrote in 2005:

I really hope I'm missing something, because, frankly, I'm speechless. You cannot be serious. You have virtually zero interoperability for spreadsheet documents.

To put this spec out as is would be a bit like putting out the XSLT spec without doing the XPath spec. How useful would that be?

It is essential that in all contexts that allow expressions the spec precisely define the syntax and semantics of the allowed expressions.

These words were prophetic, for we do now indeed face that zero-interoperability reality.

The good news is that work is underway to fix this problem: ODF 1.2 promises, when it eventually appears, to specify formulas using the new OpenFormula specification. When that is published, vendors will cease to have an excuse to create non-interoperable implementations, at least in this area.
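
By way of illustration (a sketch only, based on my reading of the current drafts, and with invented cell values): an ODF 1.2 formula is expected to carry a namespace prefix identifying the formula language, so the much-quoted expression would appear in content.xml along these lines:

<!-- Sketch only: an ODF 1.2-style cell, with the "of:" prefix marking -->
<!-- the formula as OpenFormula; the value attributes are invented -->
<table:table-cell
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    table:formula="of:=[.E12]+[.C13]-[.D13]"
    office:value-type="float" office:value="0"/>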

Is SP2 conformant?

Whether Microsoft’s approach to ODF was the wisest is something over which people may disagree in good faith. Whether their approach conforms to ODF should be a neutral fact we can determine with certainty.

In a follow-up posting to his initial blast, Rob Weir sets out to show that Microsoft’s approach is non-conformant, in support of his earlier statement that “SP2's implementation of ODF spreadsheets does not, in fact, conform to the requirements of the ODF standard”. After quoting a few selected extracts from the standard, he presents a list showing how various implementations represent a formula:

  • Symphony 1.3: =[.E12]+[.C13]-[.D13]
  • Microsoft/CleverAge 3.0: =[.E12]+[.C13]-[.D13]
  • KSpread 1.6.3: =[.E12]+[.C13]-[.D13]
  • Google Spreadsheets: =[.E12]+[.C13]-[.D13]
  • OpenOffice 3.01: =[.E12]+[.C13]-[.D13]
  • Sun Plugin 3.0: [.E12]+[.C13]-[.D13]
  • Excel 2007 SP2: =E12+C13-D13

Rob writes, “I'll leave it as an exercise to the reader to determine which one of these seven is wrong and does not conform to the ODF 1.1 standard.”

Again, this is clearly aimed at the slow-witted. One can imagine even the most hesitant pupil raising their hand, “please Mr Weir, is it Excel 2007 SP2?” Rob, however, is too smart to answer the question himself, for anybody who knows anything of ODF will know that, in fact, this is a tricky question.

Accordingly, Dennis Hamilton (ODF TC member and secretary of the ODF Interoperability and Conformance TC) soon chipped in among the blog comments to point out that ODF’s description of formulas is governed by the word “Typically”, rendering it arguably just a guideline. And, as I pointed out in my last post, it is certainly possible to read ODF as a whole as nothing more than a guideline.

(I am glad to be able to report that the word “typically” has been stripped from the draft of ODF 1.2, indicating its existence was problematic.)

Curious readers might like to look for themselves at the (normative) schema for further guidance. Here, we find the formal schema definition for formulas, with a telling comment:

<define name="formula">
  <!-- A formula should start with a namespace prefix, -->
  <!-- but has no restrictions-->
  <data type="string"/>
</define>

Which is yet another confirmation that there are no certain rules about formulas in ODF.
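
To make the point concrete, here is a sketch (cell values invented) of a cell carrying an Excel-style A1 formula of the kind shown in Rob's list. Since the schema constrains table:formula to nothing more than a string, markup like this is every bit as schema-valid as the bracketed convention:

<!-- Sketch only: as far as the schema is concerned, this formula is -->
<!-- just as acceptable a string as "=[.E12]+[.C13]-[.D13]" -->
<table:table-cell
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    table:formula="=E12+C13-D13"
    office:value-type="float" office:value="0"/>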

So I believe Rob’s statement that “SP2's implementation of ODF spreadsheets does not, in fact, conform to the requirements of the ODF standard” is mistaken on this point. This might be his personal interpretation of the standard, but it is based on an ingenious reading (argued around the meaning of comma placement, and privileging certain statements over others), and should certainly give no grounds for complacency about the sufficiency of the ODF specification.

As an ODF supporter I am keen to see defects, such as conformance loopholes, fixed in the next published ODF standard. I urge all other true supporters to read the drafts and give feedback to make ODF better for the benefit of everyone, next time around.

Notes on Document Conformance and Portability #1

Richard Gillam’s handy book, Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard, contains an example of right-to-left text appearing in a prevailing left-to-right writing direction:

Avram said “מזל טוב.‏” and smiled.

Whether you see here what you are meant to see will depend on your browser's Unicode support, and on whether you have Hebrew fonts installed. When it is properly rendered, the correct behaviour is as follows.

In reading order, the first character after “said” is the “מ” character to the left of the closing quotation mark. The text then runs from right to left until the full stop, and then resumes with “and smiled”. In Unicode, this text is represented not in rendering order but in reading order – it is up to the renderer to reverse direction at the correct points. Here is the text represented as XML in a paragraph in an ODF document (get the document here):

<text:p>Avram said “&#x5de;&#x5d6;&#x5dc; &#x5d8;&#x5d5;&#x5d1;.&#x200f;” and smiled.</text:p>

One of the great things about XML is its solid basis in Unicode and therefore its use of the Universal Character Set (ISO/IEC 10646). XML supports a number of encodings of this character set, and also provides the numeric character reference mechanism, which is used in the XML above for the Hebrew characters. Notice, just to the left of the full stop, the use of U+200F 'RIGHT-TO-LEFT MARK', which specifies that the full stop is part of the right-to-left character sequence.
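
Numeric character references are not the only option, of course. As a sketch (with a namespace declaration added so the fragment stands alone), the same paragraph could equally be carried in a UTF-8-encoded document with the Hebrew characters written literally; only the invisible RIGHT-TO-LEFT MARK is kept as a character reference, for legibility:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: literal UTF-8 Hebrew, with U+200F still written as a reference -->
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">Avram said “מזל טוב.&#x200f;” and smiled.</text:p>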

Viewing this document in three ODF applications (OpenOffice 3, Google Docs with Firefox, and the new MS Office 2007 SP2) gives the correct result every time. That is good news.

And if, for an ODF application, the character sequence did not appear correctly (if, say, the full stop was out of place) we would be able to say unequivocally that it was faulty; and we would be able to point to the Unicode specification where the correct behaviour is described. We (the users) would be able to bang the table and demand that the bug be fixed.

This kind of process is one of the pillars of conformance testing: application conformance testing, to be exact. Where we have a solid spec and observable behaviour, we can compare the two and make a judgement.

Where we don't have a solid spec, things get trickier. From the standardiser's viewpoint, and if it's not too highfalutin (and anyway, I claim Cambridge resident's special rights), we might want to quote Wittgenstein on such occasions: "Whereof one cannot speak, thereof one must be silent".

SC 34 Meetings, Prague - Day 1

[Photo: Digs]

SC 34 have a week of meetings in Prague. Today only WG 1 was meeting and I – for the first time – was convening it; an honour, and a slightly daunting one at that.

It was, though, very reassuring to feel that, for the first time since the OOXML days, things were returning to normal, and that the structural changes SC 34 has put in place have allowed WG 1 to return to its true purpose: XML infrastructure technologies, principally schema languages.

It was also great to see wide international participation, with experts in attendance from Canada, China, the Czech Republic, France, Japan, South Africa and the United Kingdom.

We had a full agenda and the meeting day varied from some in-the-trenches technical work (principally on 19757-8 - DSRL) to some more strategic topics. A couple of these are worth a special mention.

XML 1.0 Fifth Edition

The first is the issue of what to do about XML 1.0 Fifth Edition. This particular revision has caused consternation in some parts of the XML community by breaking compatibility with earlier editions of XML 1.0. XML titans such as Tim Bray, James Clark and David Carlisle have lined up to condemn the move, and Elliotte Rusty Harold has gone so far as to write that "The W3C Core Working group has broken faith with the XML community", and that,

Perhaps the time has come to say that the W3C has outlived its usefulness. Really, has there been any important W3C spec in this millennium that's worth the paper it isn't printed on? [...] I think we might all be better off if the W3C had declared victory and closed up shop in 2001.

Which, if nothing else, shows that when standards get passed which people don't like, the poor standards bodies get it in the neck — a phenomenon regular readers of this blog will have come across before.

So, the practical question is: what do we do about this in SC 34? If we have some standards which refer to the Fifth Edition, and others which refer to earlier editions, then there is a danger those standards are not interoperable, which flies in the face of JTC 1 requirements.

The initial mood around the table seemed to be that politics could be avoided by adopting an approach of "user beware". We would allow standards to mix references to the different versions and if implementations blew up on users then they'd know who to blame: the W3C.

On further reflection, however, consensus seems to be homing in on the idea that it would be better to keep all of our references pointing to XML 1.0 Fourth Edition for now, and to wait until the XML technologies around the Fifth Edition have matured (W3C has some work to do making XML 5Ed compatible with other W3C technologies). Then we (and thus users) would be able to embrace 5Ed more enthusiastically; for amid the turmoil it does provide some features (such as a bigger repertoire of name characters) that are wanted by some of our non-Western users.

Schema Copyright

Another interesting issue surrounded schema copyright. When a user downloads a free ISO or IEC standard from ITTF's list, they are bound by a EULA which, inter alia, stipulates:

Under no circumstances may the electronic file you are licensing be copied, transferred, or placed on a network of any sort without the authorization of the copyright owner.

Now this raises a number of questions, but the immediate one facing WG 1 is the issue of schemas. When a standard contains a schema, it is perfectly reasonable for a user to want to extract it and use it for validation – which in most scenarios definitely will require it to be "transferred, or placed on a network".

Following an exchange with Geneva it became apparent that what we should be doing is to include a separate licence with the schema, which derogates from the EULA to grant the necessary permissions. Geneva proposed a BSD-esque licence, but suggested SC 34 should sensibly innovate around it:

The following permission notice and disclaimer shall be included in all copies of this XML schema ("the Schema"), and derivations of the Schema:
 
Permission is hereby granted, free of charge in perpetuity, to any person obtaining a copy of the Schema, to use, copy, modify, merge and distribute free of charge, copies of the Schema for the purposes of developing, implementing, installing and using software based on the Schema, and to permit persons to whom the Schema is furnished to do so, subject to the following conditions:
 
THE SCHEMA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SCHEMA OR THE USE OR OTHER DEALINGS IN THE SCHEMA.
 
In addition, any modified copy of the Schema shall include the following notice:

THIS SCHEMA HAS BEEN MODIFIED FROM THE SCHEMA DEFINED IN ISO xxxxx-y, AND SHOULD NOT BE INTERPRETED AS COMPLYING WITH THAT STANDARD.
 

Already the experts are starting to hack this around and one well-supported thought was to have it submitted to the OSI to ensure it was compatible with any conceivable FOSS scenario. If any reader has expertise in this area, I'd be very interested to hear from them...

XML Prague 2009, Day 2

[Photo: Boo]

I'm afraid I missed the opening 3 talks of the day, as I was too busy fretfully primping my slides in readiness for my talk (directly after the coffee break). Luckily I will be able to catch up with the video later.

Following my talk, Mark Howe and Tony Graham presented on Xcruciate. The audience was very taken with Mark's opening cartoons (do check them out). I'm still not entirely sure I'm grokking what an XML-server actually is. Seems to be a bunch of XML functionality that's remotely invocable ... It's C-based, and there looks to be some interesting stuff under the hood ...

After lunch, Petr Nálevka's topic was "Advanced Automated Authoring with XML". Petr demonstrated an array of snazzy-looking documentation generated with a variety of wizardly XML toolchains. Just as notable as these were the snazzy effects on display from his use of an Ubuntu desktop!

Next up, it's Václav Trojan to talk about XDefinition 2.1. This turns out to be a kind of validation language. Václav claims it can operate on data sets of "unlimited size" - and it also appears to allow transformations and miscellaneous XML programming. The strength appears to be its interface with non-XML data external to documents - something the current standards are quite weak on. However, the phantasmagoria of functionality on offer seems to be controlled by a proprietary language stored in attributes - I'm pretty sure I wouldn't start from here.

To round proceedings off, it fell to the effervescent Robin Berjon to give a tour of developments in the SVG space. As promised, his presentation delivered several delightful visual bon bons and proved the perfect end to a great conference!


[Photo: Robin Berjon]

XML Prague 2009, Day 1

[Photo: Night Falls on Old Prague]

I am in Prague for the XML Prague conference, and for a week of meetings of ISO/IEC JTC 1 SC 34. Here is a running report of day 1 of the conference ...

Day 1 kicked off, after a welcome from Mohamed Zergaoui, with a presentation from Mike Kay (zOMG - no beard!) on the state of XML Schema 1.1. Mike gave a lucid tour of XML Schema's acknowledged faults, but maintained these must not distract us too much from the technology's industrial usefulness. XML Schema 1.1 looks to me mostly like a modest revamp: some tidying and clarification under the hood. One notable new feature is, however, to be introduced: assertions - a cut-down version of the construct made popular by Schematron. Mike drew something of a collective intake of breath when he claimed it was to XML Schema 1.1's advantage that it was incorporating multiple kinds of validation, and that it was "ludicrous" to validate using multiple schema technologies.
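
For those who have not met the construct, here is a minimal sketch of my own (the booking element and its attributes are invented, and this reflects the XSD 1.1 drafts rather than anything from Mike's slides): an assertion is an XPath 2.0 test which must hold for the element to be valid.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: an XML Schema 1.1 assertion requiring that a booking's -->
<!-- end date is not before its start date -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="booking">
    <xs:complexType>
      <xs:attribute name="start" type="xs:date" use="required"/>
      <xs:attribute name="end" type="xs:date" use="required"/>
      <!-- new in XSD 1.1: the test is evaluated against the typed element -->
      <xs:assert test="@end ge @start"/>
    </xs:complexType>
  </xs:element>
</xs:schema>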

A counterpoint to this view came in the next presentation from MURATA Makoto. Murata-san demonstrated the use of NVDL to validate Atom feeds which contain extensions, claiming NVDL was the only technology that allows this to be done without manually re-editing the core schemas every time a new extension is used.
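
The shape of such an NVDL script is pleasingly small. As a sketch of the general idea (the schema file name is an assumption, and this is not Murata-san's actual demo), the Atom-namespace parts of a feed are validated against an Atom RELAX NG schema, while sections in any other (extension) namespace are simply allowed through:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: validate Atom-namespace sections against atom.rng; -->
<!-- allow sections in any other namespace to pass unvalidated -->
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">
  <namespace ns="http://www.w3.org/2005/Atom">
    <validate schema="atom.rng"/>
  </namespace>
  <anyNamespace>
    <allow/>
  </anyNamespace>
</rules>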

After coffee, Ken Holman presented on "code lists" - a sort of Cinderella topic within XML validation but an important one, as code lists play a vital role in document validity in most real-world XML documents of any substance. Ken outlined a thorough mechanism for validation of documents using code lists, based on Genericode and Schematron.
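
Ken's machinery is considerably more thorough than this, but as a much-simplified sketch of the underlying idea (the currencyCode element and the Genericode file name are invented), a Schematron rule can check a document value against the codes held in a Genericode code list:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: check an element's value against the SimpleValue codes -->
<!-- held in an (assumed) Genericode file, currency-codes.gc -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="gc" uri="http://docs.oasis-open.org/codelist/ns/genericode/1.0/"/>
  <sch:pattern>
    <sch:rule context="currencyCode">
      <sch:assert test="normalize-space(.) =
                        document('currency-codes.gc')//gc:SimpleValue">
        The currency code must be one of the values in the code list.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>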

Before lunch, Tony Graham took a look at "Testing XSLT" and gave an interesting tour of some of the key technologies in this space. One of his key conclusions, and one which certainly struck a chord with me, was the assertion that ultimately the services of our own eyes are necessary for a complete test to have taken place.

Continuing the theme, Jeni Tennison introduced a new XSLT testing framework of her invention: XSpec. I sort of hope I will never have to write substantial XSLTs which merit testing, but if I do then Jeni's framework certainly looks like TDB for TDD!
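
To give a flavour of it (a sketch of my own, with an invented stylesheet and markup rather than anything from Jeni's talk), an XSpec test pairs a context with the output you expect the stylesheet to produce:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: format-name.xsl and the name markup are invented -->
<x:description xmlns:x="http://www.jenitennison.com/xslt/xspec"
               stylesheet="format-name.xsl">
  <x:scenario label="formatting a personal name">
    <x:context>
      <name><given>Ada</given><family>Lovelace</family></name>
    </x:context>
    <x:expect label="family name should come first">
      <p>Lovelace, Ada</p>
    </x:expect>
  </x:scenario>
</x:description>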

Next, Priscilla Walmsley took the podium to talk about FunctX, a useful-looking general-purpose library of XPath 2.0 (and therefore XQuery) functions. Priscilla's talk nicely helped to confirm a theme that has been emerging today, of getting real stuff done. This is not to say there is not a certain geeky intellectualism in the air - but it's to a purpose.

After tea, Robin Berjon gave an amusing tour of certain XML antipatterns. Maybe because his views largely coincided with mine I thought it a presentation of great taste and insight. Largely, but not entirely :-)

Next up, Ari Nordström gave a presentation on "Practical Reuse in XML". His talk was notable for promoting XLink, which had been a target of Robin Berjon's scorn in the previous session (though not without some contrary views from the floor). Also, URNs were proposed as an underpinning for identification purposes - a proposal which drew some protests from the ambient digiverse.

To round off the day's proceedings, George Cristian Bina gave a demo of some upcoming features in the next version of the excellent oXygen XML Editor. This is software I am very familiar with, as I use it almost daily for my XML work. George's demo concentrated on the recent authoring mode for oXygen, which allows creation of markup in a more user-friendly wordprocessor-like environment. I've sort of used this on occasion, and sort of felt I enjoyed it at the time. But somehow I always find myself gravitating back to good old pointy-bracket mode. Maybe I am just an unreconstructed markup geek ...


[Photo: Breakfast Geek-out]