[od-discuss] A harmonised Open Format definition
pm286 at cam.ac.uk
Wed Apr 22 16:08:28 UTC 2015
On Wed, Apr 22, 2015 at 4:32 PM, Herb Lainchbury <herb at dynamic-solutions.com>
> I think this is a useful but difficult area, and it would be great if we
> could make the format section of the OD more robust as a result.
> Some years ago I worked on a rating system to rate the "useability" of
> open data. We called it the ODUI (Open Data Usability Index). It was
> somewhat crude, but we found it useful when explaining to publishers why
> some published material was more useable than other material. The result
> of using this index was that a dataset would end up with a certain number
> of stars, from 1 to 5.
I think this is a useful approach.
> As part of this exercise we decided that because formats themselves can be
> used suboptimally (as in the given examples in this thread), describing
> open formats was not enough.
Yes. I like the "suboptimally". An XML document consisting of a base64 PNG
derived from a table is suboptimal. A PNG of a photograph of a bird is as
good as it gets. An Open Format has to make a reasonable effort to be as
semantic as possible.
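
A minimal sketch of that contrast, under stated assumptions: the element names (`budget`, `row`, `item`, `amount`) and the figures are invented for illustration, and the base64 payload is only the 8-byte PNG signature standing in for a real image. Both documents are well-formed XML and parse without error, but only the first gives a simple program anything to work with.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical budget table published as semantic XML.
SEMANTIC_XML = """<budget>
  <row><item>Health</item><amount>120</amount></row>
  <row><item>Roads</item><amount>45</amount></row>
</budget>"""

# The same "table" published as an opaque image payload.
# Placeholder bytes: the PNG file signature only, not a real image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
OPAQUE_XML = "<budget><image encoding='base64'>{}</image></budget>".format(
    base64.b64encode(PNG_SIGNATURE).decode("ascii")
)


def extract_amounts(xml_text):
    """Return {item: amount} for every <row> element the parser can see."""
    root = ET.fromstring(xml_text)
    return {
        row.findtext("item"): int(row.findtext("amount"))
        for row in root.iter("row")
    }


semantic = extract_amounts(SEMANTIC_XML)  # rows are directly addressable
opaque = extract_amounts(OPAQUE_XML)      # parses fine, but contains no rows
```

The second document is "XML" in name only: recovering the figures from it would need image analysis or OCR, which is exactly the suboptimal use of a format described above.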
> To get at what we wanted we created a criteria group called "readability"
> and broke it down into four sub-criteria. They are:
> 1. Digital : The data is available electronically (as opposed to just
> hard-copy print-outs)
> 2. Parse-able: The data can be parsed by a standard parser such as XML,
> JSON, CSV
Agreed. I think this is a useful term. For example PDF is parse-able. It
doesn't guarantee that the result is interpretable. For example it may
have custom fonts instead of using Unicode. (I would insist on Unicode if
it's a reasonable approach in the context - I spend far too much time with
> 3. Open Format:
> YES = XML, JSON, CSV, SHP, SQL99, XHTML, PNG, SVG
> NO = PDF, Word, XLS, FLASH, PSD
I'd agree, but note that PDF and Word are ISO standards. Would we regard
ODT as Open? We need to be clear on terms.
> 4. Structured: The file is composed of regular structures such as rows and
> columns, or objects, such that the various attributes represented by the
> data are easily accessible with a simple reusable program.
> (e.g. CSV files with rows and columns with the first row representing the
> column names)
I agree that structuring is critical.
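
The difference between "Parse-able" and "Structured" can be sketched for the CSV case quoted above. This is only one possible reading of the criterion, not the ODUI's actual test: the function name `is_structured`, the rule that header names must be non-empty and unique, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def is_structured(text):
    """One reading of criterion 4 for CSV: the first row holds unique,
    non-empty column names and every later row has the same field count."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return False  # need a header row plus at least one data row
    header, data = rows[0], rows[1:]
    names = [name.strip() for name in header]
    if not all(names) or len(set(names)) != len(names):
        return False
    return all(len(row) == len(header) for row in data)


# Hypothetical sample: header row, then regular rows.
GOOD = "item,amount\nHealth,120\nRoads,45\n"

# Hypothetical sample: a title line, a blank line, and ragged rows.
RAGGED = "Budget 2015\n\nHealth,120\nRoads,45,estimated\n"

good_ok = is_structured(GOOD)
ragged_ok = is_structured(RAGGED)
ragged_rows = list(csv.reader(io.StringIO(RAGGED)))
```

Python's standard `csv` reader accepts both samples without complaint, so both are parse-able in the sense of criterion 2. Only the first passes the structure check, which is why the two criteria are worth keeping separate.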
> It's the last one, "Structured" that speaks to using the format in a
> useful way.
> Breaking the idea of formats into these separate criteria proved useful to
> help get at what we wanted from publishers.
> Here is the original document: http://goo.gl/xGpLIs
> On Wed, Apr 22, 2015 at 1:01 AM, Peter Murray-Rust <pm286 at cam.ac.uk>
>> On Wed, Apr 22, 2015 at 8:41 AM, Rufus Pollock <rufus.pollock at okfn.org>
>>> On 21 April 2015 at 20:16, Andrew Stott <andrew.stott at dirdigeng.com>
>>>> We also need to be careful about terms like "machine-readable" - a PNG
>>>> file of a national budget is machine-readable (or, at least, more readable
>>>> by a machine than by a human!) but its machine-readability does not make
>>>> the data in it easily reusable.
>> This is a very important point but I don't think there is a simple answer
>> or term, and I think we have the opportunity to make a contribution.
>> I have struggled with this for 20 years (having been a facilitator of the
>> XML process (XML-DEV) and now actively involved in trying to develop
>> programs that "understand" PDF documents). I try to consider two categories
>> of disadvantaged "readers":
>> * blind humans
>> * machines
>> I have used the term "machine-readable" to mean that a stream of bits can
>> be read which can be displayed to a knowledgeable sighted human and
>> potentially "understood" by them.
>> And "machine understandable" to mean that the machine can add some
>> significant meaning to the bits beyond simply displaying them on the
>> I also use "born digital" to mean documents created in a computer which
>> can, if properly transmitted without corruption, retain a significant part
>> of their meaning. Many of our problems come from born-digital documents
>> being dumbed down to PDF or TIF which destroys much or all of the
>> semantics. A similar problem happens when born-digital documents are
>> printed and then re-scanned.
>> Many processes today actively encourage the destruction of born-digital
>> content. Thus a student writes their thesis in Word or LaTeX and they are
>> required to transform it to PDF as the "archival" version. IMO this
>> avoidable action alone destroys 10 Billion dollars of scientific value per
>> year. Similar processes happen with government and companies. (I have
>> become involved in trying to "machine understand" the documents that
>> companies submit to Companies House).
>> Similar things happen with diagrams and graphs. They are born digital, as
>> vectors, and then destroyed into pixels as PNG or (even worse) JPEG.
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> od-discuss mailing list
>> od-discuss at lists.okfn.org
>> Unsubscribe: https://lists.okfn.org/mailman/options/od-discuss