[od-discuss] A harmonised Open Format definition

Wed Apr 22 16:08:28 UTC 2015

On Wed, Apr 22, 2015 at 4:32 PM, Herb Lainchbury <herb at dynamic-solutions.com> wrote:

> I think this is a useful but difficult area, and it would be great if we
> could make the format section of the OD more robust as a result.
>
> Some years ago I worked on a rating system to rate the "usability" of
> open data.  We called it the ODUI (Open Data Usability Index).  It was
> somewhat crude, but we found it useful when explaining to publishers why
> some published material was more usable than other material.  Using this
> index, a dataset would end up with a certain number of stars, from 1 to 5.
>

I think this is a useful approach.

> As part of this exercise we decided that because formats themselves can be
> used suboptimally (as in the given examples in this thread), describing
> open formats was not enough.
>

Yes. I like the "suboptimally". An XML document consisting of a base64 PNG
derived from a table is suboptimal. A PNG of a photograph of a bird is as
good as it gets. An Open Format has to make a reasonable effort to be as
semantic as possible.

> To get at what we wanted we created a criteria group called "readability"
> and broke it down into four sub-criteria.  They are:
>
> 1. Digital : The data is available electronically (as opposed to just
> hard-copy print-outs)
>
> 2. Parse-able: The data can be parsed by a standard parser for a format
> such as XML, JSON, or CSV
>

Agreed. I think this is a useful term. For example, PDF is parse-able, but
that doesn't guarantee that the result is interpretable: it may use custom
fonts instead of Unicode. (I would insist on Unicode where it's a reasonable
approach in the context. I spend far too much time with non-Unicode PDFs.)
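
To make the gap between "parse-able" and "interpretable" concrete, here is a
minimal sketch using only Python's standard-library parsers. The function
name is_parseable is my own, not part of the ODUI; it checks syntax and
nothing more.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def is_parseable(text, fmt):
    """Return True if a standard-library parser accepts `text` as `fmt`.

    Hypothetical helper: it only shows that the syntax is valid. It says
    nothing about whether the parsed content can be interpreted.
    """
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        elif fmt == "csv":
            # strict=True makes the csv module raise on malformed quoting
            list(csv.reader(io.StringIO(text), strict=True))
        else:
            # Formats without a parser in this sketch are reported as False
            return False
    except (ValueError, ET.ParseError, csv.Error):
        return False
    return True


# An XML wrapper around base64 image data parses without complaint,
# even though the table it was derived from is no longer recoverable.
wrapped_image = "<table><img>iVBORw0KGgo=</img></table>"
print(is_parseable(wrapped_image, "xml"))
```

Passing a check like this would be necessary for reuse, but clearly not
sufficient.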

>
> 3. Open Format:
>           YES = XML, JSON, CSV, SHP, SQL99, XHTML, PNG, SVG
>           NO = PDF, Word, XLS, FLASH, PSD
>
>
I'd agree, but note that PDF and Word are ISO standards. Would we regard
ODT as Open? We need to be clear on terms.
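
As a rough sketch of why the terms matter, the quoted YES/NO list can be
written as a lookup. The two sets are copied from the list above; the
function name and the "unclassified" outcome are my own assumptions, added to
show what happens to a format (such as ODT) that the list does not mention.

```python
# Formats taken verbatim from the quoted ODUI list.
OPEN_FORMATS = {"XML", "JSON", "CSV", "SHP", "SQL99", "XHTML", "PNG", "SVG"}
NON_OPEN_FORMATS = {"PDF", "WORD", "XLS", "FLASH", "PSD"}


def classify_format(name):
    """Classify a format name against the quoted list (hypothetical helper)."""
    key = name.strip().upper()
    if key in OPEN_FORMATS:
        return "open"
    if key in NON_OPEN_FORMATS:
        return "not open"
    # The list is silent on this format, so the question stays open.
    return "unclassified"


for fmt in ("CSV", "PDF", "ODT"):
    print(fmt, "->", classify_format(fmt))
```

A fixed list like this cannot answer the ODT question by itself; it needs a
stated rule for what makes a format Open.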

> 4. Structured: The file is composed of regular structures such as rows and
> columns, or objects, such that the various attributes represented by the
> data are easily accessible with a simple reusable program.
> (e.g. CSV files with rows and columns with the first row representing the
> column names)
>
>
I agree that structuring is critical.
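
A minimal sketch of the "simple reusable program" idea, assuming a CSV file
whose first row holds the column names. The sample data and the function
names are invented for illustration only.

```python
import csv
import io


def read_structured_csv(text):
    """Read CSV text whose first row names the columns into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def column(rows, name):
    """Return every value of one named column, in row order."""
    return [row[name] for row in rows]


# Invented sample data, not taken from any real dataset.
sample = "department,amount\nHealth,120\nEducation,95\n"
rows = read_structured_csv(sample)
print(column(rows, "amount"))
```

The same two functions work on any file that follows the header-row
convention, which is what makes the structure reusable.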

>
> It's the last one, "Structured" that speaks to using the format in a
> useful way.
>
> Breaking the idea of formats into these separate criteria proved useful to
> help get at what we wanted from publishers.
>
> Here is the original document: http://goo.gl/xGpLIs
>
> Herb
>
> On Wed, Apr 22, 2015 at 1:01 AM, Peter Murray-Rust <pm286 at cam.ac.uk>
> wrote:
>
>>
>>
>> On Wed, Apr 22, 2015 at 8:41 AM, Rufus Pollock <rufus.pollock at okfn.org>
>> wrote:
>>
>>> On 21 April 2015 at 20:16, Andrew Stott <andrew.stott at dirdigeng.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> We also need to be careful about terms like "machine-readable" - a PNG
>>>> file of a national budget is machine-readable (or, at least, more readable
>>>> by a machine than by a human!) but its machine-readability does not make
>>>> the data in it easily reusable.
>>>>
>>>
>> This is a very important point but I don't think there is a simple answer
>> or term, and I think we have the opportunity to make a contribution.
>>
>> I have struggled with this for 20 years (having been a facilitator of the
>> XML process (XML-DEV) and now actively involved in trying to develop
>> programs that "understand" PDF documents). I try to consider two categories
>> of disadvantaged "readers":
>>  * blind humans
>>  * machines
>>
>> I have used the term "machine-readable" to mean that a stream of bits can
>> be read which can be displayed to a knowledgeable sighted human and
>> potentially "understood" by them.
>>
>> And "machine understandable" to mean that the machine can add some
>> significant meaning to the bits beyond simply displaying them on the
>> screen.
>>
>> I also use "born digital" to mean documents created in a computer which
>> can, if properly transmitted without corruption, retain a significant part
>> of their meaning. Many of our problems come from born-digital documents
>> being dumbed down to PDF or TIF which destroys much or all of the
>> semantics. A similar problem happens when born-digital documents are
>> printed and then re-scanned.
>>
>> Many processes today actively encourage the destruction of born-digital
>> content. Thus a student writes their thesis in Word or LaTeX and they are
>> required to transform it to PDF as the "archival" version. IMO this
>> avoidable action alone destroys 10 Billion dollars of scientific value per
>> year. Similar processes happen with government and companies. (I have
>> become involved in trying to "machine understand" the documents that
>> companies submit to Companies House).
>>
>> Similar things happen with diagrams and graphs. They are born digital, as
>> vectors, and then destroyed into pixels as PNG or (even worse) JPEG.
>>
>> P
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
>>
>> _______________________________________________
>> od-discuss mailing list
>> od-discuss at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/od-discuss
>> Unsubscribe: https://lists.okfn.org/mailman/options/od-discuss
>>
>>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069