[od-discuss] A harmonised Open Format definition

Herb Lainchbury herb at dynamic-solutions.com
Wed Apr 22 15:32:35 UTC 2015

I think this is a useful but difficult area, and it would be great if we
could make the format section of the OD more robust as a result.

Some years ago I worked on a rating system to rate the "usability" of open
data.  We called it the ODUI (Open Data Usability Index).  It was somewhat
crude, but we found it useful when explaining to publishers why some
published material was more usable than other material.  The result of
using this index was that a dataset would end up with a certain number of
stars, from 1 to 5.

As part of this exercise we decided that, because formats themselves can be
used suboptimally (as in the examples given in this thread), describing
open formats was not enough.  To get at what we wanted we created a
criteria group called "readability" and broke it down into four
sub-criteria.  They are:

1. Digital: The data is available electronically (as opposed to just
hard-copy print-outs)

2. Parse-able: The data can be parsed by a standard parser such as XML,

3. Open Format:
          NO = PDF, Word, XLS, FLASH, PSD

4. Structured: The file is composed of regular structures such as rows and
columns, or objects, such that the various attributes represented by the
data are easily accessible with a simple reusable program
(e.g. CSV files with rows and columns, with the first row representing the
column names).

It's the last one, "Structured", that speaks to using the format in a useful
way.
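
To make the "Structured" criterion a little more concrete, here is a rough
sketch of what a "simple reusable program" might look like for a CSV file
whose first row holds the column names.  The sample data, the column names
and the read_records helper are all made up for illustration; none of this
comes from the ODUI document itself.

```python
import csv
import io

# Hypothetical sample data, invented purely for illustration.
SAMPLE = """name,population
Alpha,1200
Beta,3400
"""


def read_records(text):
    """Return one dict per data row, keyed by the header-row column names."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE)

# Because the file is structured, each attribute is reachable by name.
total = sum(int(row["population"]) for row in records)
print(total)
```

The same few lines would work on any CSV laid out this way, which is roughly
what the criterion is trying to reward.  A PDF or a scanned image of the
same table would need far more effort to get at the same values.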

Breaking the idea of formats into these separate criteria proved useful to
help get at what we wanted from publishers.

Here is the original document: http://goo.gl/xGpLIs


On Wed, Apr 22, 2015 at 1:01 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> On Wed, Apr 22, 2015 at 8:41 AM, Rufus Pollock <rufus.pollock at okfn.org>
> wrote:
>> On 21 April 2015 at 20:16, Andrew Stott <andrew.stott at dirdigeng.com>
>> wrote:
>>> We also need to be careful about terms like "machine-readable" - a PNG
>>> file of a national budget is machine-readable (or, at least, more readable
>>> by a machine than by a human!) but its machine-readability does not make
>>> the data in it easily reusable.
> This is a very important point but I don't think there is a simple answer
> or term, and I think we have the opportunity to make a contribution.
> I have struggled with this for 20 years (having been a facilitator of the
> XML process (XML-DEV) and now actively involved in trying to develop
> programs that "understand" PDF documents). I try to consider two categories
> of disadvantaged "readers":
>  * blind humans
>  * machines
> I have used the term "machine-readable" to mean that a stream of bits can
> be read which can be displayed to a knowledgeable sighted  human and
> potentially "understood" by them.
> And "machine understandable" to mean that the machine can add some
> significant meaning to the bits beyond simply displaying them on the
> screen.
> I also use "born digital" to mean documents created in a computer which
> can, if properly transmitted without corruption, retain a significant part
> of their meaning. Many of our problems come from born-digital documents
> being dumbed down to PDF or TIF which destroys much or all of the
> semantics. A similar problem happens when born-digital documents are
> printed and then re-scanned.
> Many processes today actively encourage the destruction of born-digital
> content. Thus a student writes their thesis in Word or LaTeX and they are
> required to transform it to PDF as the "archival" version. IMO this
> avoidable action alone destroys 10 Billion dollars of scientific value per
> year. Similar processes happen with government and companies. (I have
> become involved in trying to "machine understand" the documents that
> companies submit to Companies House).
> Similar things happen with diagrams and graphs. They are born digital, as
> vectors, and then destroyed into pixels PNG or (even worse) JPEG.
> P
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
