[od-discuss] A harmonised Open Format definition

Peter Murray-Rust pm286 at cam.ac.uk
Wed Apr 22 08:01:23 UTC 2015


On Wed, Apr 22, 2015 at 8:41 AM, Rufus Pollock <rufus.pollock at okfn.org>
wrote:

> On 21 April 2015 at 20:16, Andrew Stott <andrew.stott at dirdigeng.com>
> wrote:
>
>>
>>
>> We also need to be careful about terms like "machine-readable" - a PNG
>> file of a national budget is machine-readable (or, at least, more readable
>> by a machine than by a human!) but its machine-readability does not make
>> the data in it easily reusable.
>>
>
This is a very important point but I don't think there is a simple answer
or term, and I think we have the opportunity to make a contribution.

I have struggled with this for 20 years, having been a facilitator of the
XML process (XML-DEV) and now being actively involved in trying to develop
programs that "understand" PDF documents. I try to consider two categories
of disadvantaged "readers":
 * blind humans
 * machines

I have used the term "machine-readable" to mean that a stream of bits can
be read and displayed to a knowledgeable sighted human, and potentially
"understood" by them.

I use "machine-understandable" to mean that the machine can add some
significant meaning to the bits beyond simply displaying them on the
screen.

I also use "born digital" to mean documents created in a computer which
can, if properly transmitted without corruption, retain a significant part
of their meaning. Many of our problems come from born-digital documents
being dumbed down to PDF or TIFF, which destroys much or all of the
semantics. A similar problem arises when born-digital documents are
printed and then re-scanned.

Many processes today actively encourage the destruction of born-digital
content. For example, a student writes their thesis in Word or LaTeX and is
then required to convert it to PDF as the "archival" version. IMO this
avoidable action alone destroys 10 billion dollars of scientific value per
year. Similar processes happen in governments and companies. (I have
become involved in trying to "machine understand" the documents that
companies submit to Companies House.)

Similar things happen with diagrams and graphs. They are born digital, as
vectors, and are then reduced to pixels as PNG or (even worse) JPEG.

P




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069