[Open-data-census] machine readable definition

Rufus Pollock rufus.pollock at okfn.org
Mon Oct 7 19:42:01 BST 2013


On 5 October 2013 21:43, Andrew Stott <andrew.stott at dirdigeng.com> wrote:

> Rufus
>
> I’m rather more relaxed about properly structured HTML where the data
> could be programmatically extracted (although most examples would fail the
> bulk download case).
>

Hmmm, I'm in two minds on this, but I incline to saying HTML is not
machine-readable, as you almost always have to do significant work to
re-extract the info. More below ...
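
To make the "significant work" concrete, here is a rough sketch (not taken
from any Census tooling; the sample rows and the names rows_from_csv,
TableParser and rows_from_html are invented for illustration). The same
three rows are read once from CSV and once from a clean HTML table, using
only the Python standard library:

```python
import csv
import io
from html.parser import HTMLParser

# Made-up sample data: the same table published two ways.
CSV_TEXT = "item,value\nA,1\nB,2\n"
HTML_TEXT = """
<table>
  <tr><th>item</th><th>value</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""


def rows_from_csv(text):
    """CSV: the format itself already says where rows and cells are."""
    return list(csv.reader(io.StringIO(text)))


class TableParser(HTMLParser):
    """HTML: we have to track tr/td/th state ourselves to recover rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_html(text):
    parser = TableParser()
    parser.feed(text)
    parser.close()
    return parser.rows
```

Both functions give the same rows here, but only because the HTML is
well-formed and contains a single simple table. Nested tables, merged
cells, missing closing tags or values shown as images would each need
extra handling that the CSV route never requires.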


> For instance, if an agency wants to make a data table available as HTML
> under an open licence, and the table is both viewable and reliably
> parsable by a program to get at the data, then it is hard to see how this
> is not open data.
>
> However, it would not be open data if:
>
> (1) the data is shown as, for instance, images within the HTML, which
> are not programmatically extractable.
>
> (2) the data is conveyed through formatting rather than as data itself
> (e.g. colouring; cf. the OKFN Census league table (!))
>
> (3) the data “appears” as the result of user interaction and/or the
> execution of scripts, which defeats automatic, programmatic parsing.
>
> Conversely at one time UK Civil Service vacancies (largely structured
> text) were shown on various UK Government websites with RDFa attributes in
> the HTML tags precisely in order to be scrapable.  This sort of technology
> could also be a solution to publication of contractual documents, and
> frankly more useful than downloadable PDFs or Microsoft Word files.
>

I think RDFa is one thing (and I'd put RDFa as the format rather than HTML
or perhaps HTML/RDFa) but I'd say that, by default, HTML is not
machine-readable because it always needs parsing (and most HTML is quite
bad HTML).
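
As a rough sketch of why RDFa-annotated HTML is a different case: the
values are labelled in the markup, so a program can pick them out by name
instead of guessing from layout. The snippet below is a deliberately naive
reader of `property` attributes, not a real RDFa processor; the sample
markup, the example.org vocabulary and the names PropertyExtractor and
extract_properties are all made up for illustration.

```python
from html.parser import HTMLParser

# Made-up sample: a vacancy marked up with RDFa-style attributes.
SAMPLE = """
<div vocab="http://example.org/vocab#" typeof="Vacancy">
  <h2 property="title">Example vacancy</h2>
  <p>Closes on
    <span property="closingDate" content="2013-11-01">1 November 2013</span>
  </p>
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collect {property name: value} from elements with a property attribute.

    Assumptions: every captured element has a closing tag, and properties
    are not nested inside one another. Real RDFa has many more rules.
    """

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._name = None  # property currently being captured, or None
        self._text = []    # text fragments of that property
        self._depth = 0    # child elements opened inside the captured one

    def handle_starttag(self, tag, attrs):
        if self._name is not None:
            self._depth += 1
            return
        attrs = dict(attrs)
        name = attrs.get("property")
        if name is None:
            return
        if attrs.get("content") is not None:
            # An explicit content attribute overrides the visible text.
            self.properties[name] = attrs["content"]
        else:
            self._name, self._text, self._depth = name, [], 0

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._name is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.properties[self._name] = "".join(self._text).strip()
        self._name = None


def extract_properties(html):
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties
```

The point is that the names come from the publisher rather than from the
scraper's guesswork about page layout, which is why I would list the
format as RDFa (or HTML/RDFa) and not just HTML.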


>
> As Ivan Begtin has pointed out, simply because a dataset is expressed in
> XML it does not mean that it is machine readable in any sort of practical
> way.
>

I'd say it is much more machine-readable ;-)

Machine-readability is definitely one of the more subtle items when you get
to the edges. I actually have a series of "bad-data" examples in progress
to illustrate some of the edge cases at

http://okfnlabs.org/bad-data/

> And there are a number of mapping and postcode cases where the results are
> in open formats but are not machine-readable in the sense that you could
> extract the data and reuse it.
>
> In my view we should look at machine readable as a combination of fact and
> objective judgement, and not say that a particular format is automatically
> machine-readable or not machine-readable.
>

That is definitely a good point, but I would say that *usually* HTML would
not be machine-readable (perhaps we need a weak and a strong form of it ;-) )

Rufus