[okfn-labs] fyi - Schema.org Proposal for table data descriptions

Fri Aug 16 06:39:43 UTC 2013

Tom,

This is very interesting indeed - especially the datapackage stuff they
describe later. It would be interesting to explore how to connect this to
the data.okfn style data packages.

Michael

On Wed, Aug 14, 2013 at 11:56:53AM -0400, Tom Morris wrote:
> Related to OKFN table description work...
> 
> ---------- Forwarded message ----------
> From: Omar Benjelloun (عمر بنجلون) <benjello at google.com>
> Date: Tue, Aug 13, 2013 at 5:04 PM
> Subject: Proposal: Looking inside tables
> To: public-vocabs at w3.org
> Cc: Ramanathan Guha <guha at google.com>, Dan Brickley <danbri at google.com>
> 
> 
> Hi,
> 
> Many useful datasets on the Web take the form of tables. The goal of this
> proposal is to provide a simple, schema.org-friendly way to "look inside"
> these tables, and map their contents into triples.
> 
> This is an early draft proposal developed at Google. We're seeking feedback
> from the community.
> 
> The proposal is attached to this e-mail, and will be uploaded to the
> WebSchemas/SchemaDotOrgProposals page shortly.
> 
> Thanks,
> -Omar

>                        Describing tables with schema.org
> 
>    Last updated: 8/05/2013
> 
>                              Overview & motivation
> 
>    Many useful data exist in the form of tables, either inside HTML
>    documents, or as standalone files in various formats (CSV, XLS, ODF),
>    or in relational database management systems. While tables are a
>    natural representation of data for many applications, they do not
>    provide the semantics that search engines need to effectively index
>    the information they contain and surface it to users.
> 
>    The goal of this proposal is to help bridge between data tables and
>    triples, by providing a simple mechanism to describe data tables so
>    that their contents can be understood in terms of entities and
>    properties.
> 
>    This work is similar in spirit to the [1]R2RML W3C recommendation. The
>    main differences are that 1) this proposal is oriented towards the
>    needs and expertise of mainstream schema.org publishers and 2) that it
>    is not aimed at relational databases, but focuses on annotating tables
>    that are available on the Web, and tries to make the mark-up as simple
>    as possible for this use case. It is likely that R2RML configurations
>    can be automatically generated in many cases.
> 
>    This is an early proposal. Comments are welcome.
> 
>                              Marking up HTML tables
> 
>    Consider the list of paintings by Rembrandt on the following Wikipedia
>    page:
> 
>    [2]http://en.wikipedia.org/wiki/List_of_paintings_by_Rembrandt
> 
>    Here are the first few rows (image URLs in the first column are
>    rendered in the table):
> 
>    Image | Title | Year | Technique | Dimensions | Gallery
>    ------+-------+------+-----------+------------+--------
>    [150px-Rembrandt_van_Rijn_181.jpg] | The Operation (Touch) | 1624/1625 | Oil on panel | 21.6 x 17.7 cm. | Private collection
>    [150px-Rembrandt_van_Rijn_180.jpg] | The Spectacles-pedlar (Sight) | 1624/1625 | Oil on panel | 21 x 17.8 cm. | [3]Stedelijk Museum De Lakenhal, Leiden
>    [150px-Rembrandt_The_Three_Singers_%28Hearing%29.jpg] | The Three Singers (Hearing) | 1624/1625 | Oil on panel | 21.6 x 17.8 cm. | Collection W. Baron van Dedem
> 
>    We would like each row of the table to be described as a
>    [4]http://schema.org/Painting. Today, this has to be done by marking up
>    each row individually. This process is repetitive and error-prone, and
>    the resulting mark-up is verbose.
> 
>    Instead, we propose marking up the columns in the header of the table
>    to say which properties of Painting they correspond to.
> 
>    Here is what it looks like in RDFa syntax:
> 
>    <table typeof="Painting" vocab="http://schema.org/">
> 
>      <thead>
> 
>        <tr>
> 
>          <th property="image">Image</th>
> 
>          <th property="name">Title</th>
> 
>          <th property="dateCreated">Year</th>
> 
>          <th>Technique</th>
> 
>          <th>Dimensions</th>
> 
>          <th property="contentLocation">Gallery</th>
> 
>        </tr>
> 
>      </thead>
> 
>    <tbody>...</tbody>
> 
>    </table>
> 
>    By putting mark-up on the table header, we're saying that each row of
>    the table corresponds to an instance of the class specified in typeof
>    (here, http://schema.org/Painting), and that the annotated columns
>    contain the values of the properties of Painting they are marked up
>    with.
> 
>    For instance, the first row of the table will be interpreted as a
>    Painting instance with the following property values:
>      * image:
>        [5]http://en.wikipedia.org/wiki/File:Rembrandt_van_Rijn_181.jpg
>      * name: "The Operation (Touch)"
>      * dateCreated: "1624/1625"
>      * contentLocation: "Private collection"
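The interpretation described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: HEADER_PROPERTIES, ROWS and rows_to_entities are hypothetical names, and the header annotations are assumed to have already been extracted from the th elements (None marks a column without a property attribute).

```python
# Hypothetical in-memory form of the annotated header: one entry per column,
# None where the <th> carries no property attribute (Technique, Dimensions).
HEADER_PROPERTIES = ["image", "name", "dateCreated", None, None, "contentLocation"]

# First row of the example table, with the image cell already resolved to a URL.
ROWS = [
    [
        "http://en.wikipedia.org/wiki/File:Rembrandt_van_Rijn_181.jpg",
        "The Operation (Touch)",
        "1624/1625",
        "Oil on panel",
        "21.6 x 17.7 cm.",
        "Private collection",
    ],
]


def rows_to_entities(type_name, header_properties, rows):
    """Turn each table row into one entity of the given type."""
    entities = []
    for row in rows:
        entity = {"@type": type_name}
        for prop, cell in zip(header_properties, row):
            if prop is not None:  # unannotated columns are skipped
                entity[prop] = cell
        entities.append(entity)
    return entities


paintings = rows_to_entities("Painting", HEADER_PROPERTIES, ROWS)
```

Here paintings[0] carries the four property values listed above plus its "@type"; the Technique and Dimensions cells are dropped because their columns are not annotated.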
> 
>    This specific way of interpreting mark-up to apply to all the rows of a
>    table is only intended for annotations on table elements and their
>    headers (inside thead or th elements). This differentiates it from
>    regular mark-up, which describes individual entities and therefore is
>    not meaningful on table and table header elements.
> 
>    Note that the values of columns can be either text (e.g., the title or
>    the year), or URLs (image), or both. These should be resolved based on
>    the type of the target property. Such tables (e.g., in Wikipedia) will
>    often vary unpredictably between hypertext and simpler atomic values.
>    For now we do not address the question of normalizing such content.
> 
>    Side comment: The example above does not map all the columns of the
>    table, because the Painting schema does not define properties that
>    correspond to Technique and Dimensions. The use of "contentLocation"
>    for Gallery is also a bit of a stretch.
> 
>   Constant properties
> 
>    The table above omits a critical piece of information about the
>    paintings: their painter! This can be represented by adding a
>    "constant" property, using a link element:
> 
>    <table typeof="Painting" vocab="http://schema.org/">
> 
>      <thead>
> 
>        <tr>
> 
>          <th property="image">Image</th>
> 
>          <th property="name">Title</th>
> 
>          <th property="dateCreated">Year</th>
> 
>          <th>Technique</th>
> 
>          <th>Dimensions</th>
> 
>          <th property="contentLocation">Gallery</th>
> 
>          <link property="author"
> 
>                href="http://en.wikipedia.org/wiki/Rembrandt"/>
> 
>        </tr>
> 
>      </thead>
> 
>    <tbody>...</tbody>
> 
>    </table>
> 
>    Marking up external tables (CSV, etc)
> 
>    Many useful tables live outside HTML documents, in CSV files. Mark-up
>    doesn't really work in this case. Instead we're proposing using JSON-LD
>    to define table mappings.
> 
>    Suppose our paintings table is instead available as a CSV file, with
>    the following contents (dropping the image, Technique and Dimensions
>    columns for readability):
> 
>    Title,Year,Gallery
> 
>    The Operation (Touch),1624/1625,Private collection
> 
>    The Spectacles-pedlar (Sight),1624/1625,"Stedelijk Museum De Lakenhal, Leiden"
> 
>    The Three Singers (Hearing),1624/1625,Collection W. Baron van Dedem
> 
>    First, we need a mechanism to reference the columns of this table. For
>    CSV tables on the Web, we can use [7]URI fragment identifiers to
>    address columns inside the table, either by their header or by their
>    position.
> 
>    For example, assuming the CSV table is available at the following URL:
> 
>    http://wp.org/rembrandt-paintings.csv
> 
>    The Year column can be referenced by its name:
> 
>    http://wp.org/rembrandt-paintings.csv#col:Year
> 
>    Or by its position:
> 
>    http://wp.org/rembrandt-paintings.csv#col:1
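As a rough sketch of how such column references could be resolved, the Python below looks a column up by header name or by position. It is an illustration under stated assumptions: column_values and CSV_TEXT are hypothetical names, the CSV content is inlined rather than fetched, and positions are treated as zero-based because that is what the example above suggests (the fragment-identifier draft may define this differently).

```python
import csv
import io
from urllib.parse import urldefrag

# Inlined copy of the example CSV; the Gallery value containing a comma is quoted.
CSV_TEXT = '''Title,Year,Gallery
The Operation (Touch),1624/1625,Private collection
The Spectacles-pedlar (Sight),1624/1625,"Stedelijk Museum De Lakenhal, Leiden"
The Three Singers (Hearing),1624/1625,Collection W. Baron van Dedem
'''


def column_values(csv_text, reference):
    """Return the values of the column addressed by a '#col:...' reference."""
    _, fragment = urldefrag(reference)
    if not fragment.startswith("col:"):
        raise ValueError("not a column reference: " + reference)
    selector = fragment[len("col:"):]
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Assumption: an all-digit selector is a zero-based position,
    # anything else is matched against the header row.
    index = int(selector) if selector.isdigit() else header.index(selector)
    return [row[index] for row in body]


by_name = column_values(CSV_TEXT, "http://wp.org/rembrandt-paintings.csv#col:Year")
by_position = column_values(CSV_TEXT, "http://wp.org/rembrandt-paintings.csv#col:1")
```

Under the zero-based assumption, both references select the same Year values.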
> 
>    Mapping our CSV table and its columns to Painting and its properties
>    can be done in [8]JSON-LD as follows:
> 
>    {
> 
>      "@context": "http://schema.org/",
> 
>      "@type": "Painting",
> 
>      "name": "{http://wp.org/rembrandt-paintings.csv#col:Title}",
> 
>      "dateCreated" : "{http://wp.org/rembrandt-paintings.csv#col:Year}",
> 
>      "contentLocation" : "{http://wp.org/rembrandt-paintings.csv#col:Gallery}",
> 
>      "author": "http://en.wikipedia.org/wiki/Rembrandt"
> 
>    }
> 
>    The definition above can either be embedded in an HTML document about
>    the table, by placing it in a script tag with the type attribute set to
>    application/ld+json (per JSON-LD [9]spec), or provided as part of a
>    dataset bundle (see below).
> 
>    The meaning of this definition is very similar to the mark-up version
>    above. @type specifies the type of the elements that will be generated
>    from the table (i.e., Painting). Properties of Painting can reference
>    columns of tables or constant strings. References to table columns are
>    surrounded by curly braces (e.g.,
>    "{[10]http://wp.org/rembrandt-paintings.csv#col:Title}"), but constant
>    values are not (e.g., "[11]http://en.wikipedia.org/wiki/Rembrandt").
> 
>    This syntax is based on the RFC for [12]URI templates, and allows for
>    richer patterns, as described below.
> 
>   Patterns
> 
>    In some cases, it is useful to combine values from multiple columns
>    together, or with constant strings in order to create values. This can
>    be achieved by mixing references (in curly braces) with text. For
>    instance, we could generate a description of the painting by
>    concatenating the Title, Year and Gallery columns, separated by text as
>    follows:
> 
>    {
> 
>      "@context": {
> 
>        "@vocab" : "http://schema.org/",
> 
>        "rp" : "http://wp.org/rembrandt-paintings.csv#"
> 
>      },
> 
>      "@type": "Painting",
> 
>      "name": "{rp:col:Title}",
> 
>      "dateCreated" : "{rp:col:Year}",
> 
>      "contentLocation" : "{rp:col:Gallery}",
> 
>      "author" : "http://en.wikipedia.org/wiki/Rembrandt",
> 
>      "description" : "{rp:col:Title}, {rp:col:Year}, {rp:col:Gallery}."
> 
>    }
> 
>    The description field will generate the following description for the
>    first row in the table:
> 
>    The Operation (Touch), 1624/1625, Private collection.
> 
>    Note that this example used the JSON-LD [14]prefix expansion mechanism
>    to keep references to columns short, so given the context definition
>    "rp" : "http://wp.org/rembrandt-paintings.csv#" the name property value
>    "{rp:col:Title}" is equivalent to
>    "{http://wp.org/rembrandt-paintings.csv#col:Title}".
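A minimal Python sketch of this pattern expansion follows. It is not the proposal's reference behaviour: expand_prefix and expand_pattern are hypothetical helpers, the row is passed in as a plain dictionary keyed by column header, and only by-name column references are handled.

```python
import re

TABLE_URL = "http://wp.org/rembrandt-paintings.csv"
CONTEXT = {"rp": "http://wp.org/rembrandt-paintings.csv#"}

# First row of the example table, keyed by column header.
ROW = {
    "Title": "The Operation (Touch)",
    "Year": "1624/1625",
    "Gallery": "Private collection",
}


def expand_prefix(reference, context):
    """Expand 'rp:col:Title' to the full URL when 'rp' is a known prefix."""
    prefix, sep, rest = reference.partition(":")
    if sep and prefix in context:
        return context[prefix] + rest
    return reference


def expand_pattern(pattern, row, context, table_url):
    """Replace every {column reference} in the pattern with the row's value."""
    marker = table_url + "#col:"

    def replace(match):
        full = expand_prefix(match.group(1), context)
        if not full.startswith(marker):
            raise ValueError("unknown column reference: " + match.group(1))
        return row[full[len(marker):]]

    # Text outside curly braces is kept as a constant.
    return re.sub(r"\{([^{}]+)\}", replace, pattern)


description = expand_pattern(
    "{rp:col:Title}, {rp:col:Year}, {rp:col:Gallery}.", ROW, CONTEXT, TABLE_URL
)
```

For the first row, description comes out as "The Operation (Touch), 1624/1625, Private collection.", and a value without curly braces passes through unchanged.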
> 
>    It is tempting to make the pattern language richer, to support
>    restructuring and transforming the values found in tables. There is a
>    trade-off between that expressiveness and the simplicity and
>    ease-of-use of the language.
> 
> Bundling datasets
> 
>    Many datasets consist of more than a single table. They may contain
>    multiple CSV files, and the description of the entities they contain
>    may be quite complex.
> 
>    We propose extending the [15]http://schema.org/Dataset class to act as
>    a container for such datasets, by adding the following property to it:
> 
>      * datasetElement (CreativeWork): an element that is published as part
>        of this dataset. The element can be any kind of CreativeWork (e.g.,
>        images, programs, etc.). In particular, a dataset may contain data
>        tables (e.g., as CSV files), or JSON-LD descriptions.
> 
>    For example, consider an art catalog dataset consisting of many CSV
>    files. Each CSV file lists the paintings of one painter in the catalog.
> 
>    We propose creating one Dataset that "bundles" all these files, and a
>    JSON-LD file that contains the entity mappings for each table.
> 
>    The dataset definition will look like:
> 
>    <script type="application/ld+json">
> 
>    {
> 
>      "@context": "http://schema.org",
> 
>      "@id": "paintings-dataset",
> 
>      "@type": "Dataset",
> 
>      "datasetElement": [
> 
>        {"@id": "rembrandt-paintings.csv",
> 
>         "@type": "DataDownload",
> 
>         "encodingFormat": "text/csv",
> 
>         "contentUrl": "http://wp.org/rembrandt-paintings.csv"
> 
>        },
> 
>        {"@id": "renoir-paintings.csv",
> 
>         "@type": "DataDownload",
> 
>         "encodingFormat": "text/csv",
> 
>         "contentUrl": "http://wp.org/renoir-paintings.csv"
> 
>        },
> 
>        …,
> 
>        {"@id": "paintings",
> 
>         "@type": "DataDownload",
> 
>         "encodingFormat": "application/json",
> 
>         "contentUrl": "http://wp.org/paintings.json"
> 
>        }
> 
>       ]
> 
>    }
> 
>    </script>
> 
>    The CSV files for each painter look like the one in the example above.
>    The paintings.json file contains a list of table mappings like the one
>    above:
> 
>    [{
> 
>      "@context": {
> 
>        "@vocab" : "http://schema.org/",
> 
>        "p1" : "http://wp.org/rembrandt-paintings.csv#"
> 
>      },
> 
>      "@type": "Painting",
> 
>      "name": "{p1:col:Title}",
> 
>      "dateCreated" : "{p1:col:Year}",
> 
>      "contentLocation" : "{p1:col:Gallery}",
> 
>      "author" : "http://en.wikipedia.org/wiki/Rembrandt"
> 
>    },
> 
>    {
> 
>      "@context": {
> 
>        "@vocab" : "http://schema.org/",
> 
>        "p2" : "http://wp.org/renoir-paintings.csv#"
> 
>      },
> 
>      "@type": "Painting",
> 
>      "name": "{p2:col:Title}",
> 
>      "dateCreated" : "{p2:col:Year}",
> 
>      "contentLocation" : "{p2:col:Gallery}",
> 
>      "author" : "http://en.wikipedia.org/wiki/Renoir"
> 
>    },...]
> 
>                            Identifiers and references
> 
>    In schema.org and other vocabularies, URLs are used as primary
>    identifiers of entities. We illustrate through an example how they can
>    easily be created and used for referencing entities.
> 
>    We will use the patterns described above to generate URLs for entities
>    from column values.
> 
>    In the case of URLs, the pattern language we use here is actually very
>    similar to [16]URI templates.
> 
>    Suppose we have two tables, one that lists countries and one that lists
>    cities.
> 
>    Countries
> 
>    country-code | country-name
>    -------------+-------------
>    DE           | Germany
>    FR           | France
> 
>    Cities
> 
>    city-code | city-name | city-country
>    ----------+-----------+-------------
>    PAR       | Paris     | FR
>    BER       | Berlin    | DE
> 
>    Let's assume these tables exist as two CSV files at the following
>    locations:
>      * http://my.domain.org/countries.csv
>      * http://my.domain.org/cities.csv
> 
>    The country table can be described as follows:
> 
>    {
> 
>      "@context": {
> 
>        "@vocab": "http://schema.org/",
> 
>        "t1": "http://my.domain.org/countries.csv#"
> 
>      },
> 
>      "@type": "Country",
> 
>      "@id": "http://my.domain.org/country/{t1:col:country-code}",
> 
>      "name": "{t1:col:country-name}"
> 
>    }
> 
>    The "@id" field is used in JSON-LD to specify the URL of an entity. We
>    use a pattern to construct a URL for each country, based on its country
>    code value.
> 
>    The city table description can create a reference to the country of
>    each city through its URL:
> 
>    {
> 
>      "@context":{
> 
>        "@vocab": "http://schema.org/",
> 
>        "t2": "http://my.domain.org/cities.csv#"
> 
>      },
> 
>      "@type": "City",
> 
>      "@id": "http://my.domain.org/city/{t2:col:city-code}",
> 
>      "name": "{t2:col:city-name}",
> 
>      "containedIn": "http://my.domain.org/country/{t2:col:city-country}"
> 
>    }
> 
>                               Nested descriptions
> 
>    The example above is a [17]normalized representation: because each
>    country may contain multiple cities, cities and countries are
>    represented as separate tables, and each table contains exactly one
>    row per entity.
> 
>    In reality, tables are often denormalized, with redundant information
>    in them. For instance, there may be a single city / country table that
>    looks like:
> 
>    city-code | city-name | city-country | country-name
>    ----------+-----------+--------------+-------------
>    PAR       | Paris     | FR           | France
>    LIL       | Lille     | FR           | France
>    BER       | Berlin    | DE           | Germany
>    MUN       | Munich    | DE           | Germany
> 
>    This table can be described as follows:
> 
>    {
> 
>      "@context": {
> 
>        "@vocab": "http://schema.org/",
> 
>        "t3": "http://my.domain.org/cities-counties.csv#"
> 
>      },
> 
>      "@type": "City",
> 
>      "@id": "http://my.domain.org/city/{t3:col:city-code}",
> 
>      "name": "{t3:col:city-name}",
> 
>      "containedIn": {
> 
>         "@type": "Country",
> 
>         "@id": "http://my.domain.org/country/{t3:col:city-country}",
> 
>         "name": "{t3:col:country-name}"
> 
>      }
> 
>    }
> 
>    We use JSON-LD's ability to embed structured values to nest the
>    definition of countries (with their URLs and names) inside the
>    definition of cities.
> 
>    Each row of the table will still generate one city entity; however, not
>    every row will generate a different country. Countries with the same
>    "@id" attribute will be merged into a single entity. This happens
>    naturally when the triples are generated, because they will have the
>    same subject.
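The merging behaviour can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the proposal's algorithm: build_entities is a hypothetical helper, entities are kept in a dictionary keyed by "@id" instead of being emitted as triples, and the URL patterns are hard-coded from the description above.

```python
# The denormalized city / country table, one dictionary per row.
ROWS = [
    {"city-code": "PAR", "city-name": "Paris", "city-country": "FR", "country-name": "France"},
    {"city-code": "LIL", "city-name": "Lille", "city-country": "FR", "country-name": "France"},
    {"city-code": "BER", "city-name": "Berlin", "city-country": "DE", "country-name": "Germany"},
    {"city-code": "MUN", "city-name": "Munich", "city-country": "DE", "country-name": "Germany"},
]


def build_entities(rows):
    """Generate city and country entities, merging those that share an @id."""
    entities = {}  # keyed by "@id", standing in for "same subject" in triples
    for row in rows:
        country_id = "http://my.domain.org/country/" + row["city-country"]
        city_id = "http://my.domain.org/city/" + row["city-code"]

        # setdefault returns the existing entity when the @id was seen before,
        # so repeated countries collapse into one entry.
        country = entities.setdefault(country_id, {"@id": country_id, "@type": "Country"})
        country["name"] = row["country-name"]

        city = entities.setdefault(city_id, {"@id": city_id, "@type": "City"})
        city["name"] = row["city-name"]
        city["containedIn"] = {"@id": country_id}
    return entities


entities = build_entities(ROWS)
countries = [e for e in entities.values() if e["@type"] == "Country"]
cities = [e for e in entities.values() if e["@type"] == "City"]
```

Four rows yield four city entities but only two country entities, since France and Germany each appear under a single "@id".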
> 
>                          Using JSON-LD with HTML tables
> 
>    At the beginning of this document, we showed how to describe (simple)
>    HTML tables with mark-up, and then introduced a JSON-LD syntax to
>    mark up external CSV tables. This approach can also be used for HTML
>    tables, provided identifiers are defined for table columns.
> 
>    For instance, the Country table above can be written in HTML as
>    follows:
> 
>      <table>
> 
>        <tr>
> 
>          <th id="country-code">country</th>
> 
>          <th id="country-name">name</th>
> 
>        </tr>
> 
>        <tr>
> 
>          <td>FR</td><td>France</td>
> 
>        </tr>
> 
>        <tr>
> 
>          <td>DE</td><td>Germany</td>
> 
>        </tr>
> 
>      </table>
> 
>    This table can be described by a JSON-LD fragment embedded in the same
>    document:
> 
>    <script type="application/ld+json">
> 
>    {
> 
>      "@context": "http://schema.org",
> 
>      "@type": "Country",
> 
>      "@id": "http://example.com/country/{#country-code}",
> 
>      "name": "{#country-name}"
> 
>    }
> 
>    </script>
> 
>    The only difference with the CSV table case is that columns are
>    referenced via the HTML identifiers in the table definition, instead of
>    URL fragments for CSV. Compared to the direct mark-up approach we
>    described at the beginning of the document, annotation through
>    references to columns with ids is more flexible.
> 
>                               Related Vocabularies
> 
>      * The RDF DataCube vocabulary
>        [18]http://www.w3.org/TR/2013/WD-vocab-data-cube-20130312/
>      * Direct mapping from RDB to RDF
>        [19]http://www.w3.org/TR/rdb-direct-mapping/
>      * Customized mapping from RDB to RDF
>        [20]http://www.w3.org/TR/r2rml/
>      * Defining n-ary relations on the semantic Web
>        [21]http://www.w3.org/TR/swbp-n-aryRelations/
>      * JSON-table
>        [22]http://www.dataprotocols.org/en/latest/json-table-schema.html
>      * Linked CSV
>        [23]http://jenit.github.io/linked-csv/
> 
> References
> 
>    Visible links
>    1. http://www.w3.org/TR/r2rml/
>    2. http://en.wikipedia.org/wiki/List_of_paintings_by_Rembrandt
>    3. http://en.wikipedia.org/wiki/Stedelijk_Museum_De_Lakenhal
>    4. http://schema.org/Painting
>    5. http://en.wikipedia.org/wiki/File:Rembrandt_van_Rijn_181.jpg
>    6. http://en.wikipedia.org/wiki/Rembrandt
>    7. http://tools.ietf.org/html/draft-hausenblas-csv-fragment-01
>    8. http://json-ld.org/spec/latest/json-ld
>    9. http://json-ld.org/spec/latest/json-ld/#embedding-json-ld-in-html-documents
>   10. http://wp.org/rembrandt-paintings.csv#col:Title
>   11. http://en.wikipedia.org/wiki/Rembrandt
>   12. http://tools.ietf.org/html/rfc6570
>   13. http://en.wikipedia.org/wiki/Rembrandt
>   14. http://json-ld.org/spec/latest/json-ld/#compact-iris
>   15. http://schema.org/Dataset
>   16. http://tools.ietf.org/html/rfc6570
>   17. https://en.wikipedia.org/wiki/Database_normalization
>   18. http://www.w3.org/TR/2013/WD-vocab-data-cube-20130312/
>   19. http://www.w3.org/TR/rdb-direct-mapping/
>   20. http://www.w3.org/TR/r2rml/
>   21. http://www.w3.org/TR/swbp-n-aryRelations/
>   22. http://www.dataprotocols.org/en/latest/json-table-schema.html
>   23. http://jenit.github.io/linked-csv/
> 

> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs

-- 
Data Diva | skype: mihi_tr | @mihi_tr
The Open Knowledge Foundation | School of Data
http://okfn.org | http://schoolofdata.org 
GPG/PGP key: http://tentacleriot.eu/mihi.asc