[okfn-labs] JSON table schema + CSV

Tryggvi Björgvinsson tryggvi.bjorgvinsson at okfn.org
Wed Dec 3 09:53:09 UTC 2014


Yes, I agree with you on that. As I said, I don't like this approach,
but it's the "only" approach I see as possible, based on the spec, if we
want to validate this.

I think it's best to skip the validation if the source is a CSV (and
maybe raise a warning). The only one that I think might work out is the
"lon, lat" convention for a geopoint.
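A validator following that suggestion could look like the sketch below. The function and constant names are hypothetical; only the spec's "lon, lat" string convention is actually checked, and the other complex types are skipped with a warning:

```python
import re
import warnings

# Matches the spec's "lon, lat" string form (lon first, per the spec's
# examples). GEOPOINT_RE and validate_csv_field are hypothetical names.
GEOPOINT_RE = re.compile(r'^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$')

def validate_csv_field(value, field_type):
    """Validate a raw CSV cell against a JSON table schema type.

    Only the "lon, lat" geopoint convention is enforced; the other
    complex types are skipped with a warning, as suggested above.
    """
    if field_type == "geopoint":
        match = GEOPOINT_RE.match(value)
        if not match:
            return False
        lon, lat = float(match.group(1)), float(match.group(2))
        return -180 <= lon <= 180 and -90 <= lat <= 90
    if field_type in ("object", "json", "array", "geojson"):
        warnings.warn(
            "type %r cannot be validated in a CSV source; skipping" % field_type
        )
        return True
    return True  # scalar types would be handled elsewhere
```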

/Tryggvi

On Wed 3 Dec 2014 09:47, Paul Walsh wrote:
> Ok, I will create an issue on data protocols.
>
> The thing I don’t like about the suggestion to use actual json in the
> fields is that, while it might be fine as the default, it is not very
> friendly for humans. So, if we consider that (some) CSVs are partially
> or wholly manually constructed, it *might* be less prone to error in
> data entry to describe an object as:
>
> "name=Jane,age=28"
>
> rather than:
>
> {"name": "Jane", "age": 28}
>
> Imagine the validation errors from trailing commas, inconsistent or
> incorrect use of quotes, etc.
>  
>
>> On 3 Dec 2014, at 11:36, Tryggvi Björgvinsson
>> <tryggvi.bjorgvinsson at okfn.org> wrote:
>>
>> Here are my two cents on this.
>>
>> I agree that the spec is too vague on this.
>>
>> I would as much as possible try to avoid adding a new delimiter in
>> the CSV. I would at least not expect a validator to enforce an
>> unwritten rule onto CSVs. Since the spec doesn't name | or ** or
>> whatever as a delimiter I don't think we can validate against it.
>>
>> The spec does give some examples:
>>
>>>   * *object*: (alias json) a JSON-encoded object
>>>   * *geopoint*: has one of the following structures:
>>>
>>>     { lon: ..., lat: ... }
>>>     [lon, lat]
>>>     "lon, lat"
>>>
>>>   * *geojson*: as per <http://geojson.org/>
>>>   * *array*: an array
>>>
>>
>> I think the geopoint is the simplest of these. You can just encode it
>> according to the structure in the CSV (even if it is weird):
>>
>> name,home
>> Paul,"32.0934, 34.7841"
>>
>> You could also embed JSON in the CSV, although that looks very
>> ugly and makes me shudder:
>>
>> name,home
>> Paul,"[32.0934, 34.7841]"
>>
>> or
>>
>> name,home
>> Paul,"{lon:32.0934, lat:34.7841}"
>>
>> That said, this would then also work for object (or json if you use
>> the alias). It would also work for geojson, and the array could be
>> represented as a JSON array.
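The three structures the spec lists could be parsed with one small helper; a sketch only, where `parse_geopoint` is a hypothetical name and the object form is assumed to use strict JSON (quoted keys), which is stricter than the spec's informal `{lon: ..., lat: ...}` example:

```python
import json

def parse_geopoint(raw):
    """Parse a CSV cell holding a geopoint in any of the three forms
    the spec lists: '{"lon": ..., "lat": ...}', '[lon, lat]',
    or the bare string 'lon, lat'. Returns a (lon, lat) tuple.
    """
    raw = raw.strip()
    if raw.startswith("{"):
        obj = json.loads(raw)          # object form, assuming strict JSON
        return float(obj["lon"]), float(obj["lat"])
    if raw.startswith("["):
        lon, lat = json.loads(raw)     # array form
        return float(lon), float(lat)
    lon, lat = raw.split(",")          # bare "lon, lat" string form
    return float(lon), float(lat)
```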
>>
>> Like I said, I don't like mixing these two, and I wish the spec were
>> clearer on this instead of just throwing it out there, but following
>> this is the only thing I feel can be done without the validator
>> becoming too specific by defining its own rules (which it shouldn't).
>>
>> Perhaps it is worth raising this in the issue tracker for
>> dataprotocols: https://github.com/dataprotocols/dataprotocols
>>
>> /Tryggvi
>>
>> On Wed 3 Dec 2014 09:00, Paul Walsh wrote:
>>> Hi James,
>>>
>>> Yes, I am definitely aware of CSVLint and it is a great project. I’m
>>> doing a Python implementation of schema validation and some other
>>> modular components for use in the OS ecosystem.
>>>
>>> As you say, it does not appear to handle the complex data types in
>>> the spec.
>>>
>>> I’ll provide examples of what I want to support:
>>>
>>> # people.csv with geopoint as array
>>> first_name,home
>>> Paul,32.0934|34.7841
>>>
>>> # alt. people.csv with geopoint as object
>>> first_name,home
>>> Paul,lat=32.0934**lon=34.7841
>>>
>>> # schema.json
>>> {
>>>     "fields": [
>>>         {"name": "first_name", "type": "string"},
>>>         {"name": "home", "type": "geopoint"}
>>>     ]
>>> }
>>>
>>> Geopoint is an array, so the linter or validator would need to know
>>> how to convert “home” into an array, in order for CSV files to be
>>> able to have instances of all types described in the spec.
>>>
>>> Obviously some objects (e.g.: polygons) would be very unwieldy if
>>> represented in CSV like this: a reference property would likely be a
>>> better option in some cases. 
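The conversion step described above can be sketched in a few lines; the `|` separator is the hypothetical convention from the example, not anything the spec defines:

```python
import csv
import io

# Inline copy of the example people.csv; in practice this would come
# from a file. The validator turns the "home" cell into a [lon, lat]
# array before type-checking it against the schema.
data = "first_name,home\nPaul,32.0934|34.7841\n"

rows = []
for row in csv.DictReader(io.StringIO(data)):
    # Pre-validation conversion: split on the assumed '|' separator
    row["home"] = [float(part) for part in row["home"].split("|")]
    rows.append(row)
```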
>>>
>>>
>>>> On 3 Dec 2014, at 10:37, James Smith <james at floppy.org.uk> wrote:
>>>>
>>>> Hi Paul,
>>>>
>>>> Have you come across our work with http://csvlint.io? It validates
>>>> CSV against JSON Table Schema (in fact, Tabular Data Package) as you
>>>> describe, though I don’t *think* we delve into the complex types
>>>> you mention yet. The core validation code is in a Ruby gem
>>>> at https://github.com/theodi/csvlint.rb, and we’re always open to
>>>> improvements, so if you’re interested in adding to that, we’d love
>>>> to get more people working on it :)
>>>>
>>>> For your question, I think a full-featured validator should check
>>>> that the fields match what they are supposed to be. For instance, a
>>>> field listed as GeoJSON or geopoint should be checked that its
>>>> structure is correct. As for array, yes, the spec seems vague on
>>>> that. Perhaps the spec should simply state that it should be a JSON
>>>> array as for the types above? 
>>>>
>>>> I know the CSV on the Web working group are looking at this stuff
>>>> as well -
>>>> see https://w3c.github.io/csvw/ and https://github.com/w3c/csvw,
>>>> but I can’t see anything in the current docs talking about data
>>>> types - I suspect that’s left to higher-level standards above CSV
>>>> like TDP.
>>>>
>>>> cheers,
>>>> James Smith
>>>> Open Data Institute
>>>>
>>>>> On 3 Dec 2014, at 07:46, Paul Walsh <paulywalsh at gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I’m working on a JSON table schema validator (spec
>>>>> <http://dataprotocols.org/json-table-schema/>). 
>>>>>
>>>>> My original intention was to port this Node implementation
>>>>> <https://github.com/okfn/json-table-schema-validator> to Python,
>>>>> but on closer inspection, the Node module does not cover enough of
>>>>> the spec, so I’m no longer “porting”, but writing a new
>>>>> implementation with that as a reference.
>>>>>
>>>>> My goal is to fully cover the spec, and my primary use case right
>>>>> now is validating CSV files against JSON table schemas. 
>>>>>
>>>>> CSV as the data source raises issues with several of the types in
>>>>> the spec whose representation is object or array (object/json,
>>>>> array, geopoint, geojson). I’m not aware of any implementations
>>>>> that handle this (correct me if I’m wrong). 
>>>>>
>>>>> I see two directions:
>>>>>
>>>>> 1. Don’t try to handle these types when the source is CSV (e.g. a
>>>>> CSV source could not have a field of type geopoint)
>>>>> 2. Have a spec that describes how implementations MAY parse a CSV
>>>>> field as object or array, pre-validation. Something like:
>>>>>     * TO_ARRAY (INTRAFIELD_SEPARATOR = '|'), e.g.: value|value|value
>>>>>     * TO_OBJECT (INTRAFIELD_SEPARATOR = '**',
>>>>> INTRAFIELD_ASSIGNMENT = '='), e.g.: key=value**key=value**key=value
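The proposed pre-validation step would be trivial to implement; a sketch, where the separators and the `to_array`/`to_object` names come from the proposal above and are not part of any published spec:

```python
# Hypothetical separators from the proposal above, not from the spec.
INTRAFIELD_SEPARATOR_ARRAY = "|"
INTRAFIELD_SEPARATOR_OBJECT = "**"
INTRAFIELD_ASSIGNMENT = "="

def to_array(raw):
    # "value|value|value" -> ["value", "value", "value"]
    return raw.split(INTRAFIELD_SEPARATOR_ARRAY)

def to_object(raw):
    # "key=value**key=value" -> {"key": "value", ...}
    pairs = raw.split(INTRAFIELD_SEPARATOR_OBJECT)
    return dict(pair.split(INTRAFIELD_ASSIGNMENT, 1) for pair in pairs)
```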
>>>>>
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Paul
>>>>> _______________________________________________
>>>>> okfn-labs mailing list
>>>>> okfn-labs at lists.okfn.org
>>>>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>>>>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>>>>
>>>
>>>
>>>
>>
>


