[okfn-labs] JSON table schema + CSV
Paul Walsh
paulywalsh at gmail.com
Wed Dec 3 09:47:20 UTC 2014
Ok, I will create an issue on data protocols.
The thing I don’t like about the suggestion to use actual JSON in the fields is that, while it might be fine as the default, it is not very friendly for humans. So, if we consider that (some) CSVs are partially or wholly manually constructed, it *might* be less error-prone in data entry to describe an object as:
"name=Jane,age=28"
rather than:
{"name": "Jane", "age": 28}
Imagine the validation errors from trailing commas, inconsistent or incorrect use of quotes, etc.
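A rough Python sketch of the difference, assuming a hypothetical parse_pairs helper that uses ',' and '=' as delimiters (neither is defined in the spec) and keeps every value as a string:

```python
import json


def parse_pairs(cell, pair_sep=",", assign="="):
    """Parse 'name=Jane,age=28' into a dict of strings (hypothetical format)."""
    result = {}
    for pair in cell.split(pair_sep):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a stray trailing separator
        key, found, value = pair.partition(assign)
        if not found:
            raise ValueError("missing %r in %r" % (assign, pair))
        result[key.strip()] = value.strip()
    return result


def parse_json_cell(cell):
    """Parse a JSON cell; return None when the text is not valid JSON."""
    try:
        return json.loads(cell)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# A trailing separator is harmless in the key=value form...
print(parse_pairs("name=Jane,age=28,"))
# ...but a trailing comma or single quotes make the JSON form invalid.
print(parse_json_cell('{"name": "Jane", "age": 28,}'))
print(parse_json_cell("{'name': 'Jane', 'age': 28}"))
```

The trade-off is that the key=value form loses type information (age comes back as the string '28'), so the schema would still have to say how to cast each value.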
> On 3 Dec 2014, at 11:36, Tryggvi Björgvinsson <tryggvi.bjorgvinsson at okfn.org> wrote:
>
> Here are my two cents on this.
>
> I agree that the spec is too vague on this.
>
> I would as much as possible try to avoid adding a new delimiter in the CSV. I would at least not expect a validator to enforce an unwritten rule onto CSVs. Since the spec doesn't name | or ** or whatever as a delimiter I don't think we can validate against it.
>
> The spec does give some examples:
>
>> object: (alias json) an JSON-encoded object
>> geopoint: has one of the following structures:
>>
>> { lon: ..., lat: ... }
>>
>> [lon,lat]
>>
>> "lon, lat"
>> geojson: as per <http://geojson.org/>
>> array: an array
>
> I think the geopoint is the simplest of these. You can just encode according to the structure in the csv (even if it is weird):
>
> name,home
> Paul,"32.0934, 34.7841"
>
> You could also incorporate JSON in the CSV, although that looks very ugly and makes me shudder:
>
> name,home
> Paul,"[32.0934, 34.7841]"
>
> or
>
> name,home
> Paul,"{lon:32.0934, lat:34.7841}"
>
> That said, this would then also work for object (or json if you use the alias). It would also work for geojson and the array could be represented as a json array.
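A minimal Python sketch of reading those three geopoint encodings back out of a CSV cell. The parse_geopoint name is hypothetical, and the sketch assumes the object form uses quoted keys (valid JSON), which in turn means the inner double quotes must be doubled inside the CSV field:

```python
import csv
import io
import json


def parse_geopoint(cell):
    """Return (lon, lat) as floats from one of the three geopoint structures."""
    text = cell.strip()
    if text.startswith("{"):
        obj = json.loads(text)  # needs quoted keys: {"lon": ..., "lat": ...}
        lon, lat = obj["lon"], obj["lat"]
    elif text.startswith("["):
        lon, lat = json.loads(text)
    else:
        lon, lat = text.split(",")
    return float(lon), float(lat)


# Inner double quotes are doubled, as CSV quoting requires.
data = io.StringIO(
    'name,home\n'
    'Paul,"32.0934, 34.7841"\n'
    'Paul,"[32.0934, 34.7841]"\n'
    'Paul,"{""lon"": 32.0934, ""lat"": 34.7841}"\n'
)
points = [parse_geopoint(row["home"]) for row in csv.DictReader(data)]
print(points)
```

All three rows yield the same pair. The unquoted-key form shown above ({lon:32.0934, lat:34.7841}) is not valid JSON and would be rejected by json.loads in this sketch.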
>
> Like I said, I don't like mixing these two, and I wish the spec were clearer on this instead of just throwing it out there. But following it is the only thing I feel can be done without the validator becoming too specific by defining its own rules (which it shouldn't).
>
> Perhaps it is worth raising this in the issue tracker for dataprotocols: https://github.com/dataprotocols/dataprotocols
>
> /Tryggvi
>
> On Wed 3 Dec 2014 09:00, Paul Walsh wrote:
>> Hi James,
>>
>> Yes, I am definitely aware of CSVLint and it is a great project. I’m doing a Python implementation of schema validation and some other modular components for use in the OS ecosystem.
>>
>> As you say, it does not appear to handle the complex data types in the spec.
>>
>> I’ll provide an example of what I want to support:
>>
>> # people.csv with geopoint as array
>> first_name,home
>> Paul,32.0934|34.7841
>>
>> # alt. people.csv with geopoint as object
>> first_name,home
>> Paul,lat=32.0934**lon=34.7841
>>
>> # schema.json
>> {
>> "fields": [
>> {"name": "first_name", "type": "string"},
>> {"name": "home", "type": "geopoint"}
>> ]
>> }
>>
>> Geopoint is an array, so the linter or validator would need to know how to convert “home” into an array, in order for CSV files to be able to have instances of all types described in the spec.
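A minimal Python sketch of that conversion step, driven by the schema. The CASTERS table, the cast_row helper and the '|' separator are assumptions for illustration only, and just two types are covered:

```python
import csv
import io

SCHEMA = {
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "home", "type": "geopoint"},
    ]
}


def cast_geopoint(cell, sep="|"):
    """Turn '32.0934|34.7841' into a two-item list of floats."""
    parts = [float(part) for part in cell.split(sep)]
    if len(parts) != 2:
        raise ValueError("geopoint needs exactly two values: %r" % cell)
    return parts


# Hypothetical type-to-converter table; a real validator would cover every type.
CASTERS = {"string": str, "geopoint": cast_geopoint}


def cast_row(row, schema):
    """Convert each cell with the converter for its field's declared type."""
    return {
        field["name"]: CASTERS[field["type"]](row[field["name"]])
        for field in schema["fields"]
    }


reader = csv.DictReader(io.StringIO("first_name,home\nPaul,32.0934|34.7841\n"))
result = [cast_row(row, SCHEMA) for row in reader]
print(result)
```

A cell that cannot be converted raises ValueError, which a validator could catch and report against the row and field.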
>>
>> Obviously some objects (e.g.: polygons) would be very unwieldy if represented in CSV like this: a reference property would likely be a better option in some cases.
>>
>>
>>> On 3 Dec 2014, at 10:37, James Smith <james at floppy.org.uk> wrote:
>>>
>>> Hi Paul,
>>>
>>> Have you come across our work at http://csvlint.io? It validates CSV against JSON Table Schema (in fact, Tabular Data Package) as you describe, though I don’t *think* we delve into the complex types you mention yet. The core validation code is in a Ruby gem at https://github.com/theodi/csvlint.rb, and we’re always open to improvements, so if you’re interested in adding to that, we’d love to get more people working on it :)
>>>
>>> For your question, I think a full-featured validator should check that the fields match what they are supposed to be. For instance, a field listed as GeoJSON or geopoint should be checked to confirm that its structure is correct. As for array, yes, the spec seems vague on that. Perhaps the spec should simply state that it should be a JSON array, as for the types above?
>>>
>>> I know the CSV on the Web working group is looking at this stuff as well (see https://w3c.github.io/csvw/ and https://github.com/w3c/csvw), but I can’t see anything in the current docs talking about data types. I suspect that’s left to higher-level standards above CSV, like TDP.
>>>
>>> cheers,
>>> James Smith
>>> Open Data Institute
>>>
>>>> On 3 Dec 2014, at 07:46, Paul Walsh <paulywalsh at gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I’m working on a JSON table schema validator (spec <http://dataprotocols.org/json-table-schema/>).
>>>>
>>>> My original intention was to port this Node implementation <https://github.com/okfn/json-table-schema-validator> to Python, but on closer inspection, the Node module does not cover enough of the spec, so I’m no longer “porting” it, but writing a new implementation and using the Node module as a reference.
>>>>
>>>> My goal is to fully cover the spec, and my primary use case right now is validating CSV files against JSON table schemas.
>>>>
>>>> CSV as the data source raises issues with several of the types in the spec whose representation is object or array (object/json, array, geopoint, geojson). I’m not aware of any implementations that handle this (correct me if I’m wrong).
>>>>
>>>> I see two directions:
>>>>
>>>> 1. Don’t try to handle these types when source is CSV (e.g.: A CSV source could not have a field that is type geopoint)
>>>> 2. Have a spec that describes how implementations MAY parse a CSV field as object or array, pre-validation. Something like:
>>>> * TO_ARRAY (INTRAFIELD_SEPARATOR = '|'), e.g.: value|value|value
>>>> * TO_OBJECT (INTRAFIELD_SEPARATOR = '**', INTRAFIELD_ASSIGNMENT = '='): e.g.: key=value**key=value**key=value
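The two conversions in option 2 could be sketched in Python roughly as follows. The function names are hypothetical; the separators are the ones proposed above, all values are kept as strings, and escaping of separators inside values is deliberately left out:

```python
def to_array(cell, sep="|"):
    """Split 'value|value|value' into a list of strings."""
    return cell.split(sep) if cell else []


def to_object(cell, sep="**", assign="="):
    """Split 'key=value**key=value' into a dict of strings."""
    obj = {}
    for pair in to_array(cell, sep):
        key, found, value = pair.partition(assign)
        if not found:
            raise ValueError("no %r in %r" % (assign, pair))
        obj[key] = value
    return obj


print(to_array("32.0934|34.7841"))
print(to_object("lat=32.0934**lon=34.7841"))
```

Because the result is a plain list or dict, the usual type validation could then run on it as if the data had come from a JSON source.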
>>>>
>>>>
>>>> Any thoughts?
>>>>
>>>> Paul
>>>> _______________________________________________
>>>> okfn-labs mailing list
>>>> okfn-labs at lists.okfn.org <mailto:okfn-labs at lists.okfn.org>
>>>> https://lists.okfn.org/mailman/listinfo/okfn-labs <https://lists.okfn.org/mailman/listinfo/okfn-labs>
>>>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs <https://lists.okfn.org/mailman/options/okfn-labs>
>>>
>>