[Okfn-ca] Introduction to CSV on data.okfn.org (Rufus Pollock) - Fwd: okfn-labs Digest, Vol 31, Issue 20

Ian Ward ian at excess.org
Mon Aug 26 12:16:01 UTC 2013


On Sun, Aug 25, 2013 at 9:56 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 25 August 2013 01:38, Ian Ward <ian at excess.org> wrote:
>> My favourite text format for streaming structured data these days is "JSON
>> Lines" (.jl). It can do tables, nested structures, simple data types,
>> unicode-stored-as-ascii and is almost completely unambiguous.
>
> I like JSON lines but average-person tools don't support it (e.g.

Good point. This makes me want to write a .jl <-> .xlsx converter. A
really basic implementation that just handles tables would be trivial.
Representing nested structures in Excel would be more fun; maybe use
vertically joined cells to indicate nesting?
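
As a rough illustration of the "just handles tables" case, here is a
minimal Python sketch that turns JSON Lines records into a header row
plus data rows. It assumes every record is a flat JSON object; the
function name jl_to_rows is made up for this example, and writing the
rows out to an actual .xlsx file is left to whatever spreadsheet
library is at hand.

```python
import json


def jl_to_rows(lines):
    """Turn JSON Lines text (one flat JSON object per line) into table rows.

    Returns a list whose first item is the header row. Assumes every
    record is a JSON object with no nested values.
    """
    records = [json.loads(line) for line in lines]

    # Build the header from the keys in first-seen order.
    header = []
    for record in records:
        for key in record:
            if key not in header:
                header.append(key)

    # Cells for keys a record does not have are left as None.
    rows = [[record.get(key) for key in header] for record in records]
    return [header] + rows


table = jl_to_rows(['{"a": 1, "b": 2}', '{"b": 3, "c": 4}'])
```

Records with different keys simply leave cells empty, which is about
as far as a flat table can go without a convention for nesting.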

Even with CSVs in Excel it's really easy to get things wrong. It seems
that Excel can't auto-detect delimiters reliably, and you always need
to manually select the encoding. I've never seen it default to UTF-8.

> spreadsheets). Also do you know of an actual "spec" (even very rough) or is
> it just. E.g. "object gets its own line"

I haven't seen an official spec. If one doesn't exist why don't we
write an RFC? A simple version of the spec could fit on a business
card the way the JSON spec does:

<valid JSON w/o newlines>[(<newline sequence><valid JSON w/o newlines>)...][<newline sequence>]

The file encoding MUST be ASCII or UTF-8. The newline sequence SHOULD
be "\r\n", but "\r" and "\n" alone MUST also be accepted. The trailing
newline sequence SHOULD be present, but files missing the trailing
newline sequence MUST be processed correctly.

Records may be referenced by line numbers starting from 1, the same as
used by most text processing tools.
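
The reading rules above could be sketched in Python roughly as
follows. This is only an illustration of the proposed rules, not a
reference implementation: the name parse_json_lines is made up, and
the whole input is held in memory to keep the example short.

```python
import json


def parse_json_lines(data):
    """Yield (line_number, record) pairs from JSON Lines bytes.

    Line numbers start from 1. Accepts "\r\n", "\r" or "\n" as the
    newline sequence, with or without a trailing newline.
    """
    text = data.decode("utf-8")  # ASCII input is also valid UTF-8

    # Normalise the three accepted newline sequences to "\n". Avoid
    # str.splitlines(), which also splits on characters such as U+2028
    # that may legally appear inside a JSON string.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the trailing newline sequence is optional

    for number, line in enumerate(lines, start=1):
        yield number, json.loads(line)


sample = b'{"a": 1}\r\n[1, 2]\r"x"\n{"b": null}'
parsed = list(parse_json_lines(sample))
```

A blank line in the middle of the input is not valid JSON, so
json.loads raises an error for it, which matches the grammar above.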

Stream compression such as gzip or bzip2 SHOULD be used for large
files, producing .jl.gz or .jl.bz2 files. PKZIP, tar or other tools
that archive multiple files SHOULD NOT be used for single JSON Lines
files as they make it harder to process the data as a stream.
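
To show why stream compression fits, here is a small Python sketch
that writes and reads a .jl.gz file one record at a time using only
the standard library. The function names and the example file name
are assumptions made up for this illustration.

```python
import gzip
import json
import os
import tempfile


def write_jl_gz(path, records):
    """Write records as gzip-compressed JSON Lines with "\r\n" newlines."""
    # json.dumps escapes non-ASCII characters by default, so the output
    # is plain ASCII and never contains a raw newline.
    with gzip.open(path, "wt", encoding="ascii", newline="\r\n") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")  # "\n" is written as "\r\n"


def read_jl_gz(path):
    """Yield (line_number, record) pairs without loading the whole file."""
    # newline=None accepts "\r\n", "\r" and "\n" as line endings.
    with gzip.open(path, "rt", encoding="utf-8", newline=None) as f:
        for number, line in enumerate(f, start=1):
            yield number, json.loads(line)


records = [{"name": "Zo\u00eb", "tags": ["a", "b"]}, {"name": "Bob", "tags": []}]
path = os.path.join(tempfile.mkdtemp(), "example.jl.gz")
write_jl_gz(path, records)
round_trip = list(read_jl_gz(path))
```

Because gzip is a single stream, the reader can start handing out
records as soon as the first line is decompressed.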

Ian



