[Okfn-ca] Alternatives to OpenRefine | Fwd: School-of-data Digest, Vol 16, Issue 2
Frederick Giasson
fred at fgiasson.com
Thu Jul 11 11:58:50 UTC 2013
Hi Everybody,
> My personal preference is to use a scripting language for all ETL
> work. There is no bizarre corner case or integration problem that
> cannot be easily dealt with a simple script. Python is an obvious
> choice: tasks that would be a hassle with a tool like OpenRefine or
> in, say, Java are a breeze, fast, and somewhat enjoyable to work on.
I agree with Peder here.
> ETL is my full time job, so I do grant that that not everyone has the
> luxury to figure out all the tricks of data manipulation with
> something like Python or Ruby. But if possible, it's an investment
> that is worth making, and will pay big dividends over the long term
> for any organization that needs to aggregate data.
However, I would make a distinction here. ETL stands for Extract,
Transform, Load. On Extraction, I agree that most if not all of the
time scripting is probably the best way to go, since you are not
constrained by a rigid framework that has to cope with the thousands
of different source data formats that may need converting.
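To illustrate the point about scripting the Extract step: source quirks that would be painful to configure in a tool are often a few lines of Python. This is only a sketch; the semicolon-delimited sample with European decimal commas is invented for the example.

```python
import csv
import io

# Invented sample: semicolon-delimited, padded fields, decimal commas.
raw = "id; amount\n 001 ; 12,50\n 002 ; 7,00\n"

rows = []
for rec in csv.reader(io.StringIO(raw), delimiter=";"):
    ident, amount = (field.strip() for field in rec)
    if ident == "id":
        continue  # skip the header row
    # Normalize the decimal comma before parsing the number.
    rows.append({"id": ident, "amount": float(amount.replace(",", "."))})

# rows == [{"id": "001", "amount": 12.5}, {"id": "002", "amount": 7.0}]
```

Each new corner case is just another line or two in the loop, which is the flexibility a rigid extraction framework cannot match.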
The Transform and Load steps, however, are another story, and they are
probably the most important part of the equation. You can transform and
load the data into nearly anything, but are some choices better than
others? From my experience, there certainly are. What I would advocate
is taking the time to learn and understand the description framework
into which all the data gets transformed and loaded. This is crucial,
because what we are talking about here is converting all kinds of data
sources, coming from all kinds of infrastructures, and sharing the
result with all kinds of different organizations and groups. It is
better to pick the right framework right at the beginning.
For those who know me, the obvious choice is RDF + OWL. RDF is a
Resource Description Framework that has been actively developed for
more than a decade by the W3C and has been implemented in hundreds of
open source and commercial applications. The specifications of the
data model and all its surrounding technologies (such as the SPARQL
query language) have been developed, iterated, and scrutinized by
leading academics and commercial enterprises around the world.
The premise here is that the RDF framework is flexible enough to
describe any kind of data that can be described with any other data
description framework currently in existence. Because of this
flexibility, it becomes the premier choice for ETL: any kind of data
can be converted into this canonical framework, and can eventually be
exported back into its original format.
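The round-trip idea above can be sketched in a few lines: a flat record is lowered into RDF-style (subject, predicate, object) triples and then recovered unchanged. Plain Python tuples stand in for a real RDF library such as rdflib, and the example.org namespace and the record contents are made up for illustration.

```python
EX = "http://example.org/"  # hypothetical namespace for the example

def to_triples(record_id, record):
    """Lower a flat record (dict) into (subject, predicate, object) triples."""
    subject = EX + record_id
    return [(subject, EX + key, value) for key, value in record.items()]

def from_triples(triples):
    """Recover the flat record from its triples (export back to the source shape)."""
    return {pred[len(EX):]: obj for _subj, pred, obj in triples}

row = {"name": "Ada Lovelace", "city": "London"}
triples = to_triples("person/1", row)
assert from_triples(triples) == row  # round-trips back to the original record
```

In practice a library like rdflib would manage the graph, serialization formats, and SPARQL queries, but the canonical-model principle is the same: everything becomes triples, and the original shape can be rebuilt from them.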
So, doing ETL is one thing, but which framework you target with it is
quite another, and that decision can be crucial.
Thanks
Fred
>
> Cheers,
>
> Peder Jakobsen
> Consultant, OKFN CKAN & data.gc.ca <http://data.gc.ca>
>