[Okfn-ca] Alternatives to OpenRefine | Fwd: School-of-data Digest, Vol 16, Issue 2
Frederick Giasson
fred at fgiasson.com
Thu Jul 11 11:58:50 UTC 2013
Hi Everybody,
> My personal preference is to use a scripting language for all ETL
> work. There is no bizarre corner case or integration problem that
> cannot be easily dealt with a simple script. Python is an obvious
> choice: tasks that would be a hassle with a tool like OpenRefine or
> in, say, Java are a breeze, fast, and somewhat enjoyable to work on.
I agree with Peder here.
> ETL is my full time job, so I do grant that that not everyone has the
> luxury to figure out all the tricks of data manipulation with
> something like Python or Ruby. But if possible, it's an investment
> that is worth making, and will pay big dividends over the long term
> for any organization that needs to aggregate data.
However, I would make a distinction here. ETL stands for Extract,
Transform, Load. On Extraction, I agree that most if not all of the
time scripting is probably the best way to go, since you are not
constrained by a rigid framework that has to cope with the thousands
of different source data formats that may need converting.
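To illustrate the point about scripting the Extract step: source quirks that would be painful to configure in a tool are often a few lines of Python. This is only a sketch; the semicolon-delimited sample with European decimal commas is invented for the example.

```python
import csv
import io

# Invented sample: semicolon-delimited, padded fields, decimal commas.
raw = "id; amount\n 001 ; 12,50\n 002 ; 7,00\n"

rows = []
for rec in csv.reader(io.StringIO(raw), delimiter=";"):
    ident, amount = (field.strip() for field in rec)
    if ident == "id":
        continue  # skip the header row
    # Normalize the decimal comma before parsing the number.
    rows.append({"id": ident, "amount": float(amount.replace(",", "."))})

# rows == [{"id": "001", "amount": 12.5}, {"id": "002", "amount": 7.0}]
```

Each new corner case is just another line or two in the loop, which is the flexibility a rigid extraction framework cannot match.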
The Transform and Load steps, however, are another story, and they are
probably the most important part of the equation. You can transform and
load the data into nearly anything, but are some choices better than
others? From my experience, there certainly are. What I would advocate
is taking the time to learn and understand the description framework
into which all the data gets transformed and loaded. This is crucial,
because what we are talking about here is converting all kinds of data
sources, coming from all kinds of infrastructures, and sharing the
result with all kinds of different organizations and groups. It is
better to pick the right framework right at the beginning.
For those who know me, the obvious choice is RDF + OWL. RDF is a
Resource Description Framework that has been actively developed for
more than a decade by the W3C and has been implemented in hundreds of
open source and commercial applications. The specifications of the
data model and all its surrounding technologies (such as the SPARQL
query language) have been developed, iterated, and scrutinized by
leading academics and commercial enterprises around the world.
The premise here is that the RDF framework is flexible enough to
describe any kind of data that can be described with any other data
description framework currently in existence. Because of this
flexibility, it becomes the premier choice for ETL: any kind of data
can be converted into this canonical framework, and can eventually be
exported back into its original format.
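The round-trip idea above can be sketched in a few lines: a flat record is lowered into RDF-style (subject, predicate, object) triples and then recovered unchanged. Plain Python tuples stand in for a real RDF library such as rdflib, and the example.org namespace and the record contents are made up for illustration.

```python
EX = "http://example.org/"  # hypothetical namespace for the example

def to_triples(record_id, record):
    """Lower a flat record (dict) into (subject, predicate, object) triples."""
    subject = EX + record_id
    return [(subject, EX + key, value) for key, value in record.items()]

def from_triples(triples):
    """Recover the flat record from its triples (export back to the source shape)."""
    return {pred[len(EX):]: obj for _subj, pred, obj in triples}

row = {"name": "Ada Lovelace", "city": "London"}
triples = to_triples("person/1", row)
assert from_triples(triples) == row  # round-trips back to the original record
```

In practice a library like rdflib would manage the graph, serialization formats, and SPARQL queries, but the canonical-model principle is the same: everything becomes triples, and the original shape can be rebuilt from them.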
So, doing ETL is one thing, but which framework you target with it is
quite another, and that decision can be crucial.
Thanks
Fred
>
> Cheers,
>
> Peder Jakobsen
> Consultant, OKFN CKAN & data.gc.ca <http://data.gc.ca>
>