[ckan-discuss] Clear scoping of requirements

Tue Sep 7 22:42:27 BST 2010

On 10-09-07 00:00, Jo Walsh wrote:
>
> I doubt anyone who is really familiar with the CKAN codebase will
> contradict me when i suggest that it needs re-engineering from the
> inside out.

I guess there are two kinds of issue here. The minor one is that,
as with just about any significant piece of software, there is
code cleanup and refactoring that could stand to be done. Also,
as with any software, there will be different opinions about
*how* this should be done. I have never seen a piece of
web-centric software and had the impression, "wow, that is
beautiful". I have worked on things built with hand-crafted Perl,
NeXT WebObjects, Zope, Drupal and Django, and have always
encountered a goodly dose of spaghetti. CKAN and Pylons actually
measure up favourably in my opinion.
Even the Linux kernel is not entirely pretty (the BSD
kernel actually is). As I said to John Bywater the other day, I
have seen much worse but can imagine much better.

There are two larger issues. The first is storing tables of
key-value pairs in the SQL database. This kind of thing will
make the hair of any DBA raised on third normal form stand on
end. It is an ugly tradeoff to avoid modifying the database
schema whenever attributes (columns) might vary or can't be
known in advance. Drupal does it; in fact just about any
non-trivial data management system using a SQL database that
allows the user any discretion over the attributes of things
has to do it, or live with schema changes that will diverge
when the app is installed in more than one place.
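
To make the tradeoff concrete, here is a minimal sketch of the
key-value pattern using Python's built-in sqlite3 module. The
table and column names (package, package_extra) and the sample
data are made up for illustration and are not meant to describe
the actual CKAN schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Hypothetical table: one row per (package, attribute) pair
    -- instead of one column per attribute.
    CREATE TABLE package_extra (
        package_id INTEGER NOT NULL REFERENCES package(id),
        key        TEXT NOT NULL,
        value      TEXT,
        PRIMARY KEY (package_id, key)
    );
""")
conn.execute("INSERT INTO package VALUES (1, 'census-2001')")
conn.executemany(
    "INSERT INTO package_extra VALUES (?, ?, ?)",
    [(1, "licence", "cc-by"), (1, "country", "uk")],
)

# The upside: a new attribute needs no ALTER TABLE, just another row.
conn.execute(
    "INSERT INTO package_extra VALUES (1, 'temporal_coverage', '2001')"
)


def load_package(conn, package_id):
    """Pivot the key-value rows back into one logical record."""
    name = conn.execute(
        "SELECT name FROM package WHERE id = ?", (package_id,)
    ).fetchone()[0]
    extras = dict(
        conn.execute(
            "SELECT key, value FROM package_extra WHERE package_id = ?",
            (package_id,),
        )
    )
    return {"name": name, **extras}


record = load_package(conn, 1)
```

The downside shows up in load_package: every value is untyped
text, and the database can no longer enforce which attributes a
package must or may have.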

The second is when we start representing relationships between
objects (in this case packages). The resulting hierarchy is not
something that is handled well by relational databases. You
can put the data in, but querying it, traversing the tree, is
unwieldy. Here again, nothing that is based on a SQL paradigm
does a good job.
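
A rough sketch of what the traversal tends to look like, again
with sqlite3. The package_relationship table, the depends_on
relationship type and the package names are assumptions for the
sake of the example. Storing the edges is one simple table; the
awkward part is that following them takes one round trip to the
database per node visited.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table: subject --type--> object
conn.execute(
    "CREATE TABLE package_relationship (subject TEXT, type TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO package_relationship VALUES (?, ?, ?)",
    [
        ("atlas", "depends_on", "boundaries"),
        ("atlas", "depends_on", "census"),
        ("census", "depends_on", "postcodes"),
        ("postcodes", "depends_on", "gazetteer"),
    ],
)


def all_dependencies(conn, start):
    """Collect everything reachable from `start` via depends_on.

    Returns the set of reachable packages and the number of SQL
    queries it took to find them.
    """
    seen = set()
    frontier = [start]
    queries = 0
    while frontier:
        current = frontier.pop()
        rows = conn.execute(
            "SELECT object FROM package_relationship "
            "WHERE subject = ? AND type = 'depends_on'",
            (current,),
        ).fetchall()
        queries += 1
        for (obj,) in rows:
            if obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen, queries


deps, queries = all_dependencies(conn, "atlas")
```

Here four dependencies cost five queries, and the traversal logic
lives in application code rather than in the query itself.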

These two larger problems are areas where RDF shines. I definitely
think that a longer-term strategy should be to migrate to a
triplestore back-end instead of a SQL database. Beyond that, very
interesting possibilities in terms of interpreting, correlating
and inferencing over the data open up.
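
As a toy illustration of why the triple model fits both problems,
here is a sketch in plain Python with no RDF library involved. A
real triplestore would use URIs and a query language such as
SPARQL; the short string names and the match and reachable
helpers below are invented for the example.

```python
# Each fact is a (subject, predicate, object) triple. Attributes and
# relationships between packages are stored the same way.
triples = {
    ("atlas", "title", "Statistical Atlas"),
    ("atlas", "licence", "cc-by"),
    ("atlas", "depends_on", "census"),
    ("census", "depends_on", "postcodes"),
    ("census", "temporal_coverage", "2001"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None is a wildcard."""
    return {
        (ts, tp, to)
        for (ts, tp, to) in triples
        if s in (None, ts) and p in (None, tp) and o in (None, to)
    }


def reachable(triples, start, predicate):
    """Follow `predicate` edges transitively from `start`."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for (_, _, obj) in match(triples, s=node, p=predicate):
            if obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


# A previously unknown attribute is just one more triple: no schema
# change and no separate key-value table.
triples.add(("atlas", "spatial_coverage", "uk"))

atlas_properties = {p: o for (_, p, o) in match(triples, s="atlas")}
atlas_deps = reachable(triples, "atlas", "depends_on")
```

The same pattern-matching primitive serves both for reading a
package's attributes and for walking the relationships between
packages.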

This would mean creating a truly next generation web application.
Most existing applications, such as Drupal and the current CKAN,
have support for RDF in what might, somewhat unfairly, be called
"legacy mode". They have their rigid data models but can
export them, or express them as RDF so that other programs that
can interpret the RDF can do things with it. A "next generation
CKAN" would be among the first real-world native RDF applications.
It's not quite uncharted territory; there have been enough
experiments in this direction by OKF and others that I think
we have a handle on what it should look like -- it's not "pure
research" anymore, but there are still unanswered questions and
there will inevitably be bumps in the road and pitfalls along
the way.

It's a big job and I definitely agree that we should be thinking
about it and discussing how best it could be done. I also think
that it shouldn't be rushed and that we absolutely must make sure
there is a clear migration path from what we have now and a
seamless transition for the users.

I'm sure we (OKF generally and CKAN developers in particular)
would very much appreciate input from the community about
whether and how we might go about this.

Cheers,
-w
-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/