[ckan-discuss] ckan.net linking open data group and lod cloud

Richard Cyganiak richard at cyganiak.de
Sun Apr 25 21:24:29 BST 2010

Hi Rufus,

(Adding ckan-discuss to cc list. Context: We are discussing whether it  
would be feasible to curate the dataset information that goes into  
making the LOD Cloud diagram [1] within ckan.net.)

On 20 Apr 2010, at 09:42, Rufus Pollock wrote:
>> Your message made me ponder whether we could actually maintain the  
>> data
>> behind the LOD cloud inside ckan.net. It's an intriguing idea. But  
>> I'm
>> afraid that the completely open wiki-style nature of CKAN wouldn't be
>> compatible with the way we work.
> You understand that groups are curated? That is, only group admins can
> add/remove packages from their group. (This is unlike a tag that
> anyone can add/remove)

Yes, I've seen the groups feature, and it would be useful here.

> That of course still leaves the packages themselves. I do note that
> ckan (and hence ckan.net) has full acl support:
> <http://wiki.okfn.org/ckan/doc/accesscontrol/>

Oh, nice. I hadn't seen this.

> Going forward we do anticipate a time when packages start having
> dedicated maintainers who may wish to switch off "anyone can edit"
> capability (that facility is in there currently but atm we discourage
> people from using that capability).

Ok, so one option would be for Anja and me to create the packages that  
represent the LOD Cloud as locked-down access controlled packages, and  
create a group that contains all of them.

The problem with this is that many of these packages already exist in  
ckan.net/group/lod and we would either have to duplicate those, or  
lock down packages that are currently editable by anyone. Would any of  
those options be acceptable to CKAN? They don't seem too much in the  
spirit of the site ...

>> ckan.net would need to have at least a discussion page for each  
>> dataset.
> This sounds very like a comments feature we've been discussing for a
> while.

By "discussion page" I mean something like the "talk pages" on  
Wikipedia, where meta-discussions around shaping the entry could take  
place. I'm thinking of discussion like:

A: "I think the license is stated incorrectly here, according to
http://foobar this dataset is in the public domain"
B: "The source for the currently stated license is here: http://xyz"
A: "That page is clearly outdated, and here's a post from John Doe  
that confirms public domain license ..."
B: "Ok, I updated the record"

A Comments feature is not good for this IMO, it would just add a lot  
of noise to the CKAN catalog pages.

The key idea at Wikipedia is that they separate the artifact (the  
Wikipedia article) from the discussion that shapes the artifact (the  
separate talk pages). This keeps the artifact page free from process  
discussion, and reduces resistance that users might have against  
posting on the highly visible artifact page.

> The other thing we've thought about is the ability to have
> pending changes -- i.e. someone makes a change to package and it
> doesn't get applied immediately but put on a stack waiting for admin
> approval (this is rather like a patch queue).

This is probably the right model for an environment where strict  
quality control is necessary (e.g., program code, where a bad change  
breaks the build, causes bugs, and is a security problem). In a wiki,  
a spirit of "just do it, it can always be reverted" is more  
appropriate. Now, CKAN is somewhere in the middle -- currently  
probably closer to a wiki with imposed structure, but in the long term  
you probably want much more automated processing to happen around your  
data. So the patch queue model might be the right one.
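
To make the patch queue model concrete, here is a minimal Python sketch. The class and method names (PatchQueue, submit, approve_next, reject_next) are hypothetical and not part of CKAN; the sketch only assumes that a package can be treated as a dictionary of fields and that an admin reviews changes in the order they arrive.

```python
from collections import deque


class PatchQueue:
    """Hold proposed edits to one package until an admin reviews them.

    Hypothetical illustration only; this is not CKAN's API.
    """

    def __init__(self, package):
        self.package = dict(package)   # current, approved state
        self.pending = deque()         # proposed changes, oldest first

    def submit(self, author, changes):
        # A submitted change is recorded but NOT applied yet.
        self.pending.append((author, dict(changes)))

    def approve_next(self):
        # Apply the oldest pending change and return its author.
        author, changes = self.pending.popleft()
        self.package.update(changes)
        return author

    def reject_next(self):
        # Discard the oldest pending change and return its author.
        author, _changes = self.pending.popleft()
        return author
```

The wiki model would correspond to applying every change in submit() directly and relying on reverts afterwards.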

>> Without this, I'd be afraid that anyone could just come in and mess  
>> things
>> up, and I'd have to chase them down out-of-band. The site would  
>> also need
>> group- and dataset-level watchlists. This would give me reassurance  
>> that I
>> myself -- and hopefully a few other folks -- would look over the  
>> changes and
>> revert or fix/improve anything that's not good enough.
> There's already an atom feed (and API) with all changes.

The feed with all changes does not work for me, because only a small  
fraction of those changes will be relevant to myself, and I already  
have too much noise in my inbox.

> We've also just added ability to get package specific feeds ...

I don't want to manually subscribe to an additional feed whenever I  
add a package to a group.

Feeds for groups, please! This is a tool that group admins really  
need, IMO.

I wonder whether I can filter only the stuff for one group from the
all-changes feed with Yahoo Pipes...
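
The same kind of filter could also be written as a small script instead of a Pipe. The Python sketch below is only an illustration: it assumes that each entry in the all-changes Atom feed carries a link whose URL ends in /package/<name>, which may not match the real feed layout, and the function name entries_for_group is made up for this example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def entries_for_group(atom_xml, group_packages):
    """Return (title, href) pairs for entries that touch a group's packages.

    Assumption: an entry links to the changed package with a URL of the
    form .../package/<name>. Adjust the matching if the feed differs.
    """
    root = ET.fromstring(atom_xml)
    matches = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        for link in entry.iter(ATOM + "link"):
            href = link.get("href", "")
            # Last path segment is taken to be the package name.
            name = href.rstrip("/").rsplit("/", 1)[-1]
            if "/package/" in href and name in group_packages:
                matches.append((title, href))
                break  # one match per entry is enough
    return matches
```

The set of package names for the group would have to be kept up to date separately, which is exactly the manual step a built-in group feed would avoid.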

>> Or do you think my worries are unfounded here?
> I think at a start you could see what happened :) So far vandalism and
> spam have been kept very effectively in check and I think, in general,
> most edits to packages you were curating would be useful. At the same
> time see comments above for features that may already do some of what
> you want ...

Well. So let's say: If I can get a per-group feed, and a separate
lod-cloud group (with Anja and me as initial group managers, more  
welcome), then I'm in, using unrestricted editing for the packages. So  
the per-group feed is the one extra thing I need to be confident that  
the curated data is reasonably safe.

I'll try building that feed with Yahoo Pipes. If that doesn't work,  
then I'll probably have to wait until you implement group feeds  

> I've already thought that we could start using agreed prefixes in
> ckan.net extras fields as a way of storing RDF info (which can then be
> proper RDF on semantic.ckan.net) --

Are extra fields in the RDF output at all? If they are, then I don't  
worry too much about this. I'd be willing to code something that runs  
CONSTRUCT queries or some other processing to get voiD data out of the  
extra fields.

A related question. In the LOD Cloud data, we track links between  
datasets. This could be done in CKAN using an extra field "links to",  
where the value is some identifier for the target dataset, e.g., its  
CKAN page. Now the problem is, sometimes we also want to keep track of  
the number of links between two given datasets. For example, dataset5  
links to dataset23 and there are 50k links between them. Do you have  
an idea how to represent this in CKAN? Could this be recorded using  
some convention with extra fields? Again, I don't have a problem doing  
post-processing that turns this into the final format -- just don't  
want to abuse the CKAN schema too much.
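
One possible convention, sketched here purely as an assumption, would be one extra field per target dataset, with a key such as "links:dataset23" and the link count as the value (left empty when the count is unknown). The "links:" prefix and the function name parse_links are hypothetical and not an agreed CKAN convention. Post-processing such fields in Python could look like this:

```python
LINK_PREFIX = "links:"  # hypothetical key prefix, not a CKAN standard


def parse_links(extras):
    """Extract {target_dataset: link_count_or_None} from a package's extras.

    `extras` is assumed to be a plain dict of extra field names to string
    values. A missing or non-numeric count is recorded as None, meaning
    "links exist but the number is not known".
    """
    links = {}
    for key, value in extras.items():
        if not key.startswith(LINK_PREFIX):
            continue  # unrelated extra field
        target = key[len(LINK_PREFIX):]
        try:
            links[target] = int(value)
        except (TypeError, ValueError):
            links[target] = None
    return links
```

For the example above, the record for dataset5 would carry the extra field "links:dataset23" with the value "50000", and the parsed result could then be turned into voiD or whatever final format is needed.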

> also, before you ask, we have been
> thinking hard about moving ckan.net to a full RDF store backend

I don't really care what you use under the hood as long as there's  
some RDF on the surface. If you have a fixed schema, there's limited  
value in moving the backend to an RDF store.

So, let me know if we can get our own group, and I'll try the Pipes  
thing, and if both work out then I'd be happy to migrate the LOD Cloud  
database into CKAN.

All the best,

Linked Data Technologist • Linked Data Research Centre
Digital Enterprise Research Institute (DERI), NUI Galway, Ireland
