[okfn-discuss] CKAN getting spammed again

Thu Feb 21 14:37:24 UTC 2008

Rufus Pollock wrote:
> CKAN is getting more 'backhanded compliments' in the form of spamming:
>
>    http://www.ckan.net/revision/
>    http://www.ckan.net/revision/read/1978
>   

How charming. :-)

> Admins can keep purging this by hand [1] but this isn't hugely 
> efficient. In fact around christmas to deal with a bad repeated attack I 
> implemented some more sophisticated support ([2],[3]) including 
> blacklisting but didn't fully integrate this into controllers. Before we 
> go ahead and do more I wonder if anyone else has comments as to how best 
> to deal with spam on 'world-editable' systems. In particular what are 
> people's views on the effectiveness (and cost of implementation) of 
> things like:
>
>    * captchas (and texchas)
>   

Very effective for screening out machine clients.

>    * ip blacklisting
>   

Not so effective over time. You just keep adding addresses, and they 
just keep using new ones. Stops persistent morons, but how many of those 
are there? :-)

>    * bayesian spam filtering of some kind
>   

Makes a strong contribution, with an inevitable rate of error. Could work 
off the blog spam databases? Could be used to selectively prompt for 
moderator approval.
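
To make the "selectively prompt for moderator approval" idea concrete, 
here is a minimal sketch of a word-count naive Bayes filter in Python. 
Everything in it (the class name, `needs_moderation`, the 0.8 threshold) 
is made up for illustration and is not CKAN code; a real filter would 
need far more training data than this.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny word-count naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one example; label is "spam" or "ham"."""
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        words = self.tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so an untrained label never gives log(0).
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing; the extra +1 covers unseen words.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Turn the two log scores into a probability of "spam".
        diff = log_scores["ham"] - log_scores["spam"]
        return 1.0 / (1.0 + math.exp(min(diff, 700.0)))


def needs_moderation(spam_filter, text, threshold=0.8):
    """Hold a contribution for a moderator if it looks like spam."""
    return spam_filter.spam_probability(text) >= threshold
```

The point of the threshold is that the filter never deletes anything by 
itself; it only decides which edits a human gets asked about.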

One could think about this all day, and I'm sure the list is endless, as 
it's effectively an arms race. However...

It appears that you win decisively by having a lot of people watching 
the recent changes list all the time.

You can also reduce the incentives, for example by delaying what search 
engines get to see by 12 hours, so you have a chance to remove the spam 
before it gets indexed (the CKAN spammer's goal, probably). You might 
not do this for trusted users, i.e. users who have made some changes and 
have not added any spam. I would think there would be many variations on 
the delaying trick. One of them: not leaving the spam as a visible 
revision, which CKAN does at the moment, right?
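
A rough sketch of the 12-hour delay, in Python. The `Revision` class, 
the `visible` function and the list of crawler markers are all 
assumptions for illustration, not CKAN's actual model, and sniffing the 
User-Agent header is only one (easily fooled) way to spot a crawler.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CRAWLER_DELAY = timedelta(hours=12)
# Illustrative substrings only; a real list would need maintaining.
CRAWLER_MARKERS = ("googlebot", "slurp", "msnbot")


@dataclass
class Revision:
    author: str
    created: datetime
    flagged_as_spam: bool = False


def is_crawler(user_agent):
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def visible(revision, user_agent, now, trusted_authors=frozenset()):
    """Decide whether this client may see this revision."""
    if revision.flagged_as_spam:
        return False              # flagged spam is hidden from everybody
    if not is_crawler(user_agent):
        return True               # human readers see changes immediately
    if revision.author in trusted_authors:
        return True               # no delay for trusted contributors
    # Crawlers only see untrusted edits once the delay has passed.
    return now - revision.created >= CRAWLER_DELAY
```

With this shape, an admin has the whole delay window to flag a spam 
revision before any search engine is shown it.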

You could also have different operating modes. One might allow you to be 
alerted about contributions and then approve them. You might want to have 
somebody cover this when you are away, or make this an OKF officer 
position (humans are fairly good at classifying spam). You might want to 
be able to switch to "anything goes", or to a "lock-down" mode in case 
the site comes under attack, or if you simply feel snowed under by a 
steady stream of spam, or if you need to leave it unattended for a period 
of time. There would need to be some kind of catch-up notifications, so 
contributors know when things have gone through, or when the site is open 
again. I suppose this is called moderation. :-) Given that CKAN is a 
collection, it might also be called curation.
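
The modes could be sketched like this in Python. The `Mode` and 
`Registry` names, and the "outbox" standing in for email notifications, 
are hypothetical; this is only meant to show how the three modes and 
the catch-up notifications fit together.

```python
from enum import Enum


class Mode(Enum):
    OPEN = "anything goes"
    MODERATED = "approve first"
    LOCKED = "lock-down"


class Registry:
    def __init__(self, mode=Mode.OPEN):
        self.mode = mode
        self.published = []        # changes visible on the site
        self.pending = []          # (contributor, change) awaiting approval
        self.turned_away = set()   # contributors refused during lock-down
        self.outbox = []           # (contributor, message) notifications

    def submit(self, contributor, change):
        if self.mode is Mode.LOCKED:
            self.turned_away.add(contributor)
            return "refused"
        if self.mode is Mode.MODERATED:
            self.pending.append((contributor, change))
            return "pending"
        self.published.append(change)
        return "published"

    def approve_pending(self):
        """Moderator approves everything in the queue and notifies authors."""
        for contributor, change in self.pending:
            self.published.append(change)
            self.outbox.append((contributor, "your change has gone through"))
        self.pending = []

    def set_mode(self, mode):
        # Leaving lock-down: tell the people we turned away.
        if self.mode is Mode.LOCKED and mode is not Mode.LOCKED:
            for contributor in sorted(self.turned_away):
                self.outbox.append((contributor, "the site is open again"))
            self.turned_away.clear()
        self.mode = mode
```

For example, `Registry(Mode.MODERATED).submit("bob", "edit")` returns 
"pending", and nothing is published until `approve_pending()` is called.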

You can get Kreative, so you might require approval for changes that are 
made without an OpenID. Or when there are more than two changes. Or 
always, unless the contributor is known and trusted. Or for changes by 
somebody other than the original registrant. Or when the contribution is 
classified as spam by a pattern recognition function. Lots of 
possibilities; some may do more harm than good, and doing lots of things 
would probably be a bad thing.
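
These triggers could be kept as small, separately switchable rules, so 
that only a few are enabled at once. A sketch in Python follows; the 
`Change` fields, the rule names and the 0.8 cut-off are invented for 
illustration, and "more than two changes" is read here as a count of 
changes by the same author, which is only one possible reading.

```python
from dataclasses import dataclass


@dataclass
class Change:
    author: str
    has_openid: bool
    original_registrant: str
    change_count: int     # assumed: changes made by this author so far
    spam_score: float     # assumed: 0..1 from some classifier


# Each rule returns True when the change should wait for a moderator.
RULES = {
    "no_openid": lambda c, trusted: not c.has_openid,
    "many_changes": lambda c, trusted: c.change_count > 2,
    "not_trusted": lambda c, trusted: c.author not in trusted,
    "not_registrant": lambda c, trusted: c.author != c.original_registrant,
    "looks_like_spam": lambda c, trusted: c.spam_score >= 0.8,
}


def reasons_for_review(change, enabled, trusted=frozenset()):
    """Return the names of the enabled rules that this change trips."""
    return [name for name in enabled if RULES[name](change, trusted)]
```

An empty list means the change goes straight through; anything else 
tells the moderator why it was held.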

Overall, the main aim would be to frustrate the ways in which CKAN 
is actually spammed (whilst not destroying the purpose and openness). 
And combine this with flexibility, so you can regulate access according 
to prevailing conditions, much like a mailing list/blog/wiki.

I would guess that the CKAN spam is mostly occasional, by a human 
client, and is intended to be picked up by Google et al. So I would:

1. forget about captchas,
2. think about holding for moderator approval any contributions which 
look like spam to an automaton,
3. add a "this is spam" button which hides the spam from everybody from 
that point forward (maybe you need to support people applying to see 
hidden spam, so there is no suspicion about censoring real content, 
maybe obfuscated from search engines by Javascript),
4. maybe combine this with search engine detection, presenting crawlers 
with a delayed view of the registry.

That shouldn't cost too much, would cut out most of the spam, and would 
cut out most of the admin work. CKAN wouldn't become a "moderated" 
resource, just an "actively monitored" one.

How much does purging CKAN spam by hand "cost" the OKF? Is most of the 
spam light and irregular?

Best wishes,

John.

> ~rufus	
>
> [1]:<http://lists.okfn.org/pipermail/okfn-help/2007-October/000038.html>
> [2]:<http://knowledgeforge.net/ckan/trac/changeset/205>
> [3]:<http://knowledgeforge.net/ckan/trac/changeset/202>
>