[okfn-discuss] CKAN getting spammed again

Rufus Pollock rufus.pollock at okfn.org
Fri Feb 22 18:23:57 UTC 2008


John Bywater wrote:
[snip]

>>    * captchas (and textchas)
>>   
> 
> Very effective for screening out machine clients.

Do you mean text clients (e.g. lynx or w3m), or do you mean machine
clients (e.g. users of a web API)? In the second case I don't think
there is a problem, since this group would be given an API key and not
shown the captchas.

>>    * ip blacklisting
>>   
> 
> Not so effective over time. You just keep adding addresses, and they 
> just keep using new ones. Stops persistent morons, but how many of those 
> are there? :-)

Yes, I'd agree here. It might be useful combined with other methods,
e.g. throttling posts from the same IP, but looking at the recent attack
it is clear the IP varied quite substantially.
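
To make the throttling idea concrete, here is a minimal sketch of a
per-IP rate limit. The class name, the limit of 3 posts and the 60
second window are all assumptions for illustration, not anything CKAN
does today. As noted above, it only helps when the attacker reuses
addresses.

```python
import time
from collections import defaultdict, deque


class IpThrottle:
    """Allow at most max_posts per IP within window_seconds.

    Both defaults are illustrative assumptions, not tuned values.
    """

    def __init__(self, max_posts=3, window_seconds=60.0):
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        # ip -> timestamps of recently accepted posts, oldest first
        self._recent = defaultdict(deque)

    def allow(self, ip, now=None):
        """Return True and record the post if this IP is under its limit."""
        now = time.time() if now is None else now
        stamps = self._recent[ip]
        # Forget posts that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_posts:
            return False
        stamps.append(now)
        return True
```

A rejected post could be dropped outright or held for moderator
approval.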

>>    * bayesian spam filtering of some kind
>>   

I should also add the distinct, but related, simple text blacklist, where
you list text strings (particularly URLs) that are blacklisted. This is
used in e.g. Movable Type, MoinMoin ...
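
A text blacklist of this kind can be sketched in a few lines. The
patterns and the function name below are made up for illustration; a
real list would be maintained by whoever moderates the site, and this
is not the code Movable Type or MoinMoin actually use.

```python
import re

# Hypothetical entries, one regular expression per blacklisted string.
BLACKLIST_PATTERNS = [
    r"cheap-pills\.example\.com",
    r"casino-bonus",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in BLACKLIST_PATTERNS]


def blacklisted_matches(text):
    """Return the blacklist patterns that occur anywhere in text."""
    return [p.pattern for p in _COMPILED if p.search(text)]
```

An empty result means the text passed; a non-empty one could either
reject the change or feed into a wider score.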

> Makes a strong contribution, with inevitable rate of error. Could work 
> off the blog spamming databases? Could be used to selectively prompt for 
> moderator approval.
> 
> One could think about this all day, and I'm sure the list is endless, as 
> it's effectively an arms race. However...
> 
> It appears that you win decisively by having a lot of people watching 
> the recent changes list all the time.
> 
> You can also reduce the incentives, for example by retarding what search 
> engines get to see by 12 hours, so you have a chance to remove the spam 
> before it gets indexed (the CKAN spammer's goal, probably). You might 
> not do this for trusted users, i.e. users who have made some changes and 
> have not added any spam. I would think there would be many variations on 
> the retarding trick. Also included, not leaving the spam as a visible 
> revision; CKAN does at the moment, right?

I'm sceptical about what effect this has on the spammers, since I doubt
they bother to check whether what they post stays up or not.

> You could also have different operating modes, one might allow you to be 
> alerted about and then approve contributions. You might want to have 
> somebody cover this when you are away, or make this an OKF officer 
> position (humans are fairly good at classifying spam)? You might want to 
> be able to switch to "anything goes", or to a "lock-down" mode in case 
> it becomes under attack or if you simply feel snowed under by a steady 
> stream of spam, or if you need to leave it unattended for a period of 
> time? There would need to be some kind of catch-up notifications, so 
> contributors know when things have gone through, or the site is open 
> again. I suppose this is called moderation. :-) Given that CKAN is a 
> collection, it might also be called curation.

This is a very good point. Basically one does need volunteers to help
monitor and curate contributions.

> You can get Kreative, so you might approve changes that are made without 
> an OpenID. Or when there are more than two changes. Or always, unless 
> they are known and trusted. Or changes by somebody other than the 
> original registrant. Or when the contribution is classified as spam by a 
> pattern recognition function. Lots of possibilities, some may do more 
> harm than good, and doing lots of things would probably be a bad thing.

Basically we have a general approval function:

def approve_changes(change, context):
     ...

And this could make use of various plugins aggregating up to a 
particular score. This has some similarities to the approach taken in:

http://trac.edgewall.org/wiki/SpamFilter
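
As a rough sketch of how plugins might aggregate up to a score: each
plugin takes the same (change, context) arguments and returns some
points, and the change is approved when the total stays under a
threshold. The two plugins, the dict-shaped change and context, the
point values and the threshold are all assumptions for illustration,
and this is not a description of how the Trac plugin is implemented.

```python
def blacklist_plugin(change, context):
    """Hypothetical plugin: 5 points per blacklisted string in the text."""
    hits = sum(1 for s in context.get("blacklist", []) if s in change["text"])
    return 5 * hits


def anonymous_plugin(change, context):
    """Hypothetical plugin: 2 points if the change has no known author."""
    return 0 if change.get("author") else 2


# Assumed registry; plugins could equally be discovered at runtime.
PLUGINS = [blacklist_plugin, anonymous_plugin]


def spam_score(change, context, plugins=PLUGINS):
    """Sum the points awarded by every plugin."""
    return sum(plugin(change, context) for plugin in plugins)


def approve_changes(change, context):
    """Approve when the aggregate score stays below the threshold."""
    return spam_score(change, context) < context.get("threshold", 5)
```

Changes that fail could go to a moderation queue rather than being
discarded, which ties in with the curation point above.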

[snip]

> I would guess that the CKAN spam is mostly occasional, by a human 
> client, and is intended to be picked up by Google et al. So I would:

I don't think it is done by a human but is automated. The volume has 
been substantial on occasion.

[snip]

> How much does purging CKAN spam by hand "cost" the OKF? Is most of the 
> spam light and irregular?

At present it is fairly light on CKAN (it is worse on things like the
main wiki, and we have also had issues on occasion on KnowledgeForge),
but overall it is currently fairly manageable.

~rufus



