[okfn-discuss] CKAN getting spammed again
John Bywater
john.bywater at appropriatesoftware.net
Thu Feb 21 14:37:24 UTC 2008
Rufus Pollock wrote:
> CKAN is getting more 'backhanded compliments' in the form of spamming:
>
> http://www.ckan.net/revision/
> http://www.ckan.net/revision/read/1978
>
How charming. :-)
> Admins can keep purging this by hand [1] but this isn't hugely
> efficient. In fact around christmas to deal with a bad repeated attack I
> implemented some more sophisticated support ([2],[3]) including
> blacklisting but didn't fully integrate this into controllers. Before we
> go ahead and do more I wonder if anyone else has comments as to how best
> to deal with spam on 'world-editable' systems. In particular what are
> people's views on the effectiveness (and cost of implementation) of
> things like:
>
> * captchas (and texchas)
>
Very effective for screening out machine clients.
> * ip blacklisting
>
Not so effective over time. You just keep adding addresses, and they
just keep using new ones. Stops persistent morons, but how many of those
are there? :-)
> * bayesian spam filtering of some kind
>
Makes a strong contribution, with an inevitable rate of error. Could it
work off the blog spam databases? It could be used to selectively prompt
for moderator approval.
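As a rough illustration of using a Bayesian filter to prompt for moderator approval rather than to reject outright, here is a minimal naive Bayes sketch. Everything in it is an assumption for illustration: the class and function names, the whitespace tokenizer, the 0.9 threshold, and the idea that training examples come from edits already purged or accepted by admins. It is not CKAN code.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny word-count naive Bayes classifier with Laplace smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label is "spam" or "ham" (ham = legitimate edit)
        self.words[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior, then smoothed per-word likelihoods.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            total = sum(self.words[label].values())
            for word in text.lower().split():
                count = self.words[label][word]
                score += math.log((count + 1) / (total + len(vocab) + 1))
            log_score[label] = score
        # Normalise the two log scores into a probability.
        top = max(log_score.values())
        exp = {k: math.exp(v - top) for k, v in log_score.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])


def route(classifier, text, threshold=0.9):
    """Hold likely spam for a moderator instead of rejecting it."""
    if classifier.spam_probability(text) >= threshold:
        return "hold for moderator"
    return "publish"
```

Because the filter only decides who gets asked, a false positive costs a moderator a few seconds rather than costing a contributor their edit.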
One could think about this all day, and I'm sure the list is endless, as
it's effectively an arms race. However...
It appears that you win decisively by having a lot of people watching
the recent changes list all the time.
You can also reduce the incentives, for example by delaying what search
engines get to see by 12 hours, so you have a chance to remove the spam
before it gets indexed (the CKAN spammer's goal, probably). You might
not do this for trusted users, i.e. users who have made some changes and
have not added any spam. I would think there would be many variations on
the delaying trick. This would also mean not leaving the spam as a
visible revision; CKAN does at the moment, right?
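The delayed view for search engines could be sketched roughly as below. The revision dictionaries, the `author_trusted` field, and the user-agent substrings are all assumptions made for illustration; real crawler detection would need a better list than this, and none of it is taken from the CKAN code base.

```python
from datetime import datetime, timedelta

CRAWLER_DELAY = timedelta(hours=12)  # the 12-hour figure suggested above
# Assumed, incomplete list of user-agent substrings identifying crawlers.
CRAWLER_MARKERS = ("googlebot", "slurp", "msnbot")


def is_crawler(user_agent):
    """Guess from the User-Agent header whether the client is a crawler."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def visible_revisions(revisions, user_agent, now):
    """Return the revisions this client is allowed to see.

    Each revision is a dict with a 'timestamp' (datetime) and an
    'author_trusted' flag (bool). Ordinary visitors see everything;
    crawlers see only revisions older than the delay, or revisions
    made by trusted users.
    """
    if not is_crawler(user_agent):
        return list(revisions)
    cutoff = now - CRAWLER_DELAY
    return [
        r for r in revisions
        if r["author_trusted"] or r["timestamp"] <= cutoff
    ]
```

Ordinary visitors and admins still see every change immediately, so the people watching recent changes get their 12 hours to remove spam before a crawler can index it.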
You could also have different operating modes: one might allow you to be
alerted about contributions and then approve them. You might want to have
somebody cover this when you are away, or make this an OKF officer
position (humans are fairly good at classifying spam)? You might want to
be able to switch to "anything goes", or to a "lock-down" mode in case
CKAN comes under attack, or if you simply feel snowed under by a steady
stream of spam, or if you need to leave it unattended for a period of
time? There would need to be some kind of catch-up notifications, so
contributors know when things have gone through, or the site is open
again. I suppose this is called moderation. :-) Given that CKAN is a
collection, it might also be called curation.
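The operating modes and catch-up notifications might look something like the following sketch. The `Registry` class, the mode names, and the in-memory lists are hypothetical stand-ins chosen for illustration, not anything that exists in CKAN.

```python
from enum import Enum


class Mode(Enum):
    ANYTHING_GOES = "anything goes"
    MODERATED = "moderated"
    LOCK_DOWN = "lock-down"


class Registry:
    """Toy registry whose handling of edits depends on the current mode."""

    def __init__(self, mode=Mode.ANYTHING_GOES):
        self.mode = mode
        self.published = []      # (author, change) pairs that are live
        self.pending = []        # (author, change) pairs awaiting approval
        self.turned_away = []    # authors refused during lock-down
        self.notifications = []  # (author, message) pairs to send

    def submit(self, author, change):
        if self.mode is Mode.LOCK_DOWN:
            self.turned_away.append(author)
            return "rejected"
        if self.mode is Mode.MODERATED:
            self.pending.append((author, change))
            return "pending"
        self.published.append((author, change))
        return "published"

    def approve_all(self):
        # Publish everything in the queue and tell each contributor.
        while self.pending:
            author, change = self.pending.pop(0)
            self.published.append((author, change))
            self.notifications.append((author, "your change has gone through"))

    def set_mode(self, mode):
        reopened = self.mode is Mode.LOCK_DOWN and mode is not Mode.LOCK_DOWN
        self.mode = mode
        if reopened:
            # Catch-up notification for people refused during lock-down.
            for author in self.turned_away:
                self.notifications.append((author, "the site is open again"))
            self.turned_away = []
```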
You can get Kreative, so you might require approval for changes that are
made without an OpenID. Or when there are more than two changes. Or
always, unless the contributors are known and trusted. Or for changes by
somebody other than the original registrant. Or when the contribution is
classified as spam by a pattern recognition function. Lots of
possibilities, some may do more harm than good, and doing lots of things
at once would probably be a bad thing.
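One way to keep these approval rules switchable, so that only a few are active at any time, is a small table of named predicates. The `Change` fields and rule names below are invented for illustration; how CKAN would actually know about OpenID use or the original registrant is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Change:
    author: str
    has_openid: bool
    author_trusted: bool
    changes_in_session: int
    package_registrant: str
    looks_like_spam: bool  # output of some pattern recognition function


# Each rule returns True when the change should be held for approval.
RULES = {
    "no_openid": lambda c: not c.has_openid,
    "many_changes": lambda c: c.changes_in_session > 2,
    "untrusted": lambda c: not c.author_trusted,
    "not_registrant": lambda c: c.author != c.package_registrant,
    "classified_spam": lambda c: c.looks_like_spam,
}


def triggered_rules(change, enabled):
    """Return the names of the enabled rules that this change trips."""
    return [name for name in enabled if RULES[name](change)]


def needs_approval(change, enabled):
    return bool(triggered_rules(change, enabled))
```

Passing a short `enabled` list, for example just `["classified_spam"]`, keeps the policy easy to reason about and easy to relax when conditions improve.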
Overall, the main aim would be to frustrate the ways in which CKAN
is actually spammed (whilst not destroying the purpose and openness).
And combine this with flexibility, so you can regulate access according
to prevailing conditions, much like a mailing list/blog/wiki.
I would guess that the CKAN spam is mostly occasional, by a human
client, and is intended to be picked up by Google et al. So I would:
1. forget about captchas,
2. think about holding for approval by a moderator any contributions
which look like spam to an automaton,
3. add a "flag this as spam" button which prevents anybody from seeing
the spam from that point forward (maybe you need to support people
applying to see hidden spam, so there is no suspicion about censoring
real content, maybe obfuscated from search engines by Javascript),
4. maybe combine this with search engine detection and presentation to
them of a delayed view of the registry.
That shouldn't cost too much, would cut out most of the spam, and would
cut out most of the admin work. It wouldn't have become a "moderated"
resource, just "actively monitored".
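The spam button from point 3 amounts to hiding a revision rather than purging it, which a sketch like this illustrates. The `RevisionLog` class and the `may_see_hidden` flag are hypothetical; how someone applies for and is granted access to hidden spam is left open.

```python
class RevisionLog:
    """Toy revision store where flagged revisions are hidden, not deleted."""

    def __init__(self):
        self.revisions = {}   # revision id -> text
        self.flagged = set()  # ids that somebody has marked as spam

    def add(self, rev_id, text):
        self.revisions[rev_id] = text

    def flag_as_spam(self, rev_id):
        if rev_id not in self.revisions:
            raise KeyError(rev_id)
        self.flagged.add(rev_id)

    def unflag(self, rev_id):
        # Lets a moderator restore real content that was flagged by mistake.
        self.flagged.discard(rev_id)

    def visible(self, may_see_hidden=False):
        """Revisions shown to a reader; hidden spam only on request."""
        return {
            rev_id: text
            for rev_id, text in self.revisions.items()
            if may_see_hidden or rev_id not in self.flagged
        }
```

Keeping the flagged text in the store is what makes it possible to show it to people who ask, so there is no suspicion about censoring real content.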
How much does purging CKAN spam by hand "cost" the OKF? Is most of the
spam light and irregular?
Best wishes,
John.
> ~rufus
>
> [1]:<http://lists.okfn.org/pipermail/okfn-help/2007-October/000038.html>
> [2]:<http://knowledgeforge.net/ckan/trac/changeset/205>
> [3]:<http://knowledgeforge.net/ckan/trac/changeset/202>
>