[ckan-dev] moderated edits c(r)ep. http://trac.ckan.org/ticket/1129

David Raznick kindly at gmail.com
Fri May 13 00:32:41 UTC 2011


Firstly, the plan on pending changes should be simpler that what was
outlined in that thread.   4 states seem adequate.

* 'active'  (current approved)
* 'pending-approval'  (changes that have not been looked at by the
moderator)
* 'completed-approval' (changes that have been looked at by the moderator)
* 'deleted'

Only the moserator/sysadmin can delete.   The moderators can go through the
pending changes and apply them as they see fit.  When they apply them the
revision gets marked as completed-approval and a new active rows are made.

The continuity object just holds the latest change independent of the state
above.  This may cause some difficulty to easily query the database for
'active' revisions but *most* of the time we would hope that the community
edits would be accepted, so it seems that the latest edit in the continuity
table makes sense.

One big feeling I have on this is that we should resolve #1077 as a
> pre-requisite for this. I've just done a *big* revision of #1077 and
> created tickets detailing proposed changes to vdm. I think you may now
> agree with the conclusion that is there :-)
>

I actually think that it needs to be changed less than this.  Just the
relationship handling of state needs to be removed.  I am happy with state
handling on a per table row basis.

> The changset model would work with the following provisos.
>
> * No foreign key can change.

Why? (not disagreeing just not sure i understand)
>

No foreign key can change if you want a way to query the revision history
easily.  You will have to join on a column in the change_object table. See
queries below.


> > * We are willing to do an extra join for each relationship we have. i.e
> the
> > join will firstly be to continuity object and then back to the
> change_object
> > table.
> > * No continuity object can get deleted.
>
> This I don't understand so again please clarify.

I'd assume the join
> would be lazy rather than an FK relationship (I think we *need* to

remove FKs from revision objects / changeobjects to continuity
> whatever route we go down).
>

Do we?  I do not think we do.  I am happy to keep the latest change in the
continuity table.  I do not think we should purge things out of the
continuity tables.


>
> > These I admit are not too unreasonable.  However, the querying due to the
> > extra join is more cumbersome.  It would most likely be slower too, due
> to
> > extra join and because you would be always joining back to the *very
> large*
> > change_object table.
>
> Yes, though one that was indexed.
>
> Rufus
>

It seems very difficult to make myself clear.

 I think the major misunderstanding is that my proposal 2 relies on
sqlalchemy very little.  We would need two write hand written sql for all
our data views.

I am certain this proposal does not need any change in the vdm to work
either, it is just a way of changing the queries to fit our needs.

We would need to change the way we *use* vdm and that's all (removing
stateful relations).  We can manually change the state 'active' flag to
'pending-approval' and the vdm will accept that and copy that to the
revision table.

To give an example of proposal 2.  Say we have the package dict example
earlier in the thread. Our objective to get the dict out under under the
following conditions.

* The latest active edit.  (last approved edit)
* The latest edit.
* All the unapproved edits (moderation queue).
* A point in history/time or a particular revision.

We would do this as follows, I will use querying technique "distinct on" as
discussed in the crep as its the clearest. I did not know it existed before
looking into this and its very useful.


***** Latest active edit. i.e what we get out currently. ******

We will first need to get out the correct package object. To do this we will
query.

QUERY 1
    select distinct on (id) id, timestamp, name, ... from
    package_revision where state = 'active' and id = 'fdsfs' order by
timestamp desc

('...' stands for other columns,  I will also assume the timestamp is cached
on all the revision tables for the sake of not writing the join to the
revision table each time)

We will then get out all the resources.  This is a many_to_many so we assume
there is a package_resource table.

QUERY 2
    select distinct on (resource.id) id, url, ... from
package_resource_revision
    join resource_revision on resource.id = package_resource.resource_id
    where package_resource_revision.timestamp < timestamp_got_above and
package_resource_revision.timestamp < timestamp_got_above
    order by package_resource.timestamp desc, resource.timestamp desc

We would then need to filter out the deleted objects in python.

***** The latest edit. *****

We can get this out from the continuity tables as we do currently.

***** All the unapproved edits. *****

We will need a list of all the unapproved edits to do this we need the
distinct revisions affecting all the tables.  We will get out these
revisions like we do currently. i.e
QUERY 3
   select revision_id, timestamp from package_revision where state =
'pending-approval' where id = ' fdsfs'
   union
   select revision_id, timestamp from package_resource_revision where state
= 'pending-approval' and package_id = ' fdsfs'
   union
   select revision_id, timestamp from package_resource_revision join
resource_revision on package_resource_revision.resource_id =
resource_revision.id where resource_revision.state = 'pending-approval' and
package_id = ' fdsfs'

This sorted by timestamp gives us our moderater queue. For each item in the
queue we will need to get out the package like so.

QUERY 4
   select distinct on (id) id, timestamp, name, ... from
   package_revision where id = 'fdsfs' where timestamp <=
timestamp_got_above  order by timestamp desc

and to get out the resources for each timestamp we can do the same QUERY 2
above.


***** A point in history/time or a particular revision. *****

This is essentially covered by QUERY 4 and QUERY 2 above at a particular
time.



These queries are very difficult to do with the changeset_objects tables as
we refer to foreign keys in the revision tables and these foreign keys are
in the change_objects.   If you can come up with a query strategy using the
changeset vdm I would be very interested to hear:)

That has probably not helped too much in understanding, but I hope it
clarifies my thinking on proposal 2 at least a bit.  I do not think proposal
2 is very possible with the changeset vdm model.

Proposal 1 is still an option but, as I said, if we do that I would want
a revisioning system as well (I think it would be a shame to have the
potential to loose any history at all).  The changeset vdm would fit
proposal 1 slightly better, but is not needed for it.

This whole issue or revisioning interests me very much and I have spent too
much time thinking about it.

Relational databases are not the best fit for revisioning there is a
mismatch to how they work on relations. Document databases are, but the
model is too rigid and cant be queried nicely.

Graph databases like neo4j i.e ones that have a dict per node and edge would
work wonders because we would effectively have the vdm changeset model
whilst keeping all the relationships still attached and easily traversed.  I
am not suggesting we use it though :)

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/ckan-dev/attachments/20110513/119c15fd/attachment-0001.html>


More information about the ckan-dev mailing list