[ckan-dev] moderated edits c(r)ep. http://trac.ckan.org/ticket/1129

Fri May 13 10:02:08 UTC 2011

On 13 May 2011 01:32, David Raznick <kindly at gmail.com> wrote:
> Firstly, the plan on pending changes should be simpler that what was
> outlined in that thread.   4 states seem adequate.

Slightly messed up: should have sent start of that part of the thread:

<http://lists.okfn.org/pipermail/okfn-help/2010-October/000871.html>

Also this wasn't meant to be prescriptive (ie. let's do it like this)
but here's some things we thought about before and issues we came up
with.

> * 'active'  (current approved)
> * 'pending-approval'  (changes that have not been looked at by the
> moderator)
> * 'completed-approval' (changes that have been looked at by the moderator)
> * 'deleted'
>
> Only the moserator/sysadmin can delete.   The moderators can go through the
> pending changes and apply them as they see fit.  When they apply them the
> revision gets marked as completed-approval and a new active rows are made.

New active row where?

> The continuity object just holds the latest change independent of the state
> above.  This may cause some difficulty to easily query the database for
> 'active' revisions but *most* of the time we would hope that the community
> edits would be accepted, so it seems that the latest edit in the continuity
> table makes sense.

Why do this rather than having current active revision? It would seem
more natural for continuity to act like the continuity? (I know there
are problems for initial case where object gets created but the
created revision is pending -- in those circumstances we either a)
don't have an entry in continuity (problematic because revision has
nothing to reference or b) have something in continuity marked as
pending)

>> One big feeling I have on this is that we should resolve #1077 as a
>> pre-requisite for this. I've just done a *big* revision of #1077 and
>> created tickets detailing proposed changes to vdm. I think you may now
>> agree with the conclusion that is there :-)
>
> I actually think that it needs to be changed less than this.  Just the
> relationship handling of state needs to be removed.  I am happy with state
> handling on a per table row basis.

To parse that I think that means you agree with ticket 1137 but want
limited version. The question I still think we need to clarify is why
keep state in main continuity objects? If it were removed we would
have a very simple model where revisionining had no impact on your
normal tables. Of course, one answer is that we do need it for pending
stuff if we go with option (b).

>> The changset model would work with the following provisos.
>>
>> * No foreign key can change.
>
>> Why? (not disagreeing just not sure i understand)
>
> No foreign key can change if you want a way to query the revision history
> easily.  You will have to join on a column in the change_object table. See
> queries below.

I still don't understand here. Which FKs (all of them) and do you mean
FK or the PK pointed to by a FK or ....? I don't know whether this is
important but I don't understand :)

>> > * We are willing to do an extra join for each relationship we have. i.e
>> > the
>> > join will firstly be to continuity object and then back to the
>> > change_object
>> > table.
>> > * No continuity object can get deleted.
>>
>> This I don't understand so again please clarify.
>>
>> I'd assume the join
>> would be lazy rather than an FK relationship (I think we *need* to
>>
>> remove FKs from revision objects / changeobjects to continuity
>> whatever route we go down).
>
> Do we?  I do not think we do.  I am happy to keep the latest change in the
> continuity table.  I do not think we should purge things out of the
> continuity tables.

Why not? It leads to a conceptually simpler model ... (continuity
behaves completely like a normal object and client code for that table
does not need to keep filtering on 'active').

[...]

> It seems very difficult to make myself clear.
>  I think the major misunderstanding is that my proposal 2 relies on
> sqlalchemy very little.  We would need two write hand written sql for all
> our data views.

Wow, ok :-) Does this mean we are *definitely* and forever tied to
postgres? We do run some our tests now on sqlite, a bunch of work was
done on making vdm mysql compatible (though changes are not yet
integrated). I don't object to this decision but we should be clear if
that is what we are doing.

> I am certain this proposal does not need any change in the vdm to work
> either, it is just a way of changing the queries to fit our needs.

Right but in some ways it is simply routing around vdm :-) (that's why
it doesn't need any changes). But that's not a big problem ...

> We would need to change the way we *use* vdm and that's all (removing
> stateful relations).  We can manually change the state 'active' flag to
> 'pending-approval' and the vdm will accept that and copy that to the
> revision table.

Sure I understand. But you do seem to indicate that we want to remove
stateful on relations. This plus SessionExtension change is all 1077
now involves (!) so it does seem we are in agreement :-)

> To give an example of proposal 2.  Say we have the package dict example
> earlier in the thread. Our objective to get the dict out under under the
> following conditions.
>
> * The latest active edit.  (last approved edit)
> * The latest edit.
> * All the unapproved edits (moderation queue).
> * A point in history/time or a particular revision.
>
> We would do this as follows, I will use querying technique "distinct on" as
> discussed in the crep as its the clearest. I did not know it existed before
> looking into this and its very useful.

Aside: I don't think we generally need to discuss 'implementation
details' like exact sql queries for a CREP like this -- I have
complete faith that you know how to implement. What's more important
is what we have discussed above -- the use cases, what kind of ops are
required etc. That said I understand here that these are key queries
and this is relevant to option 1 vs option 2.

> ***** Latest active edit. i.e what we get out currently. ******
>
> We will first need to get out the correct package object. To do this we will
> query.
>
> QUERY 1
>     select distinct on (id) id, timestamp, name, ... from
>     package_revision where state = 'active' and id = 'fdsfs' order by
> timestamp desc
>
> ('...' stands for other columns,  I will also assume the timestamp is cached
> on all the revision tables for the sake of not writing the join to the
> revision table each time)
>
> We will then get out all the resources.  This is a many_to_many so we assume
> there is a package_resource table.
>
> QUERY 2
>     select distinct on (resource.id) id, url, ... from
> package_resource_revision
>     join resource_revision on resource.id = package_resource.resource_id
>     where package_resource_revision.timestamp < timestamp_got_above
> and package_resource_revision.timestamp < timestamp_got_above
>     order by package_resource.timestamp desc, resource.timestamp desc
>
> We would then need to filter out the deleted objects in python.
>
> ***** The latest edit. *****
>
> We can get this out from the continuity tables as we do currently.

I *really* think you want the *active* item 'cached' on the continuity
not the latest edit -- most operations are going to be read operations
...

> ***** All the unapproved edits. *****
>
> We will need a list of all the unapproved edits to do this we need the
> distinct revisions affecting all the tables.  We will get out these
> revisions like we do currently. i.e
> QUERY 3
>    select revision_id, timestamp from package_revision where state =
> 'pending-approval' where id = ' fdsfs'
>    union
>    select revision_id, timestamp from package_resource_revision where state
> = 'pending-approval' and package_id = ' fdsfs'
>    union
>    select revision_id, timestamp from package_resource_revision join
> resource_revision on package_resource_revision.resource_id =
> resource_revision.id where resource_revision.state = 'pending-approval' and
> package_id = ' fdsfs'
>
> This sorted by timestamp gives us our moderater queue. For each item in the
> queue we will need to get out the package like so.
> QUERY 4
>    select distinct on (id) id, timestamp, name, ... from
>    package_revision where id = 'fdsfs' where timestamp <=
> timestamp_got_above  order by timestamp desc
>
> and to get out the resources for each timestamp we can do the same QUERY 2
> above.
>
> ***** A point in history/time or a particular revision. *****
>
> This is essentially covered by QUERY 4 and QUERY 2 above at a particular
> time.

An immediate question I have here is why aren't we joining on the
revision table. People should approve revisions surely (with all
changes associated to that revision)  rather than individual parts of
a revision. A Revision is the atomic change to the system -- you don't
pick bits of it.

You know "Revision" can have state right?

> These queries are very difficult to do with the changeset_objects tables as
> we refer to foreign keys in the revision tables and these foreign keys are
> in the change_objects.   If you can come up with a query strategy using the
> changeset vdm I would be very interested to hear:)

This isn't really relevant as I think we have agreed we aren't doing
changeset model but ...

Qu: The latest active edit.  (last approved edit)

Would be on continuity ...

* The latest edit (to a given object or just generally?)

Generally:

SELECT FROM changeset
  JOIN changeobject
  ORDER BY ...

Specific object (e.g. package with id x):

SELECT FROM changeset
  JOIN changeobject
  WHERE changeobject.table = 'package' AND changeobject.id = package.id

* All the unapproved edits (moderation queue).

just add pending filter to above ...

* A point in history/time or a particular revision.

straightforward i think ...

[...]

Rufus