[ckan-dev] Perfomance work, review requested

Fri Jul 5 11:59:31 UTC 2013

On Thu, Jul 4, 2013 at 6:42 PM, Ian Ward <ian at excess.org> wrote:

> Thank you David, this is exactly what I was hoping for.
>
> On Thu, Jul 4, 2013 at 10:57 AM, David Raznick <david.raznick at okfn.org>
> wrote:
> > On Wed, Jul 3, 2013 at 10:41 PM, Ian Ward <ian at excess.org> wrote:
> >> 53a4e5c (1.9x faster): use the data_dict in solr instead of dictizing
> >> the models from the DB, when possible. I might not have the "when
> >> possible" part correct here.
> >
> > This is great and this change has been considered for a while.   Leaning
> on
> > solr for speed I think is really good.
> >
> > My main concern with not doing this already has not been covered by your
> > pull request and may require a little thought:
> >
> > There are options to make solr commit asynchronously when saving a
> dataset
> > i.e not waiting for the commit to happen in the same request that updated
> > the dataset.  So by making solr the source of the data_dict, when
> displaying
> > the dataset, just after saving it, it will show the old version.  This
> will
> > be very confusing for the user.   We have that problem now, but it just
> > takes a while to get on the search listings, which is less of a bother
> (as
> > people will not immediately go and search for it).  Also I have general
> > fears of making solr the canonical source of the data_dict especially for
> > editing, just on the very off chance they are out of sync.   Nonetheless
> > these probably have workarounds, by probably have a context option to say
> > when it is appropriate to use solr and when it is better to use the db.
>
> Since this action needs to get the dataset from the DB anyway, maybe I
> can just compare the revision with the one from SOLR?
>

Sadly this is not possible as the revision against the package only relates
to the row in the package table itself not everything the dataset
represents.  i.e updating a dataset by changing a resource would not change
this revision.  I still think we need to isolate the times when this is
good and when its bad within a context option, even though this is a bit
ugly.
I think the only way todo it is make it solr by default but exclude them in
the following places
* The view frontend conttroller
* The edit frontend controllers
* The calls at the end of package_update, package_create, resource_update,
package_update

If we can find ways round these later we can add them to using solr.  So a
context option like dataset_from_db seems like the only way todo this.
This will also mean that certain instances can force this option on always
by overwriting the logic functions if wanted.

>
> > This also plays badly with the before_view extension point (which
> admittedly
> > is a bad one).  If you look at the package search is does this after
> > receiving the raw data_dict from the search index:
> >
> > https://github.com/okfn/ckan/blob/master/ckan/logic/action/get.py#L1374
>
> I will look at this, thanks.
>
> >> 973bb8c (4.9x faster): store the package_show-schema validated version
> >> in SOLR data_dict to reduce the work when calling package_show. This
> >> moves some work to the when packages are updated and created, but I
> >> expect that this penalty can be removed because we probably have
> >> already just generated a validated version of the package (no
> >> optimization has been done here yet).
> >
> > I imagine the difference will not be so big for less customized schemas.
>
> True, I should time the standard view schema as well.  Of course, I'm
> most interested in how the performance affects *my* real-world use :-)
>
> > I can not get my head round how this effects the before_view extension
> point
> > but I am sure it could break it in certain circumstances.  Not too
> worried
> > about the breakage though.
> >
> > There are also could be issues with some extensions using the
> validate=False
> > config flag and I do not think this honoured by this pull request.
> >
> > I would be more inclined to have a copy of both validated and unvalidated
> > data_dicts in solr which would make this possible.  (not too worried
> about
> > space issues)
>
> I'd be happy with that solution. I'll work on that.
>
> >> f2a4822 (8x faster): allow actions to return a json string instead of
> >> decoded json data and pass that directly to the caller, skipping the
> >> work decoding json just to re-encode it on the other end. This might
> >> not be the best implementation, but it does offer an extra 60%
> >> improvement, and could be useful for other API calls too.
> >>
> > If anyone uses before_view, this breaks it definitely.  Do not like the
> way
> > this implemented either and the placeholder should be a bit uglier and
> > longer at least.
>
> Sure, why not. The placeholder is just to not match anything in the
> (mostly static) help text.
>
> For before_view maybe I could create a "lazy json" class that decodes
> the json and allows updating the first time someone iterates over it
> or accesses an element, but otherwise leaves the string as-is. This
> being Python I can't perfectly emulate a dict, but I might be able to
> get close enough to not break existing code and still get the speed
> benefit.
>

I am not too happy about this due to its complexity.  Could we not try and
to use an (optional) faster json library instead (ujson or simplejson) .

>
> Ian
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/ckan-dev/attachments/20130705/aa4f1b9b/attachment-0001.html>