[ckan-dev] Performance work, review requested

Thu Jul 4 17:42:38 UTC 2013

Thank you David, this is exactly what I was hoping for.

On Thu, Jul 4, 2013 at 10:57 AM, David Raznick <david.raznick at okfn.org> wrote:
> On Wed, Jul 3, 2013 at 10:41 PM, Ian Ward <ian at excess.org> wrote:
>> 53a4e5c (1.9x faster): use the data_dict in solr instead of dictizing
>> the models from the DB, when possible. I might not have the "when
>> possible" part correct here.
>
> This is great and this change has been considered for a while.   Leaning on
> solr for speed I think is really good.
>
> My main concern, and the reason we have not done this already, is not
> covered by your pull request and may require a little thought:
>
> There are options to make solr commit asynchronously when saving a dataset,
> i.e. not waiting for the commit to happen in the same request that updated
> the dataset.  So by making solr the source of the data_dict, when displaying
> the dataset just after saving it, it will show the old version.  This will
> be very confusing for the user.  We have that problem now, but it just means
> a dataset takes a while to appear in the search listings, which is less of a
> bother (as people will not immediately go and search for it).  Also I have
> general fears of making solr the canonical source of the data_dict,
> especially for editing, on the very off chance they are out of sync.
> Nonetheless these probably have workarounds, probably by having a context
> option to say when it is appropriate to use solr and when it is better to
> use the db.

Since this action needs to get the dataset from the DB anyway, maybe I
can just compare the revision with the one from SOLR?
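
A minimal sketch of that check, in Python 3 syntax. All names here
(`choose_data_dict`, `solr_doc`, `dictize_from_db`, the `revision_id` and
`data_dict` keys) are hypothetical stand-ins, not the actual code in the
pull request:

```python
def choose_data_dict(db_revision_id, solr_doc, dictize_from_db):
    """Use the dict cached in the search index only if it is current.

    db_revision_id:  revision id of the package row just loaded from the DB
    solr_doc:        the indexed document (a dict), or None if not indexed
    dictize_from_db: zero-argument callable that builds the dict the slow way
    """
    if solr_doc is not None and solr_doc.get("revision_id") == db_revision_id:
        # Index is in sync with the DB: take the fast path.
        return solr_doc["data_dict"]
    # Index is missing or stale (e.g. an async commit has not landed yet).
    return dictize_from_db()
```

With this shape a stale index only costs speed, never correctness: the
request falls back to dictizing from the DB whenever the revisions differ.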

> This also plays badly with the before_view extension point (which admittedly
> is a bad one).  If you look at the package search, it does this after
> receiving the raw data_dict from the search index:
>
> https://github.com/okfn/ckan/blob/master/ckan/logic/action/get.py#L1374

I will look at this, thanks.

>> 973bb8c (4.9x faster): store the package_show-schema validated version
>> in SOLR data_dict to reduce the work when calling package_show. This
>> moves some work to when packages are updated and created, but I
>> expect that this penalty can be removed because we probably have
>> already just generated a validated version of the package (no
>> optimization has been done here yet).
>
> I imagine the difference will not be so big for less customized schemas.

True, I should time the standard view schema as well.  Of course, I'm
most interested in how the performance affects *my* real-world use :-)

> I cannot get my head round how this affects the before_view extension point
> but I am sure it could break it in certain circumstances.  Not too worried
> about the breakage though.
>
> There could also be issues with some extensions using the validate=False
> config flag, and I do not think this is honoured by this pull request.
>
> I would be more inclined to have a copy of both validated and unvalidated
> data_dicts in solr which would make this possible.  (not too worried about
> space issues)

I'd be happy with that solution. I'll work on that.
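
Roughly what I have in mind, as a sketch only. The field names
(`data_dict`, `validated_data_dict`) and both helper functions are
assumptions for illustration, and `validate` stands in for whatever runs
the package_show schema:

```python
import json


def build_index_doc(pkg_dict, validate):
    """Build the document sent to the index, carrying both versions."""
    return {
        "id": pkg_dict["id"],
        # Raw dictized package, for callers that ask for validate=False.
        "data_dict": json.dumps(pkg_dict),
        # Schema-validated version, paid for once at index time.
        "validated_data_dict": json.dumps(validate(pkg_dict)),
    }


def load_from_index(doc, use_validated=True):
    """Return the stored dict matching the caller's validate setting."""
    field = "validated_data_dict" if use_validated else "data_dict"
    return json.loads(doc[field])
```

The cost is roughly double the stored JSON per package, in exchange for
honouring both settings without re-running validation on every read.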

>> f2a4822 (8x faster): allow actions to return a json string instead of
>> decoded json data and pass that directly to the caller, skipping the
>> work decoding json just to re-encode it on the other end. This might
>> not be the best implementation, but it does offer an extra 60%
>> improvement, and could be useful for other API calls too.
>>
> If anyone uses before_view, this definitely breaks it.  I do not like the
> way this is implemented either, and the placeholder should be a bit uglier
> and longer at least.

Sure, why not. The placeholder is just to not match anything in the
(mostly static) help text.

For before_view maybe I could create a "lazy json" class that decodes
the json and allows updating the first time someone iterates over it
or accesses an element, but otherwise leaves the string as-is. This
being Python I can't perfectly emulate a dict, but I might be able to
get close enough to not break existing code and still get the speed
benefit.
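
A rough sketch of the idea, written against Python 3's
`collections.abc.MutableMapping`. The class name and the `to_json_string`
method are made up for illustration; nothing like this exists in the
branch yet:

```python
import json
from collections.abc import MutableMapping


class LazyJSONObject(MutableMapping):
    """Dict-like wrapper that decodes its JSON string only on first access."""

    def __init__(self, json_string):
        self._json_string = json_string
        self._json_dict = None

    def _loads(self):
        # Decode once; after that the original string may be out of date,
        # so drop it and treat the dict as the source of truth.
        if self._json_dict is None:
            self._json_dict = json.loads(self._json_string)
            self._json_string = None
        return self._json_dict

    def __getitem__(self, key):
        return self._loads()[key]

    def __setitem__(self, key, value):
        self._loads()[key] = value

    def __delitem__(self, key):
        del self._loads()[key]

    def __iter__(self):
        return iter(self._loads())

    def __len__(self):
        return len(self._loads())

    def to_json_string(self):
        # Fast path: nobody touched the data, so return the string as-is.
        if self._json_string is not None:
            return self._json_string
        return json.dumps(self._json_dict)
```

The known gap is that `isinstance(obj, dict)` is False, so code that
type-checks for a real dict would still need a conversion step.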

Ian