[ckan-dev] Perfomance work, review requested

Toby Dacre toby.okfn at gmail.com
Thu Jul 18 09:35:26 UTC 2013


On 5 July 2013 12:59, David Raznick <david.raznick at okfn.org> wrote:
>
>
>
> On Thu, Jul 4, 2013 at 6:42 PM, Ian Ward <ian at excess.org> wrote:
>>
>> Thank you David, this is exactly what I was hoping for.
>>
>> On Thu, Jul 4, 2013 at 10:57 AM, David Raznick <david.raznick at okfn.org>
>> wrote:
>> > On Wed, Jul 3, 2013 at 10:41 PM, Ian Ward <ian at excess.org> wrote:
>> >> 53a4e5c (1.9x faster): use the data_dict in solr instead of dictizing
>> >> the models from the DB, when possible. I might not have the "when
>> >> possible" part correct here.
>> >
>> > This is great and this change has been considered for a while.   Leaning
>> > on
>> > solr for speed I think is really good.
>> >
>> > My main concern with not doing this already has not been covered by your
>> > pull request and may require a little thought:
>> >
>> > There are options to make solr commit asynchronously when saving a
>> > dataset
>> > i.e not waiting for the commit to happen in the same request that
>> > updated
>> > the dataset.  So by making solr the source of the data_dict, when
>> > displaying
>> > the dataset, just after saving it, it will show the old version.  This
>> > will
>> > be very confusing for the user.   We have that problem now, but it just
>> > takes a while to get on the search listings, which is less of a bother
>> > (as
>> > people will not immediately go and search for it).  Also I have general
>> > fears of making solr the canonical source of the data_dict especially
>> > for
>> > editing, just on the very off chance they are out of sync.   Nonetheless
>> > these probably have workarounds, by probably have a context option to
>> > say
>> > when it is appropriate to use solr and when it is better to use the db.
>>
>> Since this action needs to get the dataset from the DB anyway, maybe I
>> can just compare the revision with the one from SOLR?
>
>
> Sadly this is not possible as the revision against the package only relates
> to the row in the package table itself not everything the dataset
> represents.  i.e updating a dataset by changing a resource would not change
> this revision.  I still think we need to isolate the times when this is good
> and when its bad within a context option, even though this is a bit ugly.
> I think the only way todo it is make it solr by default but exclude them in
> the following places
> * The view frontend conttroller
> * The edit frontend controllers
> * The calls at the end of package_update, package_create, resource_update,
> package_update
>
> If we can find ways round these later we can add them to using solr.  So a
> context option like dataset_from_db seems like the only way todo this.
> This will also mean that certain instances can force this option on always
> by overwriting the logic functions if wanted.
>
>>
>>
>> > This also plays badly with the before_view extension point (which
>> > admittedly
>> > is a bad one).  If you look at the package search is does this after
>> > receiving the raw data_dict from the search index:
>> >
>> > https://github.com/okfn/ckan/blob/master/ckan/logic/action/get.py#L1374
>>
>> I will look at this, thanks.
>>
>> >> 973bb8c (4.9x faster): store the package_show-schema validated version
>> >> in SOLR data_dict to reduce the work when calling package_show. This
>> >> moves some work to the when packages are updated and created, but I
>> >> expect that this penalty can be removed because we probably have
>> >> already just generated a validated version of the package (no
>> >> optimization has been done here yet).
>> >
>> > I imagine the difference will not be so big for less customized schemas.
>>
>> True, I should time the standard view schema as well.  Of course, I'm
>> most interested in how the performance affects *my* real-world use :-)
>>
>> > I can not get my head round how this effects the before_view extension
>> > point
>> > but I am sure it could break it in certain circumstances.  Not too
>> > worried
>> > about the breakage though.
>> >
>> > There are also could be issues with some extensions using the
>> > validate=False
>> > config flag and I do not think this honoured by this pull request.
>> >
>> > I would be more inclined to have a copy of both validated and
>> > unvalidated
>> > data_dicts in solr which would make this possible.  (not too worried
>> > about
>> > space issues)
>>
>> I'd be happy with that solution. I'll work on that.
>>
>> >> f2a4822 (8x faster): allow actions to return a json string instead of
>> >> decoded json data and pass that directly to the caller, skipping the
>> >> work decoding json just to re-encode it on the other end. This might
>> >> not be the best implementation, but it does offer an extra 60%
>> >> improvement, and could be useful for other API calls too.
>> >>
>> > If anyone uses before_view, this breaks it definitely.  Do not like the
>> > way
>> > this implemented either and the placeholder should be a bit uglier and
>> > longer at least.
>>
>> Sure, why not. The placeholder is just to not match anything in the
>> (mostly static) help text.
>>
>> For before_view maybe I could create a "lazy json" class that decodes
>> the json and allows updating the first time someone iterates over it
>> or accesses an element, but otherwise leaves the string as-is. This
>> being Python I can't perfectly emulate a dict, but I might be able to
>> get close enough to not break existing code and still get the speed
>> benefit.
>
>
> I am not too happy about this due to its complexity.  Could we not try and
> to use an (optional) faster json library instead (ujson or simplejson) .
>

currently we use simplejson
https://github.com/okfn/ckan/blob/master/ckan/common.py#L12 but
changing would be easy


>>
>>
>> Ian
>>
>> _______________________________________________
>> ckan-dev mailing list
>> ckan-dev at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/ckan-dev
>> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev
>
>
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev
>



-- 
Toby Dacre

The Open Knowledge Foundation

Empowering through Open Knowledge
http://okfn.org/  |  @okfn




More information about the ckan-dev mailing list