[ckan-dev] key value store, caching and redis

Tue Feb 1 14:40:37 UTC 2011

On 1 February 2011 13:17, David Raznick <kindly at gmail.com> wrote:
>
>
> On Tue, Feb 1, 2011 at 11:43 AM, Seb Bacon <seb.bacon at gmail.com> wrote:
>>
>> Hi all,
>>
>> I suspect this thread is now a bit stale and the decision has been
>> taken,
>
> I hope it hasn't because I am not even sure about it :)
>
>>
>> but just in the spirit of being devil's advocate...  It seems
>> that the principal justification for introducing another dependency
>> into the system is:
>>
>> > We do not want to be hitting the database for every resource
>>
> To be clear this essentially means that we will not be able to cache any
> page with a resource count on, or a watch count on.

I don't understand; you might be able to serve a hitcount from memory,
but if you want it to be fresh for the user you're still requiring a
server hit to show the page?  Serving from memory only prevents a
database or disk hit which isn't necessarily the biggest performance
hit, especially for a simple select query.  Instead we could, for
example, use varnish extensively and issue purge commands from our
backend when relevant data changes, or just live with stale data for
10 minutes at a time....?
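
To make those two options concrete, here is a minimal sketch of both in one place: pages expire after a fixed lifetime (the "live with stale data for 10 minutes" option) and the backend can also purge a page explicitly when its data changes. It is an in-process stand-in, not Varnish itself; the PageCache class, its method names and the example path are all made up for illustration.

```python
import time


class PageCache:
    """In-process stand-in for an HTTP accelerator (hypothetical, not Varnish)."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self.ttl = ttl_seconds      # 600 s = the "stale for 10 minutes" option
        self.clock = clock          # injectable so the expiry can be tested
        self._store = {}            # path -> (expires_at, body)

    def get(self, path, render):
        """Return the cached body for path; call render() only on a miss."""
        now = self.clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]
        body = render()
        self._store[path] = (now + self.ttl, body)
        return body

    def purge(self, path):
        """Drop one page, e.g. when the backend knows its data changed."""
        self._store.pop(path, None)
```

With a real accelerator the get() half is done for you and purge() becomes a request from the backend to the cache, which the cache has to be configured to accept.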

> As far as I am concerned, the decision comes down to...  Do we want to use
> redis for caching in general?

How does storing certain non-critical write operations in memory (and
optionally persisting them) map to caching in general?
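
For reference, this is roughly what I understand by "storing non-critical writes in memory and optionally persisting them": count hits in memory and write them out in batches. The sketch below uses a plain Counter where redis would sit and sqlite3 where postgres would sit, purely so that it runs on its own; HitBuffer, the resource_hits table and the flush threshold are all invented for the example.

```python
import sqlite3
from collections import Counter


class HitBuffer:
    """Buffer hit counts in memory and persist them in batches (a sketch)."""

    def __init__(self, conn, flush_every=100):
        self.conn = conn
        self.flush_every = flush_every
        self.pending = Counter()    # resource_id -> hits not yet persisted
        self.unflushed = 0
        conn.execute(
            "CREATE TABLE IF NOT EXISTS resource_hits ("
            "resource_id TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
        )

    def record(self, resource_id):
        """Register one hit; touches the database only every N hits."""
        self.pending[resource_id] += 1
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.flush()

    def flush(self):
        """Add all pending counts to the table in a single transaction."""
        with self.conn:
            for resource_id, n in self.pending.items():
                self.conn.execute(
                    "INSERT OR IGNORE INTO resource_hits VALUES (?, 0)",
                    (resource_id,),
                )
                self.conn.execute(
                    "UPDATE resource_hits SET hits = hits + ? "
                    "WHERE resource_id = ?",
                    (n, resource_id),
                )
        self.pending.clear()
        self.unflushed = 0
```

Anything still in the buffer is lost if the process dies before a flush, which is the trade-off implied by calling these writes non-critical.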

Ahhhh.... I think maybe I'm confused because when you talk about
caching in general, you mean data-query-caching-in-general, whereas
I'm thinking about all kinds of caching... perhaps...!  Are you
proposing some kind of redis-backed, generalised database query cache,
or just using an available redis on an ad-hoc basis for caching?

> If we don't want it for caching then definitely not worth it.
>
> If we do use it for caching then it may be worthwhile adding it into the
> core as an optional config option, so the plugins can easily share the same
> redis instance.
>
> Even if we do use it for caching we may not want to add it as a config
> option and let plugins use a redis store without any central management.
> This is the road we are likely to take.

How does this statement fit with "if we don't want [redis] for caching
then definitely not worth it"?

I have a feeling I am being a bit slow here, please bear with me :)

> So the whole thing could be a non issue.
>
>>
>> Why not?  It's not like CKAN instances are sites that need to
>> massively scale.  I am sure we can easily accommodate the levels we
>> typically need with postgres.
>>
>> The other justification is that it's somehow simpler not to use SQL,
>> of which I'm not convinced, at least not when we already have to deal
>> with it anyway.
>
>
> The only thing I do not want is plugins making their own tables/columns in
> the main database as this complicates migrations massively.   This may be
> possible, but would take a lot of thought about dependency issues, to do
> well.

I suppose I've not given any thought to plugins yet myself, hence
perhaps naively thinking a "create table" operation to store some
statistics data isn't such a big deal.

Now I think about it from that perspective, I can feel myself coming
round to the idea that a plugin should use whatever it likes for
storage, but it should follow certain conventions and use a standard
API for joining to the main database. The potential issues with
tightly coupling the plugin storage to the core storage are fairly
self-evident.

>> That said, just to be clear: I'm not really bothered either way :)
>
> As there is no consensus and even I am not sure, I think that a scientific
> approach is best.   I am attempting to run some load testing, to see where
> the bottlenecks are, and maybe this will make a decision clearer.  More
> importantly, hopefully this can shed some light on the downtime experienced
> last week.

I'm really happy you're looking into that :)   But it seems orthogonal
to this?  Say your profiling picks out two or three slow DB queries or
python algorithms as big culprits; they could be whacked with any
number of strategies.   But however we whack them, we still have a
question about what kind of data storage policy to follow for plugins.

(I have a personal preference for caching as close to the browser as
possible, i.e. starting with the browser cache, then proxy caches,
then an accelerator, ending with things like memcached or redis.
Perhaps that's just from habit; my belief is that usually these are
the biggest wins and follow the principle of least surprise, but it's
quite possibly a faulty one :)
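
A minimal sketch of what I mean by starting at the browser end: the application just declares how long a response may be reused, and the browser, any proxies and the accelerator do the rest. The WSGI app, its body and the lifetimes below are placeholders, not anything from CKAN.

```python
def cached_app(environ, start_response):
    """Tiny WSGI app whose response may be reused by browsers and proxies."""
    body = b"package listing"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
        # max-age applies to any cache, including the browser;
        # s-maxage overrides it for shared caches (proxies, accelerators).
        ("Cache-Control", "public, max-age=60, s-maxage=600"),
    ]
    start_response("200 OK", headers)
    return [body]
```

The numbers are arbitrary; the point is only that nothing behind the application needs to change for these layers to start absorbing requests.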

Seb

>> On 31 January 2011 09:41, David Raznick <kindly at gmail.com> wrote:
>> > On Mon, Jan 31, 2011 at 9:00 AM, Seb Bacon <seb.bacon at gmail.com> wrote:
>> >>
>> >> On 30 January 2011 11:53, David Raznick <kindly at gmail.com> wrote:
>> >> >> Seb said
>> >> >>As a general point I am no fan of SQL databases
>> >> >
>> >> > Funnily enough, I am a big fan of SQL databases.  I just do not like
>> >> > them abused.  I like the way the schema gives you an implicit model
>> >> > of your data, that it's got rock-solid durability and that they can
>> >> > be queried easily with a well-established standard.  I think this is
>> >> > very important for valuable data.
>> >>
>> >> Careful there with my context :)  I said "no fan... *in our webby
>> >> world*".
>> >>
>> >> In a context where rock-solid durability and high levels of querying
>> >> are required, I think they're great :)  And arguably this is the case
>> >> for our package catalogue.  But not really for "I like this"
>> >> applications.  That's what I was trying to say.  Kind of agreeing with
>> >> you, but asking if it's worth introducing a new database for.
>> >>
>> > I was agreeing with you too :).  Just wanted to make sure that I did not
>> > come across as having a secret plan to try and move everything over to
>> > redis.
>> >
>> >>
>> >> > The two questions for me are.
>> >> >
>> >> > 1. Will this increase complexity of the system or simplify it?
>> >> >
>> >> > For me it simplifies it.  Redis is no harder to set up than, say,
>> >> > memcached.  It's *much* easier than something like RabbitMQ.
>> >>
>> >> Just because we have something quite hard to set up already in our
>> >> system, doesn't mean adding another thing will simplify it, however
>> >> easy it is to set up.
>> >>
>> >> I take the general point that we already have loads of dependencies.
>> >> And I don't think that a concern to reduce them should be a limiting
>> >> factor in a decision on what technology to use.  But I do think we
>> >> need to be a little bit wary about ensuring our software is easy to
>> >> understand and deploy.
>> >>
>> >> > 2.  Do we need a new solution to caching or storing semi-valuable
>> >> > data in a fast way?
>> >> >
>> >> > I think we do.  I do not see this as a new database; I see it as
>> >> > memcached with some persistence.
>> >>
>> >> One thought: will we need to join data across redis and postgres?
>> >>
>> > Yes, but the joins can be emulated simply.
>> >
>> > Redis stores lists/sets against keys, i.e. key: [package1_id,
>> > package2_id, package3_id].
>> > So I can't imagine a case where a simple "select * from package where
>> > package_id in (package1_id, package2_id, package3_id)" will not
>> > suffice to emulate a join with no performance issues.
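
A rough sketch of the emulated join described above, under some loud assumptions: a plain Python set stands in for the ids read from a redis set (with the redis-py client that would presumably be an smembers call), sqlite3 stands in for postgres, and the package table with its package_id and name columns is a made-up schema for the example.

```python
import sqlite3


def fetch_packages(conn, package_ids):
    """Fetch the package rows whose ids appear in package_ids."""
    ids = sorted(package_ids)
    if not ids:
        # Avoid generating an empty "IN ()" list, which PostgreSQL rejects.
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT package_id, name FROM package "
        "WHERE package_id IN (%s) ORDER BY package_id" % placeholders
    )
    return conn.execute(sql, ids).fetchall()


# Stand-in data: three packages in the relational store ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (package_id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO package VALUES (?, ?)",
    [("p1", "alpha"), ("p2", "beta"), ("p3", "gamma")],
)
# ... and a set of ids as it might come back from the key-value store.
liked_ids = {"p3", "p1"}
rows = fetch_packages(conn, liked_ids)
```

The ids are passed as bound parameters rather than pasted into the SQL string, so the only thing interpolated is the list of placeholders.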
>> >
>> >>
>> >> Seb
>> >>
>> >> _______________________________________________
>> >> ckan-dev mailing list
>> >> ckan-dev at lists.okfn.org
>> >> http://lists.okfn.org/mailman/listinfo/ckan-dev
>> >
>> >

-- 
skype: seb.bacon
mobile: 07790 939224
land: 0207 183 9618
web: http://baconconsulting.co.uk