[ckan-dev] key value store, caching and redis

David Raznick kindly at gmail.com
Tue Feb 1 22:24:14 UTC 2011


On Tue, Feb 1, 2011 at 2:40 PM, Seb Bacon <seb.bacon at gmail.com> wrote:

> On 1 February 2011 13:17, David Raznick <kindly at gmail.com> wrote:
> >
> >
> > On Tue, Feb 1, 2011 at 11:43 AM, Seb Bacon <seb.bacon at gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I suspect this thread is now a bit stale and the decision has been
> >> taken,
> >
> > I hope it hasn't because I am not even sure about it :)
> >
> >>
> >> but just in the spirit of being devil's advocate...  It seems
> >> that the principle justification for introducing another dependency
> >> into the system is:
> >>
> >> > We do not want to be hitting the database for every resource
> >>
> > To be clear this essentially means that we will not be able to cache any
> > page with a resource count on, or a watch count on.
>
> I don't understand; you might be able to serve a hitcount from memory,
> but if you want it to be fresh for the user you're still requiring a
> server hit to show the page?  Serving from memory only prevents a
> database or disk hit which isn't necessarily the biggest performance
> hit, especially for a simple select query.  Instead we could, for
> example, use varnish extensively and issue purge commands from our
> backend when relevant data changes, or just live with stale data for
> 10 minutes at a time....?
>

There are lots of options.  This is exactly the kind of feedback that I
wanted.

You could be right about it not being a big performance hit, however
sqlalchemy adds to the baggage.  Getting some more facts about this is what
the profiling is for.

Varnish could be a good solution, although I would argue that adding varnish
adds more complexity to the system.

I do not like keeping stale data for things like a watch plugin, as it would
be confusing to the user (after watching something, the number would not
change).

I would also not like it if someone edited a package and it looked like it
had not been changed.

More importantly I think an api call should be able to update a record and
instantly receive what it had just changed.  This seems like a necessity if
we link directly to DGU.

Cache invalidation is the age-old difficult problem.
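
To make the read-after-write requirement concrete, here is a minimal sketch
of a read-through cache that drops an entry whenever the record behind it is
written.  Everything in it is an assumption for illustration: the class name,
the plain dicts standing in for the database and for redis/memcached, and the
"package-1" key.  It is not how ckan does anything today.

```python
class InvalidatingCache:
    """Read-through cache that drops an entry whenever its record is written."""

    def __init__(self, store):
        self.store = store      # hypothetical backing store (a plain dict here)
        self.cache = {}         # stands in for redis/memcached
        self.store_reads = 0    # how often a read fell through to the store

    def get(self, key):
        # Serve from the cache when possible, otherwise load and remember.
        if key not in self.cache:
            self.store_reads += 1
            self.cache[key] = self.store[key]
        return self.cache[key]

    def update(self, key, value):
        # Write to the store first, then invalidate, so the next read
        # (e.g. the api call that made the change) sees the new value.
        self.store[key] = value
        self.cache.pop(key, None)


db = {"package-1": {"title": "old title"}}
cache = InvalidatingCache(db)

first = cache.get("package-1")    # falls through to the store
cached = cache.get("package-1")   # served from the cache
cache.update("package-1", {"title": "new title"})
after = cache.get("package-1")    # falls through again, sees the update
```

Because the invalidation lives in the application, a test can assert directly
that a read following a write returns the new value.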


> > As far as I am concerned, the decision comes down to...  Do we want to
> use
> > redis for caching in general?
>
> How does storing certain non-critical write operations in memory (and
> optionally persisting them) map to caching in general?
>

My opinion is that I do not want to add to the complexity of the system too
much.  If redis was only used for storing non-critical write operations, it
would not be worth adding because of the increased complexity.  If it had a
dual purpose, it most definitely would.


>
> Ahhhh.... I think maybe I'm confused because when you talk about
> caching in general, you mean data-query-caching-in-general, whereas
> I'm thinking about all kinds of caching... perhaps...!  Are you
> proposing some kind of redis-backed, generalised database query cache,
> or just using an available redis on an ad-hoc basis for caching?
>
I am thinking about all types of caching (except browser cache), you were
right to begin with.


> > If we don't want it for caching then definitely not worth it.
> >
> > If we do use it for caching then it may be worthwhile adding it into the
> > core as an optional config option, so the plugins can easily share the
> same
> > redis instance.
> >
> > Even if we do use it for caching we may not want to add it as a config
> > option and let plugins use a redis store without any central management.
> > This is the road we are likely to take.
>
> How does this statement fit with "if we don't want [redis] for caching
> then definitely not worth it"?
>

It doesn't.  I was just running through the possibilities of what I think we
could do.
To be more consistent I should have said in the first statement "definitely
not worth adding it to the core if it's not going to be used for caching".

>
> I have a feeling I am being a bit slow here, please bear with me :)
>
> > So the whole thing could be a non-issue.
> >
> >>
> >> Why not?  It's not like CKAN instances are sites that need to
> >> massively scale.  I am sure we can easily accommodate the levels we
> >> typically need with postgres.
> >>
> >> The other justification is that it's somehow simpler not to use SQL,
> >> of which I'm not convinced, at least not when we already have to deal
> >> with it anyway.
> >
> >
> > The only thing I do not want is plugins making their own tables/columns
> in
> > the main database as this complicates migrations massively.   This may be
> > possible, but would take a lot of thought about dependency issues, to do
> > well.
>
> I suppose I've not given any thought to plugins yet myself, hence
> perhaps naively thinking a "create table" operation to store some
> statistics data isn't such a big deal.
>
> Now I think about it from that perspective, I can feel myself coming
> round to the idea that a plugin should use whatever it likes for
> storage, but it should follow certain conventions and use a standard
> API for joining to the main database. The potential issues with
> tightly coupling the plugin storage to the core storage are fairly
> self-evident.
>

> >> That said, just to be clear: I'm not really bothered either way :)
> >
> > As there is no consensus and even I am not sure, I think that a
> scientific
> > approach is best.   I am attempting to run some load testing, to see
> where
> > the bottlenecks are, and maybe this will make a decision clearer.  More
> > importantly, hopefully this can shed some light on the downtime
> experienced
> > last week.
>
> I'm really happy you're looking into that :)   But it seems orthogonal
> to this?  Say your profiling picks out two or three slow DB queries or
> python algorithms as big culprits; they could be whacked with any
> number of strategies.   But however we whack them, we still have a
> question about what kind of data storage policy to follow for plugins.
>


I see redis as a good tool both for general caching and for k/v storage of
not-so-valuable data.  Plugins generally want this type of storage, as the
use cases suggest, especially as it's good at atomic counters.
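
As a rough sketch of the atomic counter use case, something like the
following would do for a watch count.  The key naming, the function names and
the in-memory stand-in are all assumptions made up for illustration; the
stand-in only mimics incr() and get(), which are the two calls the redis-py
client would be asked for if a real redis instance were plugged in instead.

```python
import threading


class InMemoryCounterStore:
    """Hypothetical stand-in for a redis client; only mimics incr() and get()."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, name, amount=1):
        # The lock makes read-modify-write a single step, which is the
        # property redis gives you for INCR.
        with self._lock:
            self._data[name] = self._data.get(name, 0) + amount
            return self._data[name]

    def get(self, name):
        with self._lock:
            return self._data.get(name)


def watch_key(package_id):
    # Assumed key layout, not an existing ckan convention.
    return "watch_count:%s" % package_id


def record_watch(client, package_id):
    """Atomically bump the watch count and return the new value."""
    return client.incr(watch_key(package_id))


def watch_count(client, package_id):
    """Current watch count, or 0 if nobody has watched the package yet."""
    value = client.get(watch_key(package_id))
    return int(value) if value is not None else 0


store = InMemoryCounterStore()
threads = [
    threading.Thread(target=record_watch, args=(store, "package-1"))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = watch_count(store, "package-1")
```

No increment is lost even with 50 concurrent writers, and no table or
migration is needed in the main database.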

I may be wrong about lumping them together as the same thing.


>
> (I have a personal preference to caching as close to the browser as
> possible, i.e. starting with the browser cache, then proxy caches,
> then an accelerator, ending with things like memcached or redis.
> Perhaps that's just from habit; my belief is that usually these are
> the biggest wins and follow the principle of least surprise, but it's
> quite possibly a faulty one :)
>

For high volume, fairly static sites, or where cache invalidation is
straightforward, this approach is definitely the way to go.

I do not think it is correct though for projects like ckan, where you want
fairly complicated data manipulation, you have specialist users and they
want instant feedback.

I think you want to provably know the user has the latest information and
you want this to be testable.  You especially want to know that api calls
will be up to date.

Having said that, this may be possible with varnish and it will be faster,
but I fear you will have less control, be less easily testable and have more
complexity.

Nonetheless, I think I may be a bit of a control freak as I like the
application knowing everything that's going on.

David

>
> Seb
>

>
>
> >> On 31 January 2011 09:41, David Raznick <kindly at gmail.com> wrote:
> >> > On Mon, Jan 31, 2011 at 9:00 AM, Seb Bacon <seb.bacon at gmail.com>
> wrote:
> >> >>
> >> >> On 30 January 2011 11:53, David Raznick <kindly at gmail.com> wrote:
> >> >> >> Seb said
> >> >> >>As a general point I am no fan of SQL databases
> >> >> >
> >> >> > Funnily enough, I am a big fan of sql databases.  I just do not like them
> >> >> > abused.   I like the way the schema gives you an implicit model
> >> >> > of
> >> >> > your
> >> >> > data, that it's got rock solid durability and that they can be
> queried
> >> >> > easily
> >> >> > with a well established standard.  I think this is very important
> for
> >> >> > valuable data.
> >> >>
> >> >> Careful there with my context :)  I said "no fan... *in our webby
> >> >> world*".
> >> >>
> >> >> In a context where rock-solid durability and high levels of querying
> >> >> are required, I think they're great :)  And arguably this is the case
> >> >> for our package catalogue.  But not really for "I like this"
> >> >> applications.  That's what I was trying to say.  Kind of agreeing
> with
> >> >> you, but asking if it's worth introducing a new database for.
> >> >>
> >> > I was agreeing with you too :).  Just wanted to make sure that I did
> not
> >> > come across as having a secret plan to try and move everything over to
> >> > redis.
> >> >
> >> >>
> >> >> > The two questions for me are.
> >> >> >
> >> >> > 1. Will this increase complexity of the system or simplify it?
> >> >> >
> >> >> > For me it simplifies it.  Redis is no harder to set up than say
> >> >> > memcached.
> >> >> > It's *much* easier than something like rabbitmq.
> >> >>
> >> >> Just because we have something quite hard to set up already in our
> >> >> system, doesn't mean adding another thing will simplify it, however
> >> >> easy it is to set up.
> >> >>
> >> >> I take the general point that we already have loads of dependencies.
> >> >> And I don't think that a concern to reduce them should be a limiting
> >> >> factor in a decision on what technology to use.  But I do think we
> >> >> need to be a little bit wary about ensuring our software is easy to
> >> >> understand and deploy.
> >> >>
> >> >> > 2.  Do we need a new solution to caching or storing semi-valuable
> >> >> > data
> >> >> > in a
> >> >> > fast way?
> >> >> >
> >> >> > I think we do.  I do not see this as a new database, I see it as
> >> >> > memcached
> >> >> > with some persistence.
> >> >>
> >> >> One thought: will we need to join data across redis and postgres?
> >> >>
> >> > Yes, but the joins can be emulated simply.
> >> >
> >> > Redis stores lists/sets against keys. i.e   key: [package1_id,
> >> > package2_id,
> >> > package3_id].
> >> > So I can't imagine a case where a simple   "select * from package where
> >> > package_id in (package1_id, package2_id, package3_id)"  will not
> suffice
> >> > to
> >> > emulate a join with no performance issues.
> >> >
> >> >>
> >> >> Seb
> >> >>
> >> >> _______________________________________________
> >> >> ckan-dev mailing list
> >> >> ckan-dev at lists.okfn.org
> >> >> http://lists.okfn.org/mailman/listinfo/ckan-dev
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> skype: seb.bacon
> >> mobile: 07790 939224
> >> land: 0207 183 9618
> >> web: http://baconconsulting.co.uk
> >
> >
>
>
>
> --
> skype: seb.bacon
> mobile: 07790 939224
> land: 0207 183 9618
> web: http://baconconsulting.co.uk
>