[wdmmg-dev] openspending.org back up

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Sat Mar 19 16:34:18 UTC 2011


Hi all,

On Sat, Mar 19, 2011 at 1:23 AM, Stefan Wehrmeyer
<stefanwehrmeyer at gmail.com> wrote:
> On 19.03.2011, at 00:30 , Carsten Senger wrote:
>> Today openspending.org was down most of the time. The machine was
>> idle most of the time, but almost all connections stalled shortly
>> after every apache restart.

that's weird, we used the site quite a bit at the workshop and it
seemed to be working (albeit not fast). It is also worth mentioning
that I re-loaded fts twice during that day (and built the solr index
out of it). This means each insert effectively had to be written
(number of indexes + 1) times during the loading process, so things
get really slow.
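
A quick way to see the write amplification we're paying (a rough
sketch, assuming pymongo and the usual "openspending" db with an
"entry" collection):

    from pymongo import Connection

    db = Connection().openspending
    indexes = db.entry.index_information()
    # every insert touches the document itself plus each index
    print "%d indexes -> ~%d writes per insert" % (
        len(indexes), len(indexes) + 1)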

>> Stefan struggled to find the cause of it. Later Rufus found out
>> that the cause was enormous mongodb query response times for some of
>> the queries.

The worst queries we have are those coming from wdmmg-legacy: they
have to do a full collection scan on complex types for every request
and cannot be scoped to a specific dataset (i.e. if I ask "what values
are known for cofog1", the distinct() call will include all the other
datasets which do not have cofog1). This could hackishly be solved by
pointing the flash at per-dataset subdomains (which we need to enable
anyway) or by going into the flash to add a dataset parameter.
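
To illustrate the difference a dataset parameter would make (field
names assumed, pymongo-style):

    from pymongo import Connection

    db = Connection().openspending

    # what wdmmg-legacy effectively triggers now: distinct() across
    # every dataset, whether or not it has cofog1 at all
    db.entry.distinct("cofog1.name")

    # with a dataset parameter the scan is limited to matching entries
    # (and can use an index on dataset.name)
    db.entry.find({"dataset.name": "fts"}).distinct("cofog1.name")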

>> We suspected that it's the amount of data in mongodb. From the mongo stats
>> we had 1.4 million objects in the db, 8 GB of stored data and 12 GB of
>> allocated disk space. So we decided to drop data and see what happens.
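
(For reference, those numbers come straight out of the dbstats
command; a quick sketch, connection details assumed:)

    from pymongo import Connection

    db = Connection().openspending
    stats = db.command("dbstats")
    for key in ("objects", "dataSize", "storageSize", "indexSize"):
        print key, stats.get(key)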

I must admit I don't like the strategy - it's clear the machine is
struggling with memory (htop shows that), but it's really weird for us
to have datasets that disappear sometimes (e.g. I wanted to show
someone fts and it wasn't there...). But I accept it's probably an
emergency option.

> So it sounds like we need a scaling plan, because in the end we want the data to be available.
> I think we need to figure out exactly what kind of queries we want to support, have these indexes ready and forbid other queries.
> It would be a good idea to abstract all read queries into class methods and stop calling .find arbitrarily.
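
(To make that concrete, I read the suggestion as something like the
sketch below - class and method names invented:)

    from pymongo import Connection

    db = Connection().openspending

    class EntryQueries(object):
        """All reads on the entries collection go through here, so we
        know exactly which queries - and thus which indexes - exist."""

        @classmethod
        def by_dataset(cls, dataset_name):
            # backed by an index on dataset.name
            return db.entry.find({"dataset.name": dataset_name})

        @classmethod
        def distinct_values(cls, dataset_name, field):
            return cls.by_dataset(dataset_name).distinct(field)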

I wouldn't start refactoring before having a clear picture of what's
going on - i.e. record mongostat output and query times for several
database sizes. If memory is the issue (as I suspect - we have likely
grown above 3 GB of indexes, which means they get swapped to disk),
then no amount of refactoring will solve that. In that case we may
want to reduce the number of indexes we keep and also think about
64-bit amounts of memory.
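
Getting that picture is cheap - explain() already tells us whether a
query hits an index and how many documents it scans (a sketch, query
values illustrative):

    from pymongo import Connection

    db = Connection().openspending

    def profile(query):
        plan = db.entry.find(query).explain()
        # "BasicCursor" means a full scan, "BtreeCursor ..." an index
        print query, plan["cursor"], plan["nscanned"], plan["millis"]

    profile({"dataset.name": "fts"})

Running this against the growing database (with mongostat alongside)
should tell us whether we are actually disk-bound.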

We may also want to limit the denormalization of data we do on the
entries collection, i.e. only store "id", "name", "ref" and "label" on
the embedded copies of dataset, entry and entity documents. Looking at
an entry from FTS, that would cut the collection size roughly in half,
and we could recover most of the lost information via dereference()
and solr as needed.
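
The loader change would be tiny - something like this (helper name
invented):

    def shallow(doc, keep=("_id", "name", "ref", "label")):
        """Strip an embedded copy down to the keys we actually query on."""
        return dict((k, doc[k]) for k in keep if k in doc)

    # at load time, embed shallow(dataset_doc) instead of the full
    # dataset document; same for entity and entry references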

As a final note, we could consider a different model for
pre-aggregation where we store much less data (i.e. the output of
mongodb's own aggregation functions) and don't materialize
aggregations in the full way we do at the moment.
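
That could be as simple as one map/reduce run per dataset and
classifier, stored into a small collection of its own (a sketch -
collection and field names are illustrative):

    from pymongo import Connection
    from bson.code import Code

    db = Connection().openspending

    # sum up entry amounts per cofog1 value for a single dataset and
    # store the (much smaller) result in its own collection
    db.entry.map_reduce(
        Code("function () { emit(this.cofog1.name, this.amount); }"),
        Code("function (key, vals) { return Array.sum(vals); }"),
        out="agg_fts_cofog1",
        query={"dataset.name": "fts"})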

>  A proper robots.txt file that disallows /api pages would help; I created a ticket for that:
> http://trac.openspending.org/ticket/50

Cool.

- Friedrich
