[ckan-dev] Views on data through the web

David Raznick kindly at gmail.com
Fri Jun 17 12:24:10 UTC 2011


On Thu, Jun 16, 2011 at 11:16 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> On 16 June 2011 00:25, Francis Irving <francis at scraperwiki.com> wrote:
> > Yep, I hope we can share the SQL data server.
> >
> > I think the core of it is the same for each of us - and one day we'll
> > both want multiple server scaling of some sort, and no doubt lots of
> > other things.
> >
> > It would make sense for it to be a separate open source project.
>
> Agreed. I took a look at the code recently (I think it was
> scraperlibs/scraperwiki/sqlite.py and datastore.py in same directory).
> Not sure I had a good understanding of what was going on though. How
> would we go about collaborating on turning this into a standalone lib
> (if we all wanted to do this)?
>
> > Quite surprising that we've both settled on SQLite for this...
> > (Everybody else busy talking about CouchDB and MongoDB...)
>
> It's lightweight and portable and you can get it everywhere - plus we
> want lots of little dbs not one mega-db. (Plus I've used couchdb and
> mongodb and I'm still rather dubious of benefits over tried and tested
> sql ...)
>
When looking at this I thought SQLite was the natural choice. However, I
have several reservations compared to using something like Postgres, and
I think using SQLite in the long term will cause problems:

* No connection pooling (it was invented for a reason): you have to open
a database file from the filesystem and set up a new connection every
time. (See the pooling sketch after this list.)
* Not too sure about security: SQLite has no built-in security, and I
imagine there is a way to break out to the filesystem if you allow
arbitrary SQL.
* No decent rate limiting or query timeouts. (The pooling sketch below
also shows Postgres's statement_timeout.)
* You have to roll your own client-server architecture anyway, which is
bound to be inferior.
* Concurrent access is most likely slower.
* You can't get two processes to insert data into the same database at
the same time, so two scrapers writing at once is problematic. (See the
locking sketch below.)
* Its SQL dialect is much more limited for interesting analysis: things
like PostGIS, XML support, window functions... bla bla bla.
* No projects like HadoopDB, which uses clustered Postgres instances for
large-scale data analysis.
* I think that dumps should be the standard unit of data (they are
essentially well-thought-through CSV). COPY in Postgres is *very fast*,
roughly a quarter of a million rows/sec. (Sketched below.)
* SQLite is not so fast for large datasets or tables.
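
To make the pooling and timeout points concrete, here is a minimal
sketch of the Postgres side, assuming psycopg2 and a local server (the
"datastore" database, "ckan" user and "some_table" table are made-up
names):

    import psycopg2
    from psycopg2 import pool

    # A small pool of long-lived connections to a hypothetical
    # "datastore" database -- set up once, instead of opening an SQLite
    # file and a fresh connection per request.
    conn_pool = pool.SimpleConnectionPool(
        1, 10, "dbname=datastore user=ckan host=localhost")

    conn = conn_pool.getconn()
    try:
        cur = conn.cursor()
        # Kill any query that runs longer than five seconds -- a crude
        # query timeout that SQLite has no server-side equivalent for.
        cur.execute("SET statement_timeout = '5s'")
        cur.execute("SELECT count(*) FROM some_table")
        print(cur.fetchone()[0])
    finally:
        conn_pool.putconn(conn)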
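
The two-writers problem is easy to demonstrate with nothing but the
standard library sqlite3 module (file and table names are made up):

    import sqlite3

    # First "scraper" starts a write transaction and holds the write
    # lock until it commits.
    conn1 = sqlite3.connect("scraper.db")
    conn1.execute("CREATE TABLE IF NOT EXISTS log (msg TEXT)")
    conn1.execute("INSERT INTO log VALUES ('first writer')")

    # Second "scraper"; timeout=0 makes it fail at once rather than
    # retry for the default five seconds.
    conn2 = sqlite3.connect("scraper.db", timeout=0)
    try:
        conn2.execute("INSERT INTO log VALUES ('second writer')")
    except sqlite3.OperationalError as e:
        print(e)  # "database is locked" -- the second writer is shut out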
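
And on the COPY point, a sketch of loading a tab-separated dump with
psycopg2 (again with made-up names); a single COPY stream is what gets
you into the hundreds of thousands of rows per second, versus row-by-row
INSERTs:

    import psycopg2

    conn = psycopg2.connect("dbname=datastore user=ckan")
    cur = conn.cursor()
    # Stream the whole dump file straight into the table in one COPY.
    with open("dump.tsv") as f:
        cur.copy_from(f, "some_table", sep="\t")
    conn.commit()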

The big downside of using PostgreSQL is that you will need someone who
knows how to set it up correctly for this kind of workload. It may be
worth it though.

David