[ckan-dev] Views on data through the web

David Raznick kindly at gmail.com
Sat Jun 18 17:24:33 UTC 2011


On Sat, Jun 18, 2011 at 12:16 PM, Francis Irving <francis at scraperwiki.com> wrote:

> On Fri, Jun 17, 2011 at 01:24:10PM +0100, David Raznick wrote:
> > When looking at this I thought sqlite was the natural choice.  However, I
> > have got several reservations compared to using something like postgres.
> > I think using sqlite in the long term will cause problems.
>
> Me too - I argued with Julian for months about it.
>
> The situation is much better than is at first apparent.
>
> I can't answer for CKAN obviously, as I don't know what your exact use
> case is. But for ScraperWiki I can describe the tradeoffs.
>
> > * No connection pooling (it was invented for a purpose). You have to
> > select a database from the file system and set up a connection each time.
>
> Yes, or reimplement connection pooling. Details of how SQLite allocates
> memory are here:
>
> http://www.sqlite.org/malloc.html
>
> > * Not too sure about security, it has no built-in security and I imagine
> > there is a way to break out to the filesystem if you allow arbitrary sql.
>
> I've never researched this, but my hunch would be that it is more
> secure than PostgreSQL for the purpose of running arbitrary SQL.
>

> That's because it is used on most new mobile phones, where you can,
> say, write arbitrary SQLite commands in HTML5 from any web page that
> you browse.
>
> i.e. I think that any such bug in SQLite would let you jailbreak an
> iPhone.
>
You are right here, I think that sqlite will be more secure for arbitrary
queries, mainly as it's got a much simpler model.
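
(For what it's worth, a minimal sketch of that simpler model using Python's
stdlib sqlite3 module - the datastore.db file name is just an example: open
the database read-only and install an authorizer so that arbitrary SQL can
only read, never write or ATTACH other files.)

    import sqlite3

    # Hypothetical datastore file, opened read-only via a URI.
    conn = sqlite3.connect("file:datastore.db?mode=ro", uri=True)

    # Allow plain SELECTs and column reads; deny everything else
    # (INSERT/UPDATE, PRAGMA, ATTACH, ...).
    ALLOWED = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}

    def authorizer(action, arg1, arg2, db_name, trigger):
        return sqlite3.SQLITE_OK if action in ALLOWED else sqlite3.SQLITE_DENY

    conn.set_authorizer(authorizer)
    print(conn.execute("SELECT 1").fetchone())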

> It also has more test code than nearly any other application (75% of the
> code)
> http://www.sqlite.org/testing.html
>
> And yes, it does have "built in security". You can set permissions
> when you open a database, e.g. to make it read only.
>
> > * No decent rate limiting or query timeouts.
> > * You have to roll your own client server architecture anyway, which is
> > bound to be inferior.
> > * Concurrent access most likely slower.
>
> Yep - agreed, and a major reason I opposed us using SQLite.
> ScraperWiki's use case is more unusual though than it at first
> appears.
>
> On ScraperWiki, we let people attach to multiple databases to read from
> them at once.
>
> As far as I know, neither PostgreSQL nor MySQL support that mode in a
> cluster setup. i.e. If I shared databases between servers, I can't
> connect separately to each server and do a join between them.
>

Can you do this with sqlite?  I know you can attach (using ATTACH) a
database to another that is on the same server, but I am not sure you can do
it across a network.  I imagine you would have to copy the data across.
In that case postgres loading from a postgres dump would be almost as good.
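
(To illustrate the same-machine case, a minimal sketch with Python's stdlib
sqlite3 module; the file names and the items tables are made up. ATTACH takes
a local file path, not a network address, and you can then join across the
two files.)

    import sqlite3

    conn = sqlite3.connect("main.db")
    # ATTACH only works on files the process can see on its own filesystem.
    conn.execute("ATTACH DATABASE 'other.db' AS other")
    rows = conn.execute(
        "SELECT m.id, o.value "
        "FROM main.items AS m JOIN other.items AS o ON o.id = m.id"
    ).fetchall()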


> (e.g. see this question about MySQL on StackOverflow
>
> http://stackoverflow.com/questions/3582621/php-mysql-table-join-from-different-host)
>
> That means when we reach that level of scaling, we'll have to write
> our own code anyway to handle the pooling.
>

Code that pools across servers would be needed in both cases anyway, and would
be different from the per-server pooling, I imagine?

There is also dblink for postgres, which means you can do joins across
databases in some way.  It's not great though, and you would have to do a lot
of work to make it transparent to the user.
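
(A rough sketch of what dblink looks like in practice, run through psycopg2;
the table names and connection details are invented, and it assumes the dblink
contrib module is already installed in the local database. The remote query's
column types have to be spelled out by hand, which is part of why it is hard
to make transparent.)

    import psycopg2

    conn = psycopg2.connect("dbname=maindb")  # hypothetical local database
    cur = conn.cursor()
    cur.execute("""
        SELECT l.id, r.value
        FROM local_table AS l
        JOIN dblink('dbname=otherdb host=otherhost',
                    'SELECT id, value FROM remote_table')
             AS r(id integer, value text)
          ON r.id = l.id
    """)
    print(cur.fetchall())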


>
> > * Can't get two processes to insert data at the same time in the same
> > database.  (so two processes scraping at the same time is problematic)
>
> For our use at ScraperWiki that is a feature. We have a deliberate
> design constraint that scrapers should write only to their own
> datastore.
>
> There are some downsides to that, but there are also lots of
> advantages in terms of simplicity, and provenance. You know for sure
> what code and what websites the data in a particular data store came from.
>
> > * sql query language is much more limited for interesting analysis.
> > Things like postgis, xml, window functions ... bla bla bla
>
> > * No projects like hadoopdb that use clustered postgres instances for
> > large data analysis.
>
> I think data large enough to need Hadoop is fundamentally different.
>
> > * I think that dumps should be the standard unit of data (they are
> > essentially well thought through csv).  Using COPY in postgres is *very
> > fast* ~ 1/4 mil rows/sec.
> > * Not so fast for large datasets or tables.
>
> > The big downside of using postgresql is that you will need someone that
> > knows how to set it up correctly for this kind of workload.  It may be
> > worth it though.
>
> Umm, by the time you have enough load that you need somebody who knows
> how to set it up, you'll be able to find / afford one, I'd expect!
>
> A big unspoken thing is that most unanalysed public datasets are
> *tiny* by modern IT standards. That is a new change that has crept up on
> us over the last 10 years. But it is very significant.
>

It is.  My concern is more about the richness of the tools rather than the
size.  If I could build anything I wanted at the moment it would be a tool
like Google Refine but with a postgresql backend.  The contrib extensions to
postgresql are getting very interesting (fuzzy matching, trigrams, k nearest
neighbor etc) and as an analyst/data cleaner they are very useful.  A usable
frontend to them would be great...
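
(For what it's worth, a small sketch of the trigram side of that, via
psycopg2; the names table and connection string are made up, and it assumes
the pg_trgm contrib module is installed in the database.)

    import psycopg2

    conn = psycopg2.connect("dbname=analysis")  # hypothetical database
    cur = conn.cursor()
    # Trigram similarity between two strings, 0.0 (nothing) to 1.0 (identical).
    cur.execute("SELECT similarity('Raznick', 'Raznik')")
    print(cur.fetchone())
    # Fuzzy match against a hypothetical names table; pg_trgm also provides
    # a % similarity operator and, with a trigram index, fast nearest-match
    # lookups.
    cur.execute("""
        SELECT name, similarity(name, %s) AS sim
        FROM names
        ORDER BY sim DESC
        LIMIT 5
    """, ("Raznick",))
    print(cur.fetchall())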


>
> Other downsides of PostgreSQL:
>
> * Harder to set up locally. If people are working on data locally, they
> need to install a daemon and setup permissions for it. There isn't an
> embedded mode.
>
> http://stackoverflow.com/questions/60285/if-possible-how-can-one-embed-postgresql
>
This is definitely the biggest problem.


> Would be interested to hear from CKAN people what they think the
> various tradeoffs are.
>
> Francis
>
Thanks for your interesting feedback.