[ckan-dev] Views on data through the web

Francis Irving francis at scraperwiki.com
Sat Jun 18 11:16:31 UTC 2011


On Fri, Jun 17, 2011 at 01:24:10PM +0100, David Raznick wrote:
> When looking at this I thought sqlite was the natural choice.  However, I
> have got several reservations compared to using something like postgres.
> I think using sqlite in the long term will cause problems.

Me too - I argued with Julian for months about it.

The situation is much better than is at first apparent.

I can't answer for CKAN obviously, as I don't know what your exact use
case is. But for ScraperWiki I can describe the tradeoffs.
 
> * No connection pooling (it was invented for a purpose). You have to select
> a database from the file system and set up a connection each time.

Yes, or reimplement connection pooling. Details of how SQLite allocates
memory are here:

http://www.sqlite.org/malloc.html
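
(Reimplementing a basic pool is not much code, for what it's worth.
A minimal Python sketch - the path and pool size are hypothetical:)

    import queue
    import sqlite3

    class SQLitePool:
        """A tiny fixed-size pool of SQLite connections."""

        def __init__(self, path, size=5):
            self._pool = queue.Queue(maxsize=size)
            for _ in range(size):
                # check_same_thread=False lets a connection created
                # here be used from whichever thread checks it out.
                self._pool.put(
                    sqlite3.connect(path, check_same_thread=False))

        def acquire(self):
            # Blocks until a connection is free.
            return self._pool.get()

        def release(self, conn):
            self._pool.put(conn)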

> * Not too sure about security: it has no built-in security, and I imagine
> there is a way to break out to the filesystem if you allow arbitrary sql.

I've never researched this, but my hunch would be that it is more
secure than PostgreSQL for the purpose of running arbitrary SQL.

That's because it is used on most new mobile phones, where you can,
say, write arbitrary SQLite commands in HTML5 from any web page that
you browse.

i.e. I think that any such bug in SQLite would let you jailbreak an
iPhone.

It also has more test code than nearly any other application (75% of the code)
http://www.sqlite.org/testing.html

And yes, it does have "built-in security". You can set permissions
when you open a database, e.g. to make it read-only.
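
(In Python's sqlite3, for instance, a read-only open is a URI flag -
the file name here is hypothetical:)

    import sqlite3

    # mode=ro opens the database read-only: any INSERT or UPDATE
    # raises sqlite3.OperationalError.
    conn = sqlite3.connect("file:data.sqlite?mode=ro", uri=True)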

> * No decent rate limiting or query timeouts.
> * You have to roll your own client-server architecture anyway, which is
> bound to be inferior.
> * Concurrent access is most likely slower.

Yep - agreed, and a major reason I opposed us using SQLite.
ScraperWiki's use case is more unusual, though, than it at first
appears. (That said, a query timeout can at least be approximated
in-process - see the sketch below.)
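
A minimal Python sketch, using SQLite's progress handler (the helper
name and the 10,000-instruction granularity are my own choices):

    import sqlite3
    import time

    def run_with_timeout(conn, sql, seconds):
        # The progress handler fires every N SQLite VM instructions;
        # returning a truthy value aborts the running statement and
        # sqlite3 raises OperationalError("interrupted").
        deadline = time.monotonic() + seconds
        conn.set_progress_handler(
            lambda: time.monotonic() > deadline, 10000)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.set_progress_handler(None, 0)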

On ScraperWiki, we let people attach to multiple databases to read
from them at once.
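
A minimal sketch of what that looks like in Python (the file and
table names are made up):

    import sqlite3

    conn = sqlite3.connect("my_scraper.sqlite")
    # Attach a second scraper's datastore under the alias "other",
    # then join across the two files in a single query.
    conn.execute("ATTACH DATABASE 'other_scraper.sqlite' AS other")
    rows = conn.execute("""
        SELECT a.url, b.category
        FROM pages AS a
        JOIN other.labels AS b ON a.url = b.url
    """).fetchall()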

As far as I know, neither PostgreSQL nor MySQL supports that mode in a
cluster setup, i.e. if I shard databases across servers, I can't
connect separately to each server and do a join between them.
(e.g. see this question about MySQL on StackOverflow:
http://stackoverflow.com/questions/3582621/php-mysql-table-join-from-different-host )

That means when we reach that level of scaling, we'll have to write
our own code anyway to handle the pooling.

> * Can't get two processes to insert data at the same time into the same
> database.  (So two processes scraping at the same time is problematic.)

For our use at ScraperWiki that is a feature. We have a deliberate
design constraint that scrapers should write only to their own
datastore. 

There are some downsides to that, but there are also lots of
advantages in terms of simplicity and provenance. You know for sure
what code, and what websites, the data in a particular datastore came
from.
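
(And if you ever did need two writers on one file, a busy timeout at
least softens the locking, though it doesn't remove the
one-writer-at-a-time limit. A hypothetical sketch, not what
ScraperWiki does:)

    import sqlite3

    # timeout=5.0 makes a writer that hits the file's write lock
    # retry for up to 5 seconds instead of failing immediately with
    # "database is locked".
    conn = sqlite3.connect("scraper.sqlite", timeout=5.0)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT)")
    conn.execute("INSERT INTO pages (url) VALUES (?)",
                 ("http://example.com",))
    conn.commit()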

> * The sql query language is much more limited for interesting analysis.
> Things like postgis, xml, window functions... bla bla bla.

> * No projects like hadoopdb that use clustered postgres instances for
> large data analysis.

I think data large enough to need Hadoop is fundamentally different.

> * I think that dumps should be the standard unit of data (they are
> essentially well-thought-through csv).  COPY TO / COPY FROM in postgres
> is *very fast*, ~ 1/4 million rows/sec.
> * Not so fast for large datasets or tables.
 
> The big downside of using postgresql is that you will need someone who
> knows how to set it up correctly for this kind of workload.  It may be
> worth it though.

Umm, by the time you have enough load that you need somebody who knows
how to set it up, you'll be able to find / afford one, I'd expect!
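
(For reference, the COPY bulk load David mentions is easy to drive
from Python - a sketch assuming the psycopg2 driver, with made-up
table, columns and file:)

    import psycopg2

    conn = psycopg2.connect("dbname=datastore")
    cur = conn.cursor()
    # COPY streams the whole file in one round trip, which is how
    # postgres reaches those bulk-load speeds.
    with open("/tmp/pages.csv") as f:
        cur.copy_expert(
            "COPY pages (url, title) FROM STDIN WITH CSV HEADER", f)
    conn.commit()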

A big unspoken thing is that most unanalysed public datasets are
*tiny* by modern IT standards. That is a change that has crept up on
us over the last 10 years. But it is very significant.

Other downsides of PostgreSQL:

* Harder to set up locally. If people are working on data locally, they
need to install a daemon and set up permissions for it. There isn't an
embedded mode.
http://stackoverflow.com/questions/60285/if-possible-how-can-one-embed-postgresql
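
By contrast, the embedded case in SQLite is one line, with no daemon
or permissions to set up (the file name is made up):

    import sqlite3

    # Creates local.sqlite in the working directory if it doesn't
    # already exist; no server process, no users, no configuration.
    conn = sqlite3.connect("local.sqlite")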

Would be interested to hear from CKAN people what they think the
various tradeoffs are.

Francis



