[School-of-data] BIG data... What is BIG?

Stefan Urbanek stefan.urbanek at gmail.com
Wed Jul 9 20:00:22 UTC 2014


Thanks Julian, great find. That is a nice example of a huge dataset that does not necessarily have to be a big data problem. 4m rows/month is far from "big data". The last project I worked on handled ~70m rows per day and was still not considered a big data problem – just streaming and aggregation here and there. If we had to do a few joins with other dimensions (large or not), then we might approach a certain big data problem…

Which reminds me that we should talk about “big data problems” rather than “big data” (as in volume).

Stefan

P.S.: Speaking of which, a bit off-topic … it would be nice to have a collection of those “huge open datasets” just to test the performance of open-source tools and possibly allow a big data *problem* to be created …

On Jul 9, 2014, at 3:46 PM, Julian Tait <julian at thegarden.io> wrote:

> Hi Stefan,
> 
> I think the data that is being released by the National Health Service in the UK around prescribed medicines could be considered BIG: over 1GB per month, approx. 4 million lines. See http://www.hscic.gov.uk/searchcatalogue?productid=12419&topics=0%2fPrescribing&infotype=0%2fOpen+data&sort=Relevance&size=10&page=1#top. It has the potential to help solve BIG problems, such as the prevalence of proprietary medication over generic (a big cost for the public health service), or whether certain areas are anomalous in the way that anti-depressants are used, etc.
> 
> Cheers
> 
> Julian
> 
> 9 Jul 2014, at 20:25, Stefan Urbanek wrote:
> 
>> 
>> On Jul 9, 2014, at 12:55 PM, Laura James <laura.james at okfn.org> wrote:
>> 
>>> On 9 July 2014 16:59, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
>>> 
>>> Concerning Open-Data, I have yet to see a dataset AND a problem to say that it is a BIG data problem. If you know about any, I would be more than happy to know about it. I would love to see one and touch it.
>>> 
>>> Depending on the definition of big data, something like the Ensembl human genome data? (210GB annotated version)
>>> 
>>> I imagine there may also be some pretty large NASA-generated open data sets. 
>>> 
>> 
>> Ah, you are right. Thanks. I didn’t count them in, because I think of them more as “scientific data” that happen to be open. Therefore I would rephrase myself as “I have yet to see a non-scientific open-data dataset(s) and a problem that is a big data problem” :-)
>> 
>> Stefan
>> 
>> _______________________________________________
>> school-of-data mailing list
>> school-of-data at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/school-of-data
>> Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
> 
