[School-of-data] BIG data... What is BIG?
Simon Cropper
simoncropper at fossworkflowguides.com
Fri Jul 11 01:57:35 UTC 2014
Thanks Stefan,
I liked your response.
To me the overriding theme in everyone's definition is whether the
analysis can be carried out on a single machine. It is interesting that
an adjective for a dataset -- that is, 'BIG' -- actually seems to have
nothing to do with size and more to do with the hardware required to
process the data -- requirements that will obviously change over time as
new 'super' computers become available to researchers.
You also reminded me of another issue I wanted to discuss regarding data
-- what open data actually is. I will start another thread for that
discussion.
On 10/07/14 01:59, Stefan Urbanek wrote:
>
> On Jul 9, 2014, at 11:24 AM, Friedrich Lindenberg <friedrich at pudo.org> wrote:
>
>> I guess a good definition of big data is that it’s a term used to sell tech to folks who may not otherwise need it. In that case, the limit is whatever the tool you’re selling can process.
>>
>
> Pretty much agreed.
>
>> It’s also a great way for columnists to discuss the decline of western civilisation symbolised by Mark Zuckerberg. In that case, it’s everything that gets collected by Americans.
>>
>
> BIG depends not only on the amount of actual data, but also on the data required to make sense of those data, your computational capacity, the required algorithms and their complexity (time and memory), and timeliness (do we want it in real time?)… many factors. Big data today might be small data tomorrow.
>
> There are very few companies and people in this world that have a real big data problem, such as the one mentioned by Friedrich.
>
> When I think of big data, I think of data that have roughly the following properties:
>
> 1. the data required for an atomic operation do not fit into a single machine's memory (RAM)
> 2. the CPUs on a single machine do not deliver the required timeliness and the process needs to be parallelized
>
> Other people might have a different point of view.
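
Point 1 is the one that bites first, I suspect. Just to make it concrete
for myself, here is the kind of back-of-envelope check I would do (plain
Python; every number below is an illustrative assumption, not a real
measurement):

    # Rough check: does the working set for one operation fit in RAM?
    rows = 50_000_000        # assumed number of records
    bytes_per_row = 200      # assumed average in-memory size of one record
    overhead = 3             # rule of thumb: decoded objects and indexes inflate raw size

    working_set_gb = rows * bytes_per_row * overhead / 1024**3
    available_ram_gb = 16    # assumed RAM on the analysis machine

    print(f"Estimated working set: {working_set_gb:.1f} GB")
    print(f"Fits in RAM: {working_set_gb < available_ram_gb}")
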
>
> Note that the required derived data might (and usually will) increase memory requirements. Derived data might be, for example:
>
> * interaction graph/network data constructed from a linear dataset
> * indexes (for querying, aggregation or searching)
> * pre-computation and pre-aggregation of variables – for faster downstream processing
> * system annotation (for ETL quality assurance)
> * … many more ...
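
That list rang true. As a toy illustration of how quickly derived data
can grow (this is just a sketch, not any particular tool), even a simple
inverted index built over a flat table can rival the table itself in
memory:

    from collections import defaultdict

    # Toy flat dataset: (record_id, text) pairs
    records = [(i, f"word{i % 1000} word{i % 37} word{i % 7}")
               for i in range(100_000)]

    # Derived data: an inverted index mapping each token to the record ids
    # that contain it -- every (token, id) pairing is stored again.
    index = defaultdict(list)
    for rec_id, text in records:
        for token in text.split():
            index[token].append(rec_id)

    print("records:", len(records),
          "index postings:", sum(len(v) for v in index.values()))

And that is before interaction graphs, pre-aggregations or ETL
annotations are layered on top.
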
>
> Algorithm complexity [1] should be taken into account as well, as it might impose hidden memory and computational costs that are not immediately obvious to the data analyst or engineer. How much memory is your problem going to need? Say your dataset is 1M rows and your algorithm is, from a memory perspective, god forbid, O(n^2) -- that implies storage for roughly 1*10^12 values. Now imagine that the algorithm also requires the whole dataset to be in memory to be able to work with it.
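
I had to do the arithmetic on that to appreciate it. Taking a full
pairwise (n x n) matrix as a stand-in for the O(n^2) case (the figures
are only illustrative):

    # Back-of-envelope cost of an O(n^2) memory footprint,
    # e.g. a full pairwise similarity/distance matrix.
    n = 1_000_000            # rows in a "modest" dataset
    bytes_per_cell = 8       # one float64 per pair

    pairwise = n * n * bytes_per_cell
    print(f"O(n^2) matrix: {pairwise / 1024**4:.1f} TiB")           # ~7.3 TiB

    # The raw data, by comparison, might be tiny:
    bytes_per_row = 100      # assumed average row size
    print(f"Raw dataset:   {n * bytes_per_row / 1024**3:.2f} GiB")  # ~0.09 GiB

A dataset that fits on a USB stick turns into a multi-terabyte memory
problem purely because of the algorithm chosen.
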
>
> The examples above do not illustrate the BIG data problem itself; they just show what might turn your problem into a BIG data problem.
>
> Basically, you have a big data problem if it can be solved only by using a distributed network of devices for processing (CPUs, GPUs) and for storing the data (caches, RAM, disks). The converse does not hold, though: using a distributed network of such devices does not necessarily mean that you are solving a big data problem.
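
A related sanity check: the timeliness point above can often be bought
with plain single-machine parallelism before any cluster is needed. A
minimal sketch with nothing but the standard library (the per-chunk work
is just a placeholder):

    from multiprocessing import Pool

    def summarise(chunk):
        # Placeholder per-chunk work; in practice: parsing, aggregation,
        # scoring, etc.
        return sum(chunk)

    if __name__ == "__main__":
        n, step = 10_000_000, 1_000_000
        chunks = [range(i, min(i + step, n)) for i in range(0, n, step)]

        # Fan the chunks out across local CPU cores; only when this (plus
        # out-of-core storage) is still too slow does a distributed
        # cluster start to look justified.
        with Pool() as pool:
            partials = pool.map(summarise, chunks)
        print("total:", sum(partials))
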
>
> Someone said: “Big data is data that does not fit into a spreadsheet”. When I look around and listen to people talking about big data, that often seems to be true...
>
> Concerning open data, I have yet to see a dataset AND a problem together that make a BIG data problem. If you know of any, I would be more than happy to hear about it. I would love to see one and touch it.
>
> Cheers,
>
> Stefan
>
> [1] https://en.wikipedia.org/wiki/Computational_complexity_theory
>
>> Best,
>>
>> - Friedrich
>>
>> p.s. I honestly don’t think there’s a good definition. “Not reasonably processable on one machine” could be a guideline, but that can be a good dozen terabytes these days?
>>
>> On 09 Jul 2014, at 16:53, Simon Cropper <simoncropper at fossworkflowguides.com> wrote:
>>
>>> Hi,
>>>
>>> I have been exploring various projects that claim to handle BIG data, but to be honest most do not specify what BIG actually means.
>>>
>>> I remember the days when programs specified the maximum number of records, fields and tables in a database that could be manipulated at any one time. Why aren't these kinds of specs provided for languages and libraries anymore?
>>>
>>> What are people's impression of what BIG actually means when used to describe large datasets?
>>>
>>> To me BIG is millions of records and multiple linked tables.
>>>
>
--
Cheers Simon
Simon Cropper - Open Content Creator
Free and Open Source Software Workflow Guides
------------------------------------------------------------
Introduction http://www.fossworkflowguides.com
GIS Packages http://www.fossworkflowguides.com/gis
bash / Python http://www.fossworkflowguides.com/scripting