[School-of-data] BIG data... What is BIG? (Christy Okpo)

E.C Okpo ecokpo at gmail.com
Thu Jul 10 04:10:28 UTC 2014


I also think Big Data can refer to the breadth of data captured.

Take, for example, the eCommerce space, where data collection on customer
behavior is geared towards gathering an exhaustively wide range of
variables, so as to get a fuller picture of that customer's buying
patterns.

Datasets can be millions of rows but still be quite simple. In cases where
the end goal is discovering some hidden behavioral or other subtle
patterns, the breadth of data collected is truly what makes it BIG.

Algorithmic complexity, distributed computing, computational costs and
scalability are some of the issues that arise because of this breadth.
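
To put rough numbers on the breadth point, here is a minimal sketch. It
assumes a hypothetical customer table of 1M rows where every pair of
variables is a candidate interaction and every value takes 8 bytes; the
function names and column counts are illustrative only, not taken from any
particular tool.

```python
from math import comb


def pairwise_interactions(n_columns: int) -> int:
    """Number of distinct unordered pairs of columns (variables)."""
    return comb(n_columns, 2)


def dense_table_bytes(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> int:
    """Memory needed to hold a dense table, assuming fixed-size values."""
    return n_rows * n_columns * bytes_per_value


if __name__ == "__main__":
    n_rows = 1_000_000  # assumed row count, for illustration
    for n_columns in (10, 100, 1000):
        print(
            n_columns,
            pairwise_interactions(n_columns),
            dense_table_bytes(n_rows, n_columns),
        )
```

The row count is held fixed here: the table itself grows linearly with the
number of columns, while the number of candidate pairs grows quadratically.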

Cheers,

Christy

On Wed, Jul 9, 2014 at 10:55 AM, <school-of-data-request at lists.okfn.org>
wrote:

> Send school-of-data mailing list submissions to
>         school-of-data at lists.okfn.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.okfn.org/mailman/listinfo/school-of-data
> or, via email, send a message with subject or body 'help' to
>         school-of-data-request at lists.okfn.org
>
> You can reach the person managing the list at
>         school-of-data-owner at lists.okfn.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of school-of-data digest..."
>
>
> Today's Topics:
>
>    1. BIG data... What is BIG? (Simon Cropper)
>    2. Re: BIG data... What is BIG? (Friedrich Lindenberg)
>    3. Re: BIG data... What is BIG? (Stefan Urbanek)
>    4. Re: BIG data... What is BIG? (Clément Renaud)
>    5. Re: BIG data... What is BIG? (Laura James)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 10 Jul 2014 00:53:32 +1000
> From: Simon Cropper <simoncropper at fossworkflowguides.com>
> To: school-of-data at lists.okfn.org
> Subject: [School-of-data] BIG data... What is BIG?
> Message-ID: <53BD576C.7030205 at fossworkflowguides.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
>
> I have been exploring various projects that claim to handle BIG data but
> to be honest most do not qualify what BIG actually means.
>
> I remember the days when programs specified the maximum number of
> records, maximum number of fields and maximum number of tables in a
> database that could be manipulated at any one time. Why don't these
> types of specs get provided for languages and libraries anymore?
>
> What are people's impressions of what BIG actually means when used to
> describe large datasets?
>
> To me BIG is millions of records and multiple linked tables.
>
> --
> Cheers Simon
>
>     Simon Cropper - Open Content Creator
>
>     Free and Open Source Software Workflow Guides
>     ------------------------------------------------------------
>     Introduction               http://www.fossworkflowguides.com
>     GIS Packages           http://www.fossworkflowguides.com/gis
>     bash / Python    http://www.fossworkflowguides.com/scripting
>
>
>
> ------------------------------
>
> Message: 2
> Date: Wed, 9 Jul 2014 17:24:13 +0200
> From: Friedrich Lindenberg <friedrich at pudo.org>
> To: "Mailing list for the School of Data, a joint initiative of the
>         OKFN and P2PU" <school-of-data at lists.okfn.org>
> Subject: Re: [School-of-data] BIG data... What is BIG?
> Message-ID: <8C241218-C26E-4496-985B-7EAA45C438D4 at pudo.org>
> Content-Type: text/plain; charset="windows-1252"
>
> I guess a good definition of big data is that it's a term used to sell
> tech to folks who may not otherwise need it. In that case, the limit is
> whatever the tool you're selling can process.
>
> It's also a great way for columnists to discuss the decline of western
> civilisation symbolised by Mark Zuckerberg. In that case, it's everything
> that gets collected by Americans.
>
> Best,
>
> - Friedrich
>
> p.s. I honestly don't think there's a good definition. "Not reasonably
> processable on one machine" could be a guideline, but that can be a good
> dozen terabytes these days...
>
> > _______________________________________________
> > school-of-data mailing list
> > school-of-data at lists.okfn.org
> > https://lists.okfn.org/mailman/listinfo/school-of-data
> > Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 9 Jul 2014 11:59:58 -0400
> From: Stefan Urbanek <stefan.urbanek at gmail.com>
> To: "Mailing list for the School of Data, a joint initiative of the
>         OKFN and P2PU" <school-of-data at lists.okfn.org>
> Subject: Re: [School-of-data] BIG data... What is BIG?
> Message-ID: <80420B66-2F4A-46AF-BFD7-477F5974F3DC at gmail.com>
> Content-Type: text/plain; charset=windows-1252
>
>
> On Jul 9, 2014, at 11:24 AM, Friedrich Lindenberg <friedrich at pudo.org>
> wrote:
>
> > I guess a good definition of big data is that it's a term used to sell
> > tech to folks who may not otherwise need it. In that case, the limit is
> > whatever the tool you're selling can process.
> >
>
> Pretty much agreed.
>
> > It's also a great way for columnists to discuss the decline of western
> > civilisation symbolised by Mark Zuckerberg. In that case, it's everything
> > that gets collected by Americans.
> >
>
> BIG depends not only on the amount of actual data, but also on the data
> required to make sense of those data, your computational capacity, the
> required algorithms and their complexity (time and memory), timeliness (do
> we want it to be real-time?) and many other factors. Big data today might
> be small data tomorrow.
>
> There are very few companies and people in this world that have a real big
> data problem, such as the one mentioned by Friedrich.
>
> When I think of big data I think of data that have roughly the following
> properties:
>
> 1. data required for an atomic operation do not fit into the memory (RAM)
> of a single machine
> 2. CPUs on a single machine do not deliver the required timeliness, so the
> process needs to be parallelized
>
> Other people might have a different point of view.
>
> Note that required derived data might (and usually will) increase memory
> requirements. Derived data might be for example:
>
> * interaction graph/network data constructed from a linear dataset
> * indexes (for querying, aggregation or searching)
> * precomputation and pre-aggregation of variables, for faster further
> processing
> * system annotation (for ETL quality assurance)
> * ... and many more
>
> Algorithm complexity [1] should be taken into account as well, as it might
> impose hidden memory and computational costs that are not immediately
> obvious to the data analyst or engineer. How much memory is your problem
> going to need? Say your data set is 1M rows and your algorithm is, from a
> memory perspective, god forbid, O(n^2): that makes a requirement for
> 1*10^12 items to be held. Now imagine that your algorithm also requires the
> whole dataset to be in memory to be able to work with it.
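
The arithmetic in the paragraph above can be checked with a small sketch. It
assumes each of the n^2 items takes 8 bytes (for instance one 64-bit float);
that size, the decimal units and the function names are illustrative
assumptions, not part of the original message.

```python
def quadratic_memory_bytes(n_rows: int, bytes_per_item: int = 8) -> int:
    """Memory for an O(n^2) structure, e.g. a dense n-by-n matrix."""
    return n_rows ** 2 * bytes_per_item


def human_readable(n_bytes: int) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


if __name__ == "__main__":
    n_rows = 1_000_000  # the 1M-row data set from the example
    print(n_rows ** 2)  # number of items: 10^12
    print(human_readable(quadratic_memory_bytes(n_rows)))
```

Under those assumptions, 1M rows turn into 10^12 items and about 8 TB of
memory, which is well beyond the RAM of a typical single machine.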
>
> The examples above do not illustrate the BIG data problem itself; they just
> show what might turn your problem into a BIG data problem.
>
> Basically, you have a big data problem if it can be solved only using a
> distributed network of devices for processing (CPUs, GPUs) and storing the
> data (caches, RAM, disks). Even so, using a distributed network of such
> devices does not necessarily mean that you are solving a big data problem.
>
> Someone said: "Big data is data that does not fit into a spreadsheet".
> When I look around and listen to people talking about big data, it looks
> to be true...
>
> Concerning Open-Data, I have yet to see a dataset AND a problem to say
> that it is a BIG data problem. If you know about any, I would be more than
> happy to know about it. I would love to see one and touch it.
>
> Cheers,
>
> Stefan
>
> [1] https://en.wikipedia.org/wiki/Computational_complexity_theory
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Wed, 9 Jul 2014 18:00:09 +0200
> From: Clément Renaud <clement.renaud at gmail.com>
> To: "Mailing list for the School of Data, a joint initiative of the
>         OKFN and P2PU" <school-of-data at lists.okfn.org>
> Subject: Re: [School-of-data] BIG data... What is BIG?
> Message-ID:
>         <CAJLOONLm_posrL6_r_16+6D=
> tamsO47ObQKRR3D3EnSVTLyL0Q at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi there,
>
> One aspect of "big data" in engineering is that the database doesn't fit in
> a single machine and therefore is distributed among several computers
> (a large search engine's index, for instance). That sort of data requires
> developing specific algorithms that can handle concurrent tasks on
> multiple clusters of machines.
> I don't think something that doesn't fit in your RAM can be called big
> data.
>
> But I agree with Friedrich: it is mostly a keyword for the industry to get
> excited about something new.
> Unlike "Big Oil", it still has a sexy feeling for the audience when
> said by journalists or policy makers.
>
>
>
>
> --
> Clément Renaud
>
> @clemsos
> www.clementrenaud.com
>
> ------------------------------
>
> Message: 5
> Date: Wed, 9 Jul 2014 17:55:17 +0100
> From: Laura James <laura.james at okfn.org>
> To: "Mailing list for the School of Data, a joint initiative of the
>         OKFN and P2PU" <school-of-data at lists.okfn.org>
> Subject: Re: [School-of-data] BIG data... What is BIG?
> Message-ID:
>         <CADfaCdcQGcuUMP2H3DYRx9ZX5LXfPVM=
> JobKHYubqfcxyfJP8A at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On 9 July 2014 16:59, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
>
> >
> > Concerning Open-Data, I yet have to see a dataset AND a problem to say
> > that it is a BIG data problem. If you know about any, I would be more than
> > happy to know about it. I would love to see one and touch it.
> >
>
> Depending on the definition of big data, something like the Ensembl human
> genome data? (210GB annotated version)
>
> I imagine there may also be some pretty large NASA-generated open data
> sets.
>
> Laura
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> school-of-data mailing list
> school-of-data at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/school-of-data
> Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
>
> ------------------------------
>
> End of school-of-data Digest, Vol 24, Issue 4
> *********************************************
>



-- 
E.Christy Okpo
ecokpo at gmail.com
909-801-4408