[ckan-dev] Dat Integration

Karissa McKelvey karissa.mckelvey at gmail.com
Tue Sep 15 03:37:50 UTC 2015


Dat has the ability to write binary files to disk, in which case you
don't need to supply a key. In that mode, dat doesn't parse the data
using a tabular parser (e.g., csv, json, xlsx).

In that scenario, though, it can't do diffs by row. If the user
supplies the key (probably best done through the UI), then it could
version the differences by rows in the table.
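
To make the role of the key concrete, here is a minimal Python sketch of
a keyed row diff. It is only an illustration and not how dat is
implemented; the function names (`rows_by_key`, `diff_rows`), the sample
data, and the choice of `id` as the key column are all assumptions made
up for the example.

```python
import csv
import io


def rows_by_key(csv_text, key):
    """Index the rows of a CSV document by the value in the `key` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old_csv, new_csv, key):
    """Return the keys that were added, removed, or changed between versions."""
    old = rows_by_key(old_csv, key)
    new = rows_by_key(new_csv, key)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A row counts as changed when the key exists in both versions
    # but at least one other column differs.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


# Hypothetical sample data: two versions of the same small table.
v1 = "id,city,population\n1,Oakland,400000\n2,Portland,600000\n"
v2 = "id,city,population\n1,Oakland,420000\n3,Austin,900000\n"

added, removed, changed = diff_rows(v1, v2, key="id")
print("added:", added, "removed:", removed, "changed:", changed)
```

Without a key column there is nothing to match a row in one version to
its counterpart in the next, so every row of a new upload looks like new
data, which is the problem Marianne describes below.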

On Mon, Sep 14, 2015 at 7:50 PM, Marianne Bellotti
<marianne at exversion.com> wrote:
> Well since Joel is going to pull me into this discussion I might as well
> give my thoughts :)
>
> The one thing I keep coming back to with Dat integration for CKAN is keys.
> How will CKAN know which column in the dataset to use as the primary key for
> Dat's version control? Without assigning a key any new versions will just be
> added as new data in Dat so it's sort of an important thing.
>
> Does CKAN guess it? Does the user assign it through the resource form? If
> the user assigns it how do you communicate to non-technical people exactly
> what information they are supposed to supply? What happens when someone does
> something wrong? Is version control turned off or just allowed to run
> incorrectly? What happens if the data doesn't actually have a unique key
> that the user can assign?
>
> I still have not come up with good answers to these questions.
>
> -Marianne
>
> On Mon, Sep 14, 2015 at 10:20 PM, <ckan-dev-request at lists.okfn.org> wrote:
>>
>> Message: 1
>> Date: Tue, 15 Sep 2015 12:20:26 +1000
>> From: Steven De Costa <steven.decosta at linkdigital.com.au>
>> To: CKAN Development Discussions <ckan-dev at lists.okfn.org>
>> Subject: Re: [ckan-dev] Future, flask, breaking things, funding.
>>
>> I'll be in San Francisco 4-6 October if you wanted to catch up and look at
>> it together Karissa?
>>
>> I also have some thoughts about remaining flexible in the storage types
>> that CKAN might support. Basically, it would be nice if these were
>> abstracted into an API and created via the admin as provisioning requests.
>> This would allow a platform to provision a variety of storage options and
>> enable them at a resource level similar to the resource views at the UI
>> level. It would also allow for network level security models to be
>> employed, or data storage sovereignty to be maintained in accordance with
>> jurisdictional or security classification. Maybe we could call these
>> resource containers?
>>
>> Happy to catch up with anyone in SF re CKAN :) In fact, happy to run a
>> meetup there if there is interest... physical + video conference.
>>
>> I'm in Vegas for re:Invent on the 7th to 9th too :) It would be good to
>> form a huddle of CKANers at the re:Play party on the 8th!
>>
>> Cheers,
>> Steven
>>
>> *STEVEN DE COSTA* | *EXECUTIVE DIRECTOR*
>> www.linkdigital.com.au
>>
>>
>>
>> On 15 September 2015 at 08:41, Karissa McKelvey
>> <karissa.mckelvey at gmail.com>
>> wrote:
>>
>> > I think Dat would be a great way to allow programmatic access to
>> > datasets
>> > in CKAN. Dat handles streaming data very well. I imagine being able to
>> > replace the `.csv` with a `.dat` and get streaming and incremental
>> > uploads
>> > and downloads.
>> >
>> > Dat has a two-phase sync process: the first phase computes the
>> > differences between the local and remote copy, and the second syncs
>> > the data that is different. As a result, users never have to download
>> > the same data twice, which also reduces bandwidth costs for the host.
>> > Because dat knows the differences between each data version, it is
>> > also a really lightweight way to see an overview of previous data
>> > versions for a single dataset.
>> >
>> > I'd be happy to chat more about how this might work in practice!
>> >
>> > Cheers,
>> >
>> > On Mon, Sep 14, 2015 at 3:37 PM, Karissa McKelvey <
>> > karissa.mckelvey at gmail.com> wrote:
>> >
>> >>
>> >> On Mon, Sep 14, 2015 at 2:51 PM, Joel Natividad <
>> >> joel.natividad at ontodia.com> wrote:
>> >>
>> >>> Hi all,
>> >>> What about integrating with Dat <http://dat-data.com>?
>> >>>
>> >>> It handles streaming data; can handle huge datasets; can do deltas (no
>> >>> need to re-download a huge dataset over and over again); has versions
>> >>> (not just revisions, as data consumers have legitimate reasons to use
>> >>> different versions of data, down to the row level); and makes CKAN
>> >>> more "dog-fooding" friendly (i.e. publishers using it not only to
>> >>> publish data, but to actually build solutions).
>> >>>
>> >>> Marianne Bellotti (CKAN-powered HDX) and I independently spent some
>> >>> quality time with Karissa McKelvey - one of the three key developers
>> >>> behind Dat <http://dat-data.com/team>, when she was in NYC last month
>> >>> and discussed at length how Dat + CKAN can work together.
>> >>>
>> >>> Karissa even put together a rough spec on a "ckanext-dat" extension.
>> >>>
>> >>> FYI, Dat is supported by usopendata.org
>> >>> <https://usopendata.org/2015/07/29/dat-beta/>, which also happens to
>> >>> be
>> >>> the org behind CKAN-Multisite, which was just announced as generally
>> >>> available today. <https://usopendata.org/2015/09/14/ckan-multisite/>
>> >>>
>> >>> Best,
>> >>> Joel
>> >>>
>> >>>
>> >>> --
>> >>> Joel Natividad
>> >>> +1 347-565-5635
>> >>> @jqnatividad
>> >>>
>> >>> Ontodia, Inc.
>> >>> 137 Varick Street, 2nd Floor, New York, NY 10013
>> >>>
>> >>> On Mon, Sep 14, 2015 at 5:11 PM, Steven De Costa <
>> >>> steven.decosta at linkdigital.com.au> wrote:
>> >>>
>> >>>> I'm 'all in' on this discussion :) I'll set up a doodle and we can
>> >>>> pick a time to do a video call...
>> >>>>
>> >>>> My 2c on some points.
>> >>>>
>> >>>> 1. Perhaps redev could be bottom up. Start with resources and widen
>> >>>> its ability. CRUD can then be rebuilt over the top.
>> >>>> 2. Carefully consider the longest term possible and how the app may
>> >>>> mature in the future.
>> >>>> 3. Consider interoperability between n+1 platforms via linked open
>> >>>> data, again with realtime in mind
>> >>>> 4. Consider packages further. Could we add new package types that are
>> >>>> built on 3.0 thinking and have them coexist with current packages?
>> >>>> If so, then existing extensions could be modified less dramatically
>> >>>> to apply only to v2 packages.
>> >>>> 5. Think about migration scenarios. Could a v2 CKAN remain as a dumb
>> >>>> web app harvesting from a 3.0? If so, we could prioritise workflows
>> >>>> around custodians and ETL before end users.
>> >>>> 6. Yes I'm sure others in the steering group would support the work.
>> >>>> Just remember they are also just volunteers :)
>> >>>> 7. Yes I'm sure funding could come from the Association, just so long
>> >>>> as funding first goes into the association. So, we'd all have a part
>> >>>> to
>> >>>> play in signing up paying members - happy to take any leads from
>> >>>> people on
>> >>>> that point :)
>> >>>>
>> >>>> Hoots!
>> >>>>
>> >>>>
>> >>>> On Tuesday, September 15, 2015, Denis Zgonjanin <
>> >>>> deniszgonjanin at gmail.com> wrote:
>> >>>>
>> >>>>> Yes, we should think of use cases. Realtime data is just one. I'm
>> >>>>> not
>> >>>>> just talking about things we might want to do. Here are the current
>> >>>>> things
>> >>>>> in CKAN that would benefit from better asynchronous support:
>> >>>>>
>> >>>>> - Datastore & Datapusher. We could integrate datapusher into CKAN,
>> >>>>> so
>> >>>>> people don't need to set up an additional web service just to use
>> >>>>> stock
>> >>>>> CKAN.
>> >>>>> - Harvesting. Set up a periodic callback that calls harvest sources
>> >>>>> every hour. Super easy when compared to having to set up
>> >>>>> Redis/ZeroMQ, and another 3(!) long-running processes running in
>> >>>>> the background.
>> >>>>> - Webhooks. They must be pushed off to a celery queue because of
>> >>>>> Pylons. With async they could be fired off easily.
>> >>>>> - Analytics & analytics reports; Sending automated emails and other
>> >>>>> automated tasks.
>> >>>>> - Anything where right now we have to set up cron jobs.
>> >>>>>
>> >>>>> And probably most importantly - CKAN is going to need a face lift
>> >>>>> eventually if it's to remain relevant. It can't be stuck in CRUD
>> >>>>> land forever. There is plenty of time for this, no rush. But
>> >>>>> building cool shiny new things with fancy front-end javascript
>> >>>>> would be hard right now. It will be hard on any web framework built
>> >>>>> on the idea that your whole application context is transferred to
>> >>>>> the user on every HTTP request, and that nothing else except that
>> >>>>> is going on in the backend.
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Sep 14, 2015 at 9:34 AM, Stéphane Guidoin <
>> >>>>> stephane.guidoin at gmail.com> wrote:
>> >>>>>
>> >>>>>> *Now that government is (slowly) catching on, more stream, API, and
>> >>>>>> even real-time data is being published. CKAN doesn't do a great job
>> >>>>>> here.
>> >>>>>> The biggest obstacle to creating nice extensions to CKAN for
>> >>>>>> non-file data
>> >>>>>> is that Pylons is still firmly stuck within the HTTP
>> >>>>>> request-response
>> >>>>>> lifecycle. *
>> >>>>>>
>> >>>>>> I wonder what should be the role of CKAN when it comes to APIs,
>> >>>>>> streams and other things. Those things tend to be fairly resource
>> >>>>>> intensive and most of the time, they are developed and hosted on
>> >>>>>> their own, not on the open data portal. So what should be the role
>> >>>>>> of CKAN on this? How much do we want to be able to integrate CKAN
>> >>>>>> with APIs and streams, and what should it give?
>> >>>>>>
>> >>>>>> From my point of view, moving to Flask or another framework is
>> >>>>>> mostly a question of technical debt (
>> >>>>>> https://18f.gsa.gov/2015/08/07/technical-debt-1/) and making sure
>> >>>>>> CKAN remains flexible (and built-in async would indeed help).
>> >>>>>>
>> >>>>>> When it comes to supporting realtime data, even if it's mainly to
>> >>>>>> enable extension development, some thinking about use cases is
>> >>>>>> needed in order to avoid jumping into something that would be very
>> >>>>>> time intensive in terms of dev.
>> >>>>>>
>> >>>>>> Stéphane
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 2015-09-14 08:57, Denis Zgonjanin wrote:
>> >>>>>>
>> >>>>>> Right now CKAN is great for static sources of data, which is really
>> >>>>>> all that existed from government sources when CKAN was first
>> >>>>>> written.
>> >>>>>>
>> >>>>>> Now that government is (slowly) catching on, more stream, API, and
>> >>>>>> even real-time data is being published. CKAN doesn't do a great job
>> >>>>>> here.
>> >>>>>> The biggest obstacle to creating nice extensions to CKAN for
>> >>>>>> non-file data
>> >>>>>> is that Pylons is still firmly stuck within the HTTP
>> >>>>>> request-response
>> >>>>>> lifecycle.
>> >>>>>>
>> >>>>>> This worked well for CRUD apps, but now is really showing its
>> >>>>>> limitations. It's hard to do anything in CKAN that doesn't take
>> >>>>>> place
>> >>>>>> within the context of a user's HTTP request. If you want to do some
>> >>>>>> extra
>> >>>>>> data processing on the side, you have to use celery queues or
>> >>>>>> worse, cron.
>> >>>>>> Worse yet, some people do try to put extra processing inside the
>> >>>>>> request-response lifecycle, causing problems.
>> >>>>>>
>> >>>>>> Even core CKAN is guilty of this. For example, CKAN will call
>> >>>>>> datapusher to send upload jobs and retrieve job results, and those
>> >>>>>> requests
>> >>>>>> to datapusher happen while the user is waiting for the request to
>> >>>>>> return.
>> >>>>>> This is kind of terrible. Not even because somebody did it this
>> >>>>>> way, but
>> >>>>>> because CKAN doesn't give you a sane alternative to do it properly.
>> >>>>>>
>> >>>>>> Porting CKAN to flask is no small feat, so let's make sure we do it
>> >>>>>> right. Now that we're not using CKAN to just host static files
>> >>>>>> anymore, we
>> >>>>>> need to have better, built-in async support in CKAN. Perhaps this
>> >>>>>> means
>> >>>>>> moving to Python 3 where we'll have asyncio (and hopefully a future
>> >>>>>> version
>> >>>>>> of flask will work well with it). Other frameworks, like tornado,
>> >>>>>> are also
>> >>>>>> quite lightweight and support this out of the box for python 2.x.
>> >>>>>>
>> >>>>>> - Denis
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Sep 14, 2015 at 3:56 AM, Angelos Tzotsos <
>> >>>>>> gcpp.kalxas at gmail.com> wrote:
>> >>>>>>
>> >>>>>>> On 09/14/2015 10:24 AM, Ross Jones wrote:
>> >>>>>>>
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> I've recently been playing about with implementing parts of CKAN
>> >>>>>>>> in
>> >>>>>>>> Flask side-by-side with the current Pylons implementation. I'm
>> >>>>>>>> doing it
>> >>>>>>>> like this so that it isn't immediately obvious that there's a
>> >>>>>>>> migration
>> >>>>>>>> happening towards using Flask (aka nothing breaks).  I don't
>> >>>>>>>> think this
>> >>>>>>>> branch should ever be merged, it's more exploratory but it has
>> >>>>>>>> raised some
>> >>>>>>>> questions that I think it would be good to discuss.
>> >>>>>>>>
>> >>>>>>>> WARNING:anecdata
>> >>>>>>>> It's pretty clear that the vast majority of people asked would
>> >>>>>>>> like
>> >>>>>>>> to move to Flask as a replacement for some layers of the system
>> >>>>>>>> (leaving
>> >>>>>>>> things like logic and plugins alone).
>> >>>>>>>> ENDWARNING
>> >>>>>>>>
>> >>>>>>>> We've discussed at the tech-team meetings, but I think a longer,
>> >>>>>>>> more accessible conversation would be beneficial.
>> >>>>>>>>
>> >>>>>>>> 1. What version of CKAN should be targeted? Common sense suggests
>> >>>>>>>> 3.0, but that being the case, exactly how far can we go in
>> >>>>>>>> breaking some
>> >>>>>>>> backward compatibility?  This isn't really a technical question -
>> >>>>>>>> would be
>> >>>>>>>> good to hear what the community would accept ?
>> >>>>>>>>
>> >>>>>>>> 2. Does it *really* need to be side-by-side?  Running Flask and
>> >>>>>>>> Pylons side-by-side means staying on Python 2 for another few
>> >>>>>>>> years
>> >>>>>>>> (because Pylons).  A reasonably deep incision and removal of
>> >>>>>>>> non-logic/non-plugin code would make a move to Py3 easier, but
>> >>>>>>>> with some
>> >>>>>>>> level of breakage in external plugins. Staying on 2 would mean a
>> >>>>>>>> move to 3
>> >>>>>>>> at a later date and more pain.
>> >>>>>>>>
>> >>>>>>>> 3. Would the CKAN Association like to fund someone to do some of
>> >>>>>>>> this work? This is just one of several ideas mentioned on
>> >>>>>>>> https://github.com/ckan/ideas-and-roadmap/issues/152 that really
>> >>>>>>>> needs to be done if CKAN is going to thrive instead of just
>> >>>>>>>> survive.
>> >>>>>>>>
>> >>>>>>>> Any feedback welcome?
>> >>>>>>>>
>> >>>>>>>> Cheers
>> >>>>>>>>
>> >>>>>>>> Ross.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> _______________________________________________
>> >>>>>>>> ckan-dev mailing list
>> >>>>>>>> ckan-dev at lists.okfn.org
>> >>>>>>>> https://lists.okfn.org/mailman/listinfo/ckan-dev
>> >>>>>>>> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>> Hi Ross,
>> >>>>>>>
>> >>>>>>> I believe that a Flask port (or rewrite) is an excellent idea for
>> >>>>>>> CKAN 3.0 in order to support Python 3.x
>> >>>>>>> The alternative would be to port Pylons to Python 3.x, which
>> >>>>>>> perhaps
>> >>>>>>> is a more difficult task...
>> >>>>>>>
>> >>>>>>> Given that Python 2.x will EOL relatively soon, CKAN should move
>> >>>>>>> forward.
>> >>>>>>>
>> >>>>>>> Just my 2 cents.
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Angelos
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Angelos Tzotsos, PhD
>> >>>>>>> OSGeo Charter Member
>> >>>>>>> http://users.ntua.gr/tzotsos
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>> --
>> >>>> *STEVEN DE COSTA *|
>> >>>> *EXECUTIVE DIRECTOR*www.linkdigital.com.au
>> >>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Karissa McKelvey
>> >> http://karissa.github.io/ <http://karissamck.com>
>> >>
>> >>
>> >
>> >
>> > --
>> > Karissa McKelvey
>> > http://karissa.github.io/
>> >
>> >
>
>



-- 
Karissa McKelvey
http://karissa.github.io/


