[ckan-dev] Future, flask, breaking things, funding.

Karissa McKelvey karissa.mckelvey at gmail.com
Mon Sep 14 22:41:55 UTC 2015


I think Dat would be a great way to allow programmatic access to datasets
in CKAN. Dat handles streaming data very well. I imagine being able to
replace the `.csv` with a `.dat` and get streaming and incremental uploads
and downloads.

Dat has a two-phase sync process: the first phase computes the differences
between the local and remote copies, and the second syncs only the data that
differs. Users never have to download the same data twice, which also
reduces bandwidth costs for the host. Because Dat knows the differences
between each data version, it is also a really lightweight way to see an
overview of previous data versions for a single dataset.
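
To make the two-phase idea concrete, here is a minimal Python sketch. It is
not Dat's actual protocol or API: the row-level hashing and the function names
(`digest`, `compute_diff`, `sync`) are assumptions made purely for
illustration.

```python
import hashlib


def digest(row: str) -> str:
    """Content hash used to compare rows without sending the rows themselves."""
    return hashlib.sha256(row.encode("utf-8")).hexdigest()


def compute_diff(local_rows, remote_rows):
    """Phase 1: work out which remote rows the local copy is missing."""
    local_hashes = {digest(r) for r in local_rows}
    return [r for r in remote_rows if digest(r) not in local_hashes]


def sync(local_rows, remote_rows):
    """Phase 2: transfer only the missing rows; return how many were moved."""
    missing = compute_diff(local_rows, remote_rows)
    local_rows.extend(missing)
    return len(missing)


local = ["id,name", "1,alpha"]
remote = ["id,name", "1,alpha", "2,beta", "3,gamma"]
first = sync(local, remote)   # only the two new rows are transferred
second = sync(local, remote)  # a repeat sync has nothing left to transfer
```

A second sync against unchanged data moves nothing, which is the property
that saves bandwidth for both the user and the host.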

I'd be happy to chat more about how this might work in practice!

Cheers,

On Mon, Sep 14, 2015 at 3:37 PM, Karissa McKelvey <
karissa.mckelvey at gmail.com> wrote:

> On Mon, Sep 14, 2015 at 2:51 PM, Joel Natividad <
> joel.natividad at ontodia.com> wrote:
>
>> Hi all,
>> What about integrating with Dat <http://dat-data.com>?
>>
>> It handles streaming data; can handle huge datasets; can do deltas (no
>> need to re-download a huge dataset over and over again); has versions (not
>> just revisions, as data consumers have legitimate reasons to use different
>> versions of data, down to the row level); and makes CKAN more "dog-fooding"
>> friendly (i.e. publishers using it not only to publish data, but to
>> actually build solutions).
>>
>> Marianne Bellotti (CKAN-powered HDX) and I independently spent some
>> quality time with Karissa McKelvey - one of the three key developers
>> behind Dat <http://dat-data.com/team>, when she was in NYC last month
>> and discussed at length how Dat + CKAN can work together.
>>
>> Karissa even put together a rough spec on a "ckanext-dat" extension.
>>
>> FYI, Dat is supported by usopendata.org
>> <https://usopendata.org/2015/07/29/dat-beta/>, which also happens to be
>> the org behind CKAN-Multisite, which was just announced as generally
>> available today. <https://usopendata.org/2015/09/14/ckan-multisite/>
>>
>> Best,
>> Joel
>>
>>
>> --
>> Joel Natividad
>> +1 347-565-5635
>> @jqnatividad
>>
>> Ontodia, Inc.
>> 137 Varick Street, 2nd Floor, New York, NY 10013
>>
>> On Mon, Sep 14, 2015 at 5:11 PM, Steven De Costa <
>> steven.decosta at linkdigital.com.au> wrote:
>>
>>> I'm 'all in' on this discussion :) I'll set up a Doodle and we can pick a
>>> time to do a video call...
>>>
>>> My 2c on some points.
>>>
>>> 1. Perhaps redev could be bottom up. Start with resources and widen its
>>> ability. CRUD can then be rebuilt over the top.
>>> 2. Carefully consider the longest term possible and how the app may
>>> mature in the future.
>>> 3. Consider interoperability between n+1 platforms via linked open data,
>>> again with realtime in mind.
>>> 4. Consider packages further. Could we add new package types that are
>>> built on 3.0 thinking and have them coexist with current packages? If so
>>> then existing extensions could be modified less dramatically to apply only
>>> to v2 packages.
>>> 5. Think about migration scenarios. Could a v2 CKAN remain as a dumb web
>>> app harvesting from a 3.0? If so, we could prioritise workflows around
>>> custodians and ETL before end users.
>>> 6. Yes I'm sure others in the steering group would support the work.
>>> Just remember they are also just volunteers :)
>>> 7. Yes I'm sure funding could come from the Association, just so long as
>>> funding first goes into the association. So, we'd all have a part to play
>>> in signing up paying members - happy to take any leads from people on that
>>> point :)
>>>
>>> Hoots!
>>>
>>>
>>> On Tuesday, September 15, 2015, Denis Zgonjanin <
>>> deniszgonjanin at gmail.com> wrote:
>>>
>>>> Yes, we should think of use cases. Realtime data is just one. I'm not
>>>> just talking about things we might want to do. Here are the current things
>>>> in CKAN that would benefit from better asynchronous support:
>>>>
>>>> - Datastore & Datapusher. We could integrate datapusher into CKAN, so
>>>> people don't need to set up an additional web service just to use stock
>>>> CKAN.
>>>> - Harvesting. Set up a periodic callback that calls harvest sources
>>>> every hour. Super easy when compared to having to set up Redis/ZeroMQ, and
>>>> another 3(!) long-running processes running in the background.
>>>> - Webhooks. They must be pushed off to a Celery queue because of
>>>> Pylons. With async they could be fired off easily.
>>>> - Analytics & analytics reports; Sending automated emails and other
>>>> automated tasks.
>>>> - Anything where right now we have to set up cron jobs.
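
As a rough illustration of the periodic-callback and webhook items in the
list above, here is a small sketch using Python 3's asyncio. It is not CKAN
code: `harvest`, `fire_webhook`, `run_periodically`, the source names and the
example URL are all hypothetical stand-ins, and the real work is replaced by
a zero-length sleep.

```python
import asyncio


async def harvest(source, log):
    """Stand-in for fetching one harvest source (hypothetical)."""
    await asyncio.sleep(0)  # pretend network I/O
    log.append(("harvested", source))


async def fire_webhook(url, event, log):
    """Stand-in for POSTing an event to a subscriber (hypothetical)."""
    await asyncio.sleep(0)  # pretend network I/O
    log.append(("webhook", url, event))


async def run_periodically(interval, rounds, sources, log):
    """Harvest every source once per interval, for a fixed number of rounds."""
    for _ in range(rounds):
        await asyncio.gather(*(harvest(s, log) for s in sources))
        await asyncio.sleep(interval)


async def main():
    log = []
    # The periodic job runs inside the same process; no extra daemon or cron.
    harvester = asyncio.create_task(
        run_periodically(0.01, 2, ["source-a", "source-b"], log)
    )
    # A webhook is fired concurrently, without waiting for the harvester.
    hook = asyncio.create_task(
        fire_webhook("http://example.org/hook", "dataset.updated", log)
    )
    await asyncio.gather(harvester, hook)
    return log


events = asyncio.run(main())
```

In a real deployment the interval would be an hour rather than a fraction of
a second, and the loop would run for the lifetime of the application instead
of a fixed number of rounds.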
>>>>
>>>> And probably most importantly - CKAN is going to need a face lift
>>>> eventually if it's to remain relevant. It can't be stuck in CRUD land
>>>> forever. There is plenty of time for this, no rush. But building cool
>>>> shiny new things with fancy front-end JavaScript would be hard right now.
>>>> It will be hard on any web framework built on the idea that your whole
>>>> application context is transferred to the user on every HTTP request, and
>>>> that nothing else except that is going on in the backend.
>>>>
>>>>
>>>> On Mon, Sep 14, 2015 at 9:34 AM, Stéphane Guidoin <
>>>> stephane.guidoin at gmail.com> wrote:
>>>>
>>>>> *Now that government is (slowly) catching on, more stream, API, and
>>>>> even real-time data is being published. CKAN doesn't do a great job here.
>>>>> The biggest obstacle to creating nice extensions to CKAN for non-file data
>>>>> is that Pylons is still firmly stuck within the HTTP request-response
>>>>> lifecycle. *
>>>>>
>>>>> I wonder what the role of CKAN should be when it comes to APIs,
>>>>> streams and other things. These tend to be fairly resource intensive,
>>>>> and most of the time they are developed and hosted on their own, not on
>>>>> the open data portal. So what should CKAN's role be here? How much
>>>>> do we want to be able to integrate CKAN with APIs and streams, and what
>>>>> should it provide?
>>>>>
>>>>> From my point of view, moving to Flask or another framework is mostly a
>>>>> question of technical debt (
>>>>> https://18f.gsa.gov/2015/08/07/technical-debt-1/) and making sure
>>>>> CKAN remains flexible (and built-in async would indeed help).
>>>>>
>>>>> When it comes to supporting realtime data, even if it's mainly to
>>>>> enable extension development, some thinking about use cases is needed
>>>>> in order to avoid jumping into something that would be very time intensive
>>>>> in terms of dev.
>>>>>
>>>>> Stéphane
>>>>>
>>>>>
>>>>>
>>>>> On 2015-09-14 08:57, Denis Zgonjanin wrote:
>>>>>
>>>>> Right now CKAN is great for static sources of data, which is really
>>>>> all that existed from government sources when CKAN was first written.
>>>>>
>>>>> Now that government is (slowly) catching on, more stream, API, and
>>>>> even real-time data is being published. CKAN doesn't do a great job here.
>>>>> The biggest obstacle to creating nice extensions to CKAN for non-file data
>>>>> is that Pylons is still firmly stuck within the HTTP request-response
>>>>> lifecycle.
>>>>>
>>>>> This worked well for CRUD apps, but now is really showing its
>>>>> limitations. It's hard to do anything in CKAN that doesn't take place
>>>>> within the context of a user's HTTP request. If you want to do some extra
>>>>> data processing on the side, you have to use celery queues or worse, cron.
>>>>> Worse yet, some people do try to put extra processing inside the
>>>>> request-response lifecycle, causing problems.
>>>>>
>>>>> Even core CKAN is guilty of this. For example, CKAN will call
>>>>> datapusher to send upload jobs and retrieve job results, and those requests
>>>>> to datapusher happen while the user is waiting for the request to return.
>>>>> This is kind of terrible. Not even because somebody did it this way, but
>>>>> because CKAN doesn't give you a sane alternative to do it properly.
>>>>>
>>>>> Porting CKAN to Flask is no small feat, so let's make sure we do it
>>>>> right. Now that we're not using CKAN to just host static files anymore, we
>>>>> need to have better, built-in async support in CKAN. Perhaps this means
>>>>> moving to Python 3 where we'll have asyncio (and hopefully a future version
>>>>> of Flask will work well with it). Other frameworks, like Tornado, are also
>>>>> quite lightweight and support this out of the box for Python 2.x.
>>>>>
>>>>> - Denis
>>>>>
>>>>>
>>>>> On Mon, Sep 14, 2015 at 3:56 AM, Angelos Tzotsos <
>>>>> gcpp.kalxas at gmail.com> wrote:
>>>>>
>>>>>> On 09/14/2015 10:24 AM, Ross Jones wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I’ve recently been playing about with implementing parts of CKAN in
>>>>>>> Flask side-by-side with the current Pylons implementation. I’m doing it
>>>>>>> like this so that it isn’t immediately obvious that there’s a migration
>>>>>>> happening towards using Flask (aka nothing breaks).  I don’t think this
>>>>>>> branch should ever be merged, it’s more exploratory but it has raised some
>>>>>>> questions that I think it would be good to discuss.
>>>>>>>
>>>>>>> WARNING:anecdata
>>>>>>> It’s pretty clear that the vast majority of people asked would like
>>>>>>> to move to Flask as a replacement for some layers of the system (leaving
>>>>>>> things like logic and plugins alone).
>>>>>>> ENDWARNING
>>>>>>>
>>>>>>> We’ve discussed at the tech-team meetings, but I think a longer,
>>>>>>> more accessible conversation would be beneficial.
>>>>>>>
>>>>>>> 1. What version of CKAN should be targeted? Common sense suggests
>>>>>>> 3.0, but that being the case, exactly how far can we go in breaking some
>>>>>>> backward compatibility?  This isn’t really a technical question - would be
>>>>>>> good to hear what the community would accept …
>>>>>>>
>>>>>>> 2. Does it *really* need to be side-by-side?  Running Flask and
>>>>>>> Pylons side-by-side means staying on Python 2 for another few years
>>>>>>> (because Pylons).  A reasonably deep incision and removal of
>>>>>>> non-logic/non-plugin code would make a move to Py3 easier, but with some
>>>>>>> level of breakage in external plugins. Staying on 2 would mean a move to 3
>>>>>>> at a later date and more pain.
>>>>>>>
>>>>>>> 3. Would the CKAN Association like to fund someone to do some of
>>>>>>> this work? This is just one of several ideas mentioned on
>>>>>>> https://github.com/ckan/ideas-and-roadmap/issues/152 that really
>>>>>>> needs to be done if CKAN is going to thrive instead of just survive.
>>>>>>>
>>>>>>> Any feedback welcome…
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Ross.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ckan-dev mailing list
>>>>>>> ckan-dev at lists.okfn.org
>>>>>>> https://lists.okfn.org/mailman/listinfo/ckan-dev
>>>>>>> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>>>>>>>
>>>>>>
>>>>>> Hi Ross,
>>>>>>
>>>>>> I believe that a Flask port (or rewrite) is an excellent idea for
>>>>>> CKAN 3.0 in order to support Python 3.x.
>>>>>> The alternative would be to port Pylons to Python 3.x, which perhaps
>>>>>> is a more difficult task...
>>>>>>
>>>>>> Given that Python 2.x will EOL relatively soon, CKAN should move
>>>>>> forward.
>>>>>>
>>>>>> Just my 2 cents.
>>>>>>
>>>>>> Best,
>>>>>> Angelos
>>>>>>
>>>>>> --
>>>>>> Angelos Tzotsos, PhD
>>>>>> OSGeo Charter Member
>>>>>> http://users.ntua.gr/tzotsos
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> *STEVEN DE COSTA *|
>>> *EXECUTIVE DIRECTOR*www.linkdigital.com.au
>>>
>>>
>>
>
>
> --
> Karissa McKelvey
> http://karissa.github.io/ <http://karissamck.com>
>
>


-- 
Karissa McKelvey
http://karissa.github.io/