[ckan-dev] road map for horizontal scaling?

Thu Oct 13 23:29:31 UTC 2016

Attaching our high level architecture using RDS on AWS --- for UAT and PROD:

CloudFormation scripts for building out CKAN in a HA config can be found at
https://github.com/DataShades/ckan-aws-templates

OpsWorks version is here: https://github.com/DataShades/opswx-ckan-cookbook

Happy to collaborate on this and make it shine brighter :)

There are a few other relevant scripts under our datashades set of repos,
such as the ASG one here: https://github.com/DataShades/updateasg

And, the general cloud storage one here:
https://github.com/DataShades/ckanext-cloudstorage

And the S3 related one here:
https://github.com/DataShades/ckanext-s3filestore

We've also improved the SSO approach with Saml2:
https://github.com/DataShades/ckanext-saml2

And, begun some work for manipulating ACLs, which is important for private
dataset resources you'd want to switch to 'public' when published:
https://github.com/DataShades/ckanext-acl

Although not formally part of the CKAN roadmap I have a working model of
where I'd like CKAN to head when it comes to enterprise file/data storage
and access. If you are familiar with the concept of resource views then the
idea I'm keen to pursue is similar. It is a concept of resource containers
(not para-virtualization containers but storage or access point
containers). The idea is to make CKAN extendable via extensions of a type
that allow it to do more orchestration around how data is stored and made
usable below the discovery layer of the metadata.

The story would be something like:
As a platform operator, I need to be able to configure a variety of storage
and access endpoint possibilities, so that custodians can select where data
is placed based on type of data or business need.

Resource container extensions would then be built to accommodate things
like:

   1. Big data, transactional data feeds
   2. Semantic lakes
   3. Large file storage blobs
   4. Self declarative structured data (likely using data
   packaging/frictionless data)
   5. For cost auditing and accountability - storage into specified paid
   cloud accounts (different AWS, Azure, etc. accounts based on organisation)
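
To make the story above a little more concrete, here is a minimal Python
sketch of how an operator might register storage endpoints and how a
custodian's choice could be resolved to a location. Nothing like this exists
in CKAN today: `ResourceContainer`, `ContainerRegistry`, the container names
and the URLs are all hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceContainer:
    """Hypothetical description of one storage/access endpoint."""
    name: str               # label the custodian sees, e.g. "blob-store"
    base_url: str           # where data placed in this container lives
    data_types: frozenset   # kinds of data the operator allows here


@dataclass
class ContainerRegistry:
    """Hypothetical registry that a platform operator would configure."""
    containers: dict = field(default_factory=dict)

    def register(self, container):
        self.containers[container.name] = container

    def options_for(self, data_type):
        """Names of the containers a custodian may pick for this data type."""
        return sorted(
            c.name for c in self.containers.values() if data_type in c.data_types
        )

    def place(self, container_name, data_type, resource_id):
        """Resolve the custodian's choice to a storage location."""
        container = self.containers[container_name]
        if data_type not in container.data_types:
            raise ValueError(f"{container_name} does not accept {data_type}")
        return f"{container.base_url}/{resource_id}"


# Made-up configuration: one container per business need.
registry = ContainerRegistry()
registry.register(
    ResourceContainer("blob-store", "s3://example-org-a-bucket", frozenset({"large-file"}))
)
registry.register(
    ResourceContainer("tabular", "postgres://example-datastore", frozenset({"structured", "feed"}))
)
```

The point of the split is that the metadata layer only ever sees a container
name and a resolved location, while everything specific to a backend stays
inside the extension that provides the container.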

I would imagine that resource view and resource container extensions would
be paired in many cases, so that the view gives greater access to and
control of the data and makes it possible to query and extract insights
from it.

The European Data Portal has around 650k datasets. It is true that once a
CKAN portal gets to such a size then it can be a chore to do anything over
the entire set of data in quick time. However, with the entire catalog
readable via API there is a place for other tools to come into the picture
to provide meta analysis or broader views over all data in a portal.
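
As a rough sketch of what such an outside tool could look like, the Python
below pages through the `package_search` action of the CKAN API using its
`rows` and `start` parameters and then runs a toy analysis over the result.
The helper names (`http_get_json`, `iter_datasets`, `count_by_organisation`)
and the page size are my own assumptions for illustration, and the `fetch`
argument is only there so the paging logic can be exercised without a
network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_datasets(portal_url, page_size=100, fetch=http_get_json):
    """Yield every dataset dict from a portal, one page at a time."""
    start = 0
    while True:
        query = urlencode({"rows": page_size, "start": start})
        payload = fetch(f"{portal_url}/api/3/action/package_search?{query}")
        results = payload["result"]["results"]
        if not results:
            return  # an empty page means the catalog is exhausted
        yield from results
        start += len(results)


def count_by_organisation(datasets):
    """A toy 'meta analysis': number of datasets per owning organisation."""
    counts = {}
    for dataset in datasets:
        # "organization" can be missing or null for unowned datasets.
        org = (dataset.get("organization") or {}).get("name", "(none)")
        counts[org] = counts.get(org, 0) + 1
    return counts
```

Because the generator never holds more than one page in memory, the same
loop should behave the same way on a small portal and on a very large one;
only the run time changes.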

CKAN's structure allows for data ownership and custodianship to remain
flexible as the governing entities change over time. If we keep those
functions lightweight and build the more intensive data processing tasks
within a resource container layer then I think that is the big win :) I see
datastore and filestore as examples of resource containers. Datapusher is
an example of an ETL that works with datastore but similar tools and
concepts can be worked into the model and the open source goodness can grow
organically to meet lots of different organisational needs.

Where CKAN differs from other portal software, in my experience, is that it
can be used for open Government data, research data, private sector data
and 'data as knowledge' in virtually any situation. Other portal software
appears to be built around capturing a particular market opportunity to
generate data as knowledge for a particular customer segment - civic
hackers, jurisdictional bureaucrats, open data policy implementations, etc.

CKAN's harvesting is good, but certainly not perfect. The approach for
pushing from CKAN to elsewhere is likely to be used more in our future
work, or as we refactor the architecture of current implementations. See:
https://github.com/DataShades/ckanext-syndicate

By using multiple CKAN environments it is pretty easy to have catalogs of
'working data' that then push to the 'published data' catalog. We use this
approach for Government open data when from the bottom up you have agency
data collected into CKAN based information asset registers. Sometimes the
data doesn't even exist, but the data management plan can at least first be
registered prior to populating the dataset with resources. Once the data is
ready it can then be published and syndicated upward to a higher level
jurisdictional portal - such as a council, city, state or province.
Similarly such datasets can then be syndicated upward again into a national
or regional portal - perhaps with further ETL functions put in place to
combine the similarly structured data from multiple agencies into a master
dataset that presents a larger view of the entire data collection effort.
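
The push step in that flow boils down to "create the dataset upstream if it
is not there yet, otherwise update it, and never push unpublished data". The
Python sketch below illustrates only that idea and is not how
ckanext-syndicate is implemented. `package_show`, `package_create` and
`package_update` are real CKAN API actions, but `InMemoryCatalog`,
`NotFound` and `syndicate` are hypothetical stand-ins so the example runs
without a portal.

```python
class NotFound(Exception):
    """Raised by the stand-in client when a dataset does not exist."""


class InMemoryCatalog:
    """Stand-in for an upstream CKAN portal (hypothetical, no network)."""

    def __init__(self):
        self.datasets = {}

    def call_action(self, action, data):
        if action == "package_show":
            if data["id"] not in self.datasets:
                raise NotFound(data["id"])
            return self.datasets[data["id"]]
        if action in ("package_create", "package_update"):
            self.datasets[data["name"]] = dict(data)
            return self.datasets[data["name"]]
        raise ValueError(f"unsupported action: {action}")


def syndicate(dataset, target):
    """Push one published dataset upward, creating or updating as needed."""
    if dataset.get("private", False):
        return None  # working data stays in the working catalog
    try:
        target.call_action("package_show", {"id": dataset["name"]})
        action = "package_update"
    except NotFound:
        action = "package_create"
    return target.call_action(action, dataset)


agency = {"name": "road-closures", "title": "Road closures", "private": False}
state_portal = InMemoryCatalog()
syndicate(agency, state_portal)                                      # first push creates
syndicate({**agency, "title": "Road closures 2016"}, state_portal)   # later push updates
```

Chaining portals is then just a matter of running the same function again
with the state portal as the source and the national portal as the target.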

If the domain of data collection differs, such as in a field of research,
then the same architecture can still apply. Multiple research schools of
chemistry, for example, could publish working data locally then syndicate
upward into a global repository that allows for meta analysis of all
research outcomes over the entire domain's efforts. We're working on a
project in just this manner that is referenced here:
http://linkdigital.com.au/news/2016/09/building-mdbox-an-open-access-simulation-data-repository-on-ckan-and-aws

Lastly, published open data is the result of effort which is put into a
process of data collection and, usually, some analysis and clean up. The
tools used to process data, to prepare, collect or visualise are all part of
the value a dataset represents. To bridge data and code we've released a
very simple resource view for GitHub repositories that can be found here:
https://github.com/DataShades/ckanext-githubrepopreview

Open Government initiatives are formed around principles of transparency,
participation and collaboration. There is a desire to enable public-private
collaboration over the long term and there is a role for Government to act
as impresario to stimulate new markets and economic activity from
publishing open data (ref:
https://www.nesta.org.uk/sites/default/files/government_as_impresario.pdf).
The reason we built the GitHub resource view is to encourage open source
projects to emerge in connection to public datasets, via linking the
opportunity for discovery of helpful code with the discovery of helpful
datasets.

Sorry for the belated and long reply on this thread! I could have more
succinctly just said CKAN rocks, check out all the open source goodness
surrounding it and jump in :)

Cheers,
Steven

*STEVEN DE COSTA* | *EXECUTIVE DIRECTOR*
www.linkdigital.com.au

On 13 October 2016 at 13:48, Claire Herbert <Claire.Herbert at umanitoba.ca>
wrote:

> Sounds very interesting David. Would you be able to share a high level
> diagram of the architecture? It sounds very useful in potentially helping
> organizations like ours (University) plan a larger deployment.
>
>
> Claire
>
>
>
>
>
>
> ------------------------------
> *From:* Fawcett, David (MNIT) <David.Fawcett at state.mn.us>
> *Sent:* 06 October 2016 10:02
> *To:* CKAN Development Discussions
> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>
>
> RMX,
>
>
>
> We currently don’t store many, if any, of the datasets in the database.
> We put CKAN in front of an internal data distribution system, with our CKAN
> instance essentially becoming just another node on the system.  When a
> dataset is updated in the system, it gets pushed out to all designated
> nodes, and we run a script nightly to read dataset metadata and push
> new/updated records to CKAN via API.
>
>
>
> Here is an example dataset (we call them resources because they include
> web apps, desktops, and data):
>
> https://gisdata.mn.gov/dataset/env-buffer-protection-mn
> Buffer Protection Map, Minnesota - Resources - Minnesota Geospatial Commons
> These data represent public waters and public ditches that require
> permanent vegetation buffers or alternative riparian water quality
> practices. The buffer map data comprise two geographical...
> Read more... <https://gisdata.mn.gov/dataset/env-buffer-protection-mn>
>
>
>
> Most of the info on the page comes from the spatial metadata.  The
> overview text comes from the metadata Abstract element, the tags come from
> metadata key words, etc.
>
>
>
> Our state manages a lot of data in ESRI’s proprietary file geodatabase
> format, but to make the data accessible, we automatically generate
> shapefile and geopackage copies of the data and publish them as well.  This
> allows people to access the data without expensive licenses and proprietary
> software.
>
>
>
> In this example, you can also see that there is a link to view the full
> metadata record, and this resource has an associated Web map, so there is
> a button to go there too.
>
>
>
> The file-based datasets and metadata documents are not stored on the same
> server as our CKAN instance.  They are on a different FTP server.  E.g.
> ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_dnr/env_buffer_protection_mn/shp_env_buffer_protection_mn.zip
>
>
>
> David.
>
>
>
>
>
>
>
> *From:* ckan-dev [mailto:ckan-dev-bounces at lists.okfn.org] *On Behalf Of *Ruima
> E.
> *Sent:* Thursday, October 06, 2016 1:08 AM
> *To:* CKAN Development Discussions <ckan-dev at lists.okfn.org>
> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>
>
>
> Thank you David,
>
>
>
> That is very good to know.
>
> Do all those datasets fit on one machine?
>
> Are you using postgreSQL to store the datasets, or just the metadata?
>
>
>
> Best regards,
>
> RMX
>
>
>
> On Thu, Oct 6, 2016 at 3:07 AM, Fawcett, David (MNIT) <
> David.Fawcett at state.mn.us> wrote:
>
> RMX,
>
> Our US state is running CKAN on Postgres.  We currently have about 600
> datasets, and we are not anywhere close to being limited by the database.
>
> data.gov has about 190,000 datasets and performs fine.
>
> David.
> ------------------------------
>
> *From:* ckan-dev [ckan-dev-bounces at lists.okfn.org] on behalf of Ruima E. [
> ruimaximo at gmail.com]
> *Sent:* Wednesday, October 05, 2016 2:40 PM
> *To:* CKAN Development Discussions
> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>
> Thank you Tim!
>
> I am asking these questions because I am considering installing a CKAN as
> a data hub for a city. It seems a very promising idea but I am concerned
> that if tomorrow the number of datasets grows and we need it to be
> distributed across several machines, PostgreSQL might be a bottleneck
> and a headache.
>
> When I think about scale I have in mind the example of Hadoop. If tomorrow
> the datasets cannot fit one machine, just add one more node, edit a few
> text files and it works seamlessly. I am afraid that with PostgreSQL that is
> not the case, or am I wrong?
>
>
>
> Best regards,
>
> RMX
>
>
>
>
>
> On Wed, Oct 5, 2016 at 8:52 PM, Timothy Giles <timothy.giles at slu.se>
> wrote:
>
> Hi RMX.
>
>
>
> I wonder if you can give a concrete example of what you mean by scale?
> Since this is a dev forum/mailing list, I think it would be helpful to
> quantify your issue(s) / concern(s). There are instances of CKAN with
> hundreds of thousands and millions of datasets, as well as individual
> datasets being extremely large ('00s GBs).
>
>
>
> MvH Tim
>
>
>
>
>
>
> ------------------------------
>
> *From:* ckan-dev <ckan-dev-bounces at lists.okfn.org> on behalf of Ruima E. <
> ruimaximo at gmail.com>
> *Sent:* 05 October 2016 02:40 PM
> *To:* ckan-dev at lists.okfn.org
> *Subject:* [ckan-dev] road map for horizontal scaling?
>
>
>
> Hi,
>
>
>
> At the moment CKAN relies on PostgreSQL as a data store. I was shocked
> when I found that such a nice project relies on a data store that is not
> suitable to scale. Open data in smart cities is expected to be Big Data and
> to need to scale, which could jeopardize the success of the whole
> initiative in the near future.
>
>
>
> Is scaling by using open source technologies part of the road map for
> CKAN?
>
>
>
> Thank you,
>
> RMX
>
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/ckan-dev
> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Appendix_8_Updated_AWS Hosting Environment.pdf
Type: application/pdf
Size: 96249 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/ckan-dev/attachments/20161014/fc2b44bc/attachment-0003.pdf>