[ckan-dev] road map for horizontal scaling?

Tue Oct 18 14:14:56 UTC 2016

Thanks very much for the detailed reply Steven! It helps clarify what kind of framework we can set up for CKAN in our academic environment, where we serve users with working research datasets (private), shared datasets, and publicly available datasets for which we want better visualization, as well as some community-based monitoring work. Describing the big-picture vision for CKAN to senior researchers and funders can be challenging. This also helps me visualize what CKAN might do for us in the future, so we can plan accordingly.

Re: the GitHub repo. Would this work with a local server instance of GitLab?

Cheers,

Claire

Claire Herbert

Coordinator

Lake Winnipeg Basin Information Network

Centre For Earth Observation Science

Department of Environment and Geography

522 Wallace Building

University of Manitoba

Winnipeg, Canada, R3T 3N2

Phone: (204) 474-8657

Follow us on twitter - @LWBIN

Web: http://lwbi.cc.umanitoba.ca/

________________________________
From: Steven De Costa <steven.decosta at linkdigital.com.au>
Sent: 13 October 2016 18:29
To: CKAN Development Discussions
Subject: Re: [ckan-dev] road map for horizontal scaling?

Attaching our high level architecture using RDS on AWS --- for UAT and PROD:

CloudFormation scripts for building out CKAN in a HA config can be found at https://github.com/DataShades/ckan-aws-templates

OpsWorks version is here: https://github.com/DataShades/opswx-ckan-cookbook

Happy to collaborate on this and make it shine brighter :)

There are a few other relevant scripts under our datashades set of repos, such as the ASG one here: https://github.com/DataShades/updateasg

And, the general cloud storage one here: https://github.com/DataShades/ckanext-cloudstorage

And the S3 related one here: https://github.com/DataShades/ckanext-s3filestore

We've also improved the SSO approach with Saml2: https://github.com/DataShades/ckanext-saml2

And, begun some work for manipulating ACLs, which is important for private dataset resources you'd want to switch to 'public' when published: https://github.com/DataShades/ckanext-acl

Although not formally part of the CKAN roadmap I have a working model of where I'd like CKAN to head when it comes to enterprise file/data storage and access. If you are familiar with the concept of resource views then the idea I'm keen to pursue is similar. It is a concept of resource containers (not para-virtualization containers but storage or access point containers). The idea is to make CKAN extendable via extensions of a type that allow it to do more orchestration around how data is stored and made usable below the discovery layer of the metadata.

The story would be something like:
As a platform operator, I need to be able to configure a variety of storage and access endpoint possibilities, so that custodians can select where data is placed based on type of data or business need.

Resource container extensions would then be built to accommodate things like:

  1.  Big data, transnational data feeds
  2.  Semantic lakes
  3.  Large file storage blobs
  4.  Self declarative structured data (likely using data packaging/frictionless data)
  5.  For cost auditing and accountability - storage into specified paid cloud accounts (different AWS, Azure, etc. accounts based on organisation)
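To make the resource container idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: CKAN has no such interface, and every name in it (`ResourceContainer`, `LargeFileContainer`, `DefaultFileContainer`, `choose_container`, the bucket name and the local path) is hypothetical.

```python
from abc import ABC, abstractmethod


class ResourceContainer(ABC):
    """Hypothetical interface: one storage or access endpoint for resources."""

    name = "base"

    @abstractmethod
    def accepts(self, resource):
        """Return True if this container should hold the given resource."""

    @abstractmethod
    def location_for(self, resource):
        """Return the URI where the resource would be stored."""


class LargeFileContainer(ResourceContainer):
    """Sends resources above a size threshold to an object-store bucket."""

    name = "large-file"

    def __init__(self, bucket, min_bytes):
        self.bucket = bucket
        self.min_bytes = min_bytes

    def accepts(self, resource):
        # A missing or empty size is treated as zero bytes.
        return (resource.get("size") or 0) >= self.min_bytes

    def location_for(self, resource):
        return f"s3://{self.bucket}/{resource['id']}"


class DefaultFileContainer(ResourceContainer):
    """Fallback container that accepts everything (path is an assumption)."""

    name = "filestore"

    def accepts(self, resource):
        return True

    def location_for(self, resource):
        return f"file:///var/lib/ckan/resources/{resource['id']}"


def choose_container(resource, containers):
    """Return the first configured container that accepts the resource."""
    for container in containers:
        if container.accepts(resource):
            return container
    raise LookupError("no container accepts this resource")
```

The platform operator would configure the ordered list of containers, and the first match decides where a custodian's data lands. Other selection rules (by organisation, by data type, by billing account) would fit the same shape.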

I would imagine that resource view and resource container extensions would be paired in many cases, so that the view gives greater access to and control of the data, including the ability to query it and extract insights.

The European Data Portal has around 650k datasets. It is true that once a CKAN portal gets to such a size then it can be a chore to do anything over the entire set of data in quick time. However, with the entire catalog readable via API there is a place for other tools to come into the picture to provide meta analysis or broader views over all data in a portal.
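As an illustration of reading a whole catalog via the API, here is a minimal Python sketch that pages through the `package_search` action. The page size and the injectable `fetch` parameter are assumptions made so the paging logic can be exercised without a live portal.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, start, rows):
    """Fetch one page of package_search results from a CKAN site."""
    query = urlencode({"q": "*:*", "start": start, "rows": rows})
    with urlopen(f"{base_url}/api/3/action/package_search?{query}") as resp:
        return json.load(resp)["result"]


def iter_datasets(base_url, rows=100, fetch=fetch_page):
    """Yield every dataset dict in the catalog, one page at a time."""
    start = 0
    while True:
        result = fetch(base_url, start, rows)
        datasets = result["results"]
        if not datasets:
            break
        yield from datasets
        start += len(datasets)
        # Stop once the reported total has been consumed.
        if start >= result["count"]:
            break
```

An external tool could loop over `iter_datasets("https://portal.example.org")` (a placeholder URL) to build its own index or run a meta analysis, without touching the portal's database.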

CKAN's structure allows for data ownership and custodianship to remain flexible as the governing entities change over time. If we keep those functions lightweight and build the more intensive data processing tasks within a resource container layer then I think that is the big win :) I see datastore and filestore as examples of resource containers. Datapusher is an example of an ETL that works with datastore but similar tools and concepts can be worked into the model and the open source goodness can grow organically to meet lots of different organisational needs.

Where CKAN differs from other portal software, in my experience, is that it can be used for open Government data, research data, private sector data and 'data as knowledge' in virtually any situation. Other portal software appears to be built around capturing a particular market opportunity to generate data as knowledge for a particular customer segment - civic hackers, jurisdictional bureaucrats, open data policy implementations, etc.

CKAN's harvesting is good, but certainly not perfect. The approach for pushing from CKAN to elsewhere is likely to be used more in our future work, or as we refactor the architecture of current implementations. See: https://github.com/DataShades/ckanext-syndicate

By using multiple CKAN environments it is pretty easy to have catalogs of 'working data' that then push to the 'published data' catalog. We use this approach for Government open data when from the bottom up you have agency data collected into CKAN based information asset registers. Sometimes the data doesn't even exist, but the data management plan can at least first be registered prior to populating the dataset with resources. Once the data is ready it can then be published and syndicated upward to a higher level jurisdictional portal - such as a council, city, state or province. Similarly such datasets can then be syndicated upward again into a national or regional portal - perhaps with further ETL functions put in place to combine the similarly structured data from multiple agencies into a master dataset that presents a larger view of the entire data collection effort.
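The upward push can be pictured with a small Python sketch that turns a dataset dict from a working catalog into a payload for the next portal up. This is not how ckanext-syndicate is implemented; the function name, the list of instance-specific fields and the organisation name are all assumptions for illustration.

```python
import copy

# Fields assumed to be specific to the source instance; the exact set
# depends on the CKAN version and should be checked before use.
INSTANCE_FIELDS = (
    "id",
    "revision_id",
    "metadata_created",
    "metadata_modified",
    "organization",
    "owner_org",
)
RESOURCE_FIELDS = ("id", "package_id", "revision_id")


def prepare_for_upstream(dataset, upstream_org):
    """Return a copy of the dataset ready to be created on another portal."""
    payload = copy.deepcopy(dataset)
    for field in INSTANCE_FIELDS:
        payload.pop(field, None)
    # The upstream portal assigns ownership and publishes the dataset.
    payload["owner_org"] = upstream_org
    payload["private"] = False
    for resource in payload.get("resources", []):
        for field in RESOURCE_FIELDS:
            resource.pop(field, None)
    return payload
```

The resulting dict could then be sent to the higher-level portal's action API. Any ETL step that merges similarly structured data from several agencies would sit between this step and the final publish.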

If the domain of data collection differs, such as in a field of research, then the same architecture can still apply. Multiple research schools of chemistry, for example, could publish working data locally then syndicate upward into a global repository that allows for meta analysis of all research outcomes over the entire domain's efforts. We're working on a project in just this manner that is referenced here: http://linkdigital.com.au/news/2016/09/building-mdbox-an-open-access-simulation-data-repository-on-ckan-and-aws

Lastly, published open data is the result of effort which is put into a process of data collection and, usually, some analysis and clean up. The tools used to process data, to prepare, collect or visualise are all part of the value a dataset represents. To bridge data and code we've released a very simple resource view for GitHub repositories that can be found here: https://github.com/DataShades/ckanext-githubrepopreview

Open Government initiatives are formed around principles of transparency, participation and collaboration. There is a desire to enable public-private collaboration over the long term and there is a role for Government to act as impresario to stimulate new markets and economic activity from publishing open data (ref: https://www.nesta.org.uk/sites/default/files/government_as_impresario.pdf). The reason we built the GitHub resource view is to encourage open source projects to emerge in connection to public datasets, via linking the opportunity for discovery of helpful code with the discovery of helpful datasets.

Sorry for the belated and long reply on this thread! I could have more succinctly just said CKAN rocks, check out all the open source goodness surrounding it and jump in :)

Cheers,
Steven

STEVEN DE COSTA | EXECUTIVE DIRECTOR
www.linkdigital.com.au


On 13 October 2016 at 13:48, Claire Herbert <Claire.Herbert at umanitoba.ca> wrote:

Sounds very interesting David. Would you be able to share a high level diagram of the architecture? It sounds very useful in potentially helping organizations like ours (a university) plan a larger deployment.

Claire

________________________________
From: Fawcett, David (MNIT) <David.Fawcett at state.mn.us>
Sent: 06 October 2016 10:02
To: CKAN Development Discussions
Subject: Re: [ckan-dev] road map for horizontal scaling?

RMX,

We currently don’t store many, if any, of the datasets in the database.  We put CKAN in front of an internal data distribution system, with our CKAN instance essentially becoming just another node on the system.  When a dataset is updated in the system, it gets pushed out to all designated nodes, and we run a script nightly to read dataset metadata and push new/updated records to CKAN via API.
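A nightly push of this kind can be sketched in Python as follows. This is not the actual script described above, only a minimal illustration assuming each metadata record is already a complete CKAN dataset dict with a `name`, and that the set of names already in CKAN is known. The helper names `plan_sync` and `build_request` are made up for the example; `package_create` and `package_update` are the standard CKAN action API calls.

```python
import json
from urllib.request import Request


def plan_sync(records, existing_names):
    """Pair each metadata record with the CKAN action it needs."""
    plan = []
    for record in records:
        if record["name"] in existing_names:
            action = "package_update"
        else:
            action = "package_create"
        plan.append((action, record))
    return plan


def build_request(base_url, api_key, action, record):
    """Build (but do not send) the POST request for one action API call."""
    return Request(
        f"{base_url}/api/3/action/{action}",
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )
```

Each request would be sent with `urllib.request.urlopen`. Note that `package_update` replaces the whole dataset, so the record should carry every field that needs to survive the update.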

Here is an example dataset (we call them resources because they include web apps, desktops, and data):

https://gisdata.mn.gov/dataset/env-buffer-protection-mn

Buffer Protection Map, Minnesota - Resources - Minnesota Geospatial Commons
These data represent public waters and public ditches that require permanent vegetation buffers or alternative riparian water quality practices. The buffer map data comprise two geographical...

Most of the info on the page comes from the spatial metadata.  The overview text comes from the metadata Abstract element, the tags come from metadata key words, etc.

Our state manages a lot of data in ESRI’s proprietary file geodatabase format, but to make the data accessible, we automatically generate shapefile and geopackage copies of the data and publish them as well.  This allows people to access the data without expensive licenses and proprietary software.
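One common way to produce such copies is GDAL's `ogr2ogr` command-line tool; whether that is what is used here is an assumption. The Python sketch below only builds the two command lines for a single layer, and the function name, paths and layer name are placeholders.

```python
from pathlib import PurePosixPath


def conversion_commands(gdb_path, layer, out_dir):
    """Build ogr2ogr commands exporting one layer as shapefile and GeoPackage."""
    out = PurePosixPath(out_dir)
    return [
        # ogr2ogr -f <format> <destination> <source> <layer>
        ["ogr2ogr", "-f", "ESRI Shapefile", str(out / f"{layer}.shp"), gdb_path, layer],
        ["ogr2ogr", "-f", "GPKG", str(out / f"{layer}.gpkg"), gdb_path, layer],
    ]
```

Each command could be run with `subprocess.run(cmd, check=True)` on a machine where GDAL is installed, and the outputs zipped and published next to the original.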

In this example, you can also see that there is a link to view the full metadata record, and this resource has an associated Web map, so there is a button to go there too.

The file-based datasets and metadata documents are not stored on the same server as our CKAN instance.  They are on a different FTP server.  E.g. ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_dnr/env_buffer_protection_mn/shp_env_buffer_protection_mn.zip

David.

From: ckan-dev [mailto:ckan-dev-bounces at lists.okfn.org] On Behalf Of Ruima E.
Sent: Thursday, October 06, 2016 1:08 AM
To: CKAN Development Discussions <ckan-dev at lists.okfn.org>
Subject: Re: [ckan-dev] road map for horizontal scaling?

Thank you David,

That is very good to know.

Do all those datasets fit on one machine?

Are you using PostgreSQL to store the datasets, or just the metadata?

Best regards,

RMX

On Thu, Oct 6, 2016 at 3:07 AM, Fawcett, David (MNIT) <David.Fawcett at state.mn.us> wrote:

RMX,

Our US state is running CKAN on Postgres.  We currently have about 600 datasets, and we are not anywhere close to being limited by the database.

data.gov has about 190,000 datasets and performs fine.

David.

________________________________

From: ckan-dev [ckan-dev-bounces at lists.okfn.org] on behalf of Ruima E. [ruimaximo at gmail.com]
Sent: Wednesday, October 05, 2016 2:40 PM
To: CKAN Development Discussions
Subject: Re: [ckan-dev] road map for horizontal scaling?

Thank you Tim!

I am asking these questions because I am considering installing CKAN as a data hub for a city. It seems a very promising idea, but I am concerned that if the number of datasets grows and we need to distribute it across several machines, PostgreSQL might be a bottleneck and a headache.

When I think about scale I have in mind the example of Hadoop. If tomorrow the datasets cannot fit on one machine, you just add one more node, edit a few text files, and it works seamlessly. I am afraid that with PostgreSQL that is not the case, or am I wrong?

Best regards,

RMX

On Wed, Oct 5, 2016 at 8:52 PM, Timothy Giles <timothy.giles at slu.se> wrote:

Hi RMX.

I wonder if you can give a concrete example of what you mean by scale? Since this is a dev forum/mailing list, I think it would be helpful to quantify your issue(s) / concern(s). There are instances of CKAN with hundreds of thousands and millions of datasets, as well as individual datasets being extremely large (100s of GBs).

MvH Tim

________________________________

From: ckan-dev <ckan-dev-bounces at lists.okfn.org> on behalf of Ruima E. <ruimaximo at gmail.com>
Sent: 05 October 2016 02:40 PM
To: ckan-dev at lists.okfn.org
Subject: [ckan-dev] road map for horizontal scaling?

Hi,

At the moment CKAN relies on PostgreSQL as a data store. I was shocked when I found that such a nice project relies on a data store that is not suitable to scale. Open data in smart cities is expected to be Big Data and is expected to scale, so a data store that cannot scale could jeopardize the success of the whole initiative in the near future.

Is scaling by using open source technologies part of the road map for CKAN?

Thank you,

RMX

_______________________________________________
ckan-dev mailing list
ckan-dev at lists.okfn.org
https://lists.okfn.org/mailman/listinfo/ckan-dev
Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev

