[ckan-dev] road map for horizontal scaling?

Steven De Costa steven.decosta at linkdigital.com.au
Thu Oct 20 02:12:54 UTC 2016


Re the GitHub extension: it is pretty basic right now, with no UI-based
config and such. I think you'd be able to fork it and tweak it for use on
your internal server without much hassle :)

Cheers,
Steven

*STEVEN DE COSTA* |
*EXECUTIVE DIRECTOR* www.linkdigital.com.au



On 19 October 2016 at 01:14, Claire Herbert <Claire.Herbert at umanitoba.ca>
wrote:

> Thanks very much for the detailed reply Steven! It helps clarify what kind
> of framework we can set up for ckan in our academic environment, where we
> serve users with working research datasets (private), shared datasets and
> publicly available datasets where we want better visualization, as well as
> some community based monitoring work.  Trying to describe the vision for
> ckan in the big picture to senior researchers and funders can be
> challenging. Plus this helps me visualize what ckan can perhaps do for us
> in the future so we can plan accordingly.
>
>
> Re: the GitHub repo. Would this work with a local server instance of
> GitLab?
>
>
> Cheers,
>
> Claire
>
>
> Claire Herbert
>
> Coordinator
>
> Lake Winnipeg Basin Information Network
>
>
>
> Centre For Earth Observation Science
>
> Department of Environment and Geography
>
> 522 Wallace Building
>
> University of Manitoba
>
> Winnipeg, Canada, R3T 3N2
>
> Phone: (204) 474-8657
>
>
> Follow us on twitter - @LWBIN
>
> Web: http://lwbi.cc.umanitoba.ca/
>
>
>
>
> ------------------------------
> *From:* Steven De Costa <steven.decosta at linkdigital.com.au>
> *Sent:* 13 October 2016 18:29
>
> *To:* CKAN Development Discussions
> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>
> Attaching our high level architecture using RDS on AWS --- for UAT and
> PROD:
>
> CloudFormation scripts for building out CKAN in a HA config can be found
> at https://github.com/DataShades/ckan-aws-templates
>
> OpsWorks version is here: https://github.com/DataShades/opswx-ckan-cookbook
>
> Happy to collaborate on this and make it shine brighter :)
>
> There are a few other relevant scripts under our datashades set of repos,
> such as the ASG one here: https://github.com/DataShades/updateasg
>
> And, the general cloud storage one here:
> https://github.com/DataShades/ckanext-cloudstorage
>
> And the S3 related one here:
> https://github.com/DataShades/ckanext-s3filestore
>
> We've also improved the SSO approach with Saml2:
> https://github.com/DataShades/ckanext-saml2
>
> And, begun some work for manipulating ACLs, which is important for private
> dataset resources you'd want to switch to 'public' when published:
> https://github.com/DataShades/ckanext-acl
>
> Although not formally part of the CKAN roadmap I have a working model of
> where I'd like CKAN to head when it comes to enterprise file/data storage
> and access. If you are familiar with the concept of resource views then the
> idea I'm keen to pursue is similar. It is a concept of resource containers
> (not para-virtualization containers but storage or access point
> containers). The idea is to make CKAN extendable via extensions of a type
> that allow it to do more orchestration around how data is stored and made
> usable below the discovery layer of the metadata.
>
> The story would be something like:
> As a platform operator, I need to be able to configure a variety of
> storage and access endpoint possibilities, so that custodians can select
> where data is placed based on type of data or business need.
>
> Resource container extensions would then be built to accommodate things
> like:
>
>    1. Big data, transnational data feeds
>    2. Semantic lakes
>    3. Large file storage blobs
>    4. Self declarative structured data (likely using data
>    packaging/frictionless data)
>    5. For cost auditing and accountability - storage into specified paid
>    cloud accounts (different AWS, Azure, etc. accounts based on organisation)
>
>
> I would imagine that resource view and resource container extensions would
> be paired in many cases, so that the view provides greater access to and
> control of the data, along with the ability to query it and extract
> insights.
>
> The European Data Portal has around 650k datasets. It is true that once a
> CKAN portal gets to such a size then it can be a chore to do anything over
> the entire set of data in quick time. However, with the entire catalog
> readable via API there is a place for other tools to come into the picture
> to provide meta analysis or broader views over all data in a portal.
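
As a rough illustration of reading a whole catalog via the API, here is a minimal sketch that pages through CKAN's `package_search` action using its `rows` and `start` parameters. The helper names (`fetch_json`, `iter_datasets`), the page size and the base URL are assumptions made for the example, not anything described in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body (plain stdlib, no auth)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_datasets(base_url, page_size=100, fetch=fetch_json):
    """Yield every dataset dict from a CKAN portal, one page at a time."""
    start = 0
    while True:
        query = urlencode({"rows": page_size, "start": start})
        payload = fetch(f"{base_url}/api/3/action/package_search?{query}")
        results = payload["result"]["results"]
        if not results:
            break
        yield from results
        start += len(results)
        # "count" is the total number of matching datasets on the portal.
        if start >= payload["result"]["count"]:
            break
```

An external tool could loop over `iter_datasets("https://your-portal.example")` and build its own index or summary statistics without touching the portal's database directly. The `fetch` argument is only there so the paging logic can be exercised without a network connection.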
>
> CKAN's structure allows for data ownership and custodianship to remain
> flexible as the governing entities change over time. If we keep those
> functions lightweight and build the more intensive data processing tasks
> within a resource container layer then I think that is the big win :) I see
> datastore and filestore as examples of resource containers. Datapusher is
> an example of an ETL that works with datastore but similar tools and
> concepts can be worked into the model and the open source goodness can grow
> organically to meet lots of different organisational needs.
>
> Where CKAN differs from other portal software, in my experience, is that
> it can be used for open Government data, research data, private sector data
> and 'data as knowledge' in virtually any situation. Other portal software
> appears to be built around capturing a particular market opportunity to
> generate data as knowledge for a particular customer segment - civic
> hackers, jurisdictional bureaucrats, open data policy implementations, etc.
>
> CKAN's harvesting is good, but certainly not perfect. The approach for
> pushing from CKAN to elsewhere is likely to be used more in our future
> work, or as we refactor the architecture of current implementations. See:
> https://github.com/DataShades/ckanext-syndicate
>
> By using multiple CKAN environments it is pretty easy to have catalogs of
> 'working data' that then push to the 'published data' catalog. We use this
> approach for Government open data when from the bottom up you have agency
> data collected into CKAN based information asset registers. Sometimes the
> data doesn't even exist, but the data management plan can at least first be
> registered prior to populating the dataset with resources. Once the data is
> ready it can then be published and syndicated upward to a higher level
> jurisdictional portal - such as a council, city, state or province.
> Similarly such datasets can then be syndicated upward again into a national
> or regional portal - perhaps with further ETL functions put in place to
> combine the similarly structured data from multiple agencies into a master
> dataset that presents a larger view of the entire data collection effort.
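
To make the "working catalog pushes to published catalog" idea concrete, here is a small sketch of one step such a push might need: turning a dataset dict, as returned by `package_show` on the working instance, into a payload that could be sent to `package_create` on the target instance. This is not how ckanext-syndicate is implemented; the function name and the choice of fields to drop are assumptions for illustration only.

```python
# Fields that identify the dataset on the source instance and should not be
# copied verbatim to another CKAN instance (an assumed, non-exhaustive list).
DATASET_LOCAL_FIELDS = {
    "id", "revision_id", "metadata_created", "metadata_modified", "organization",
}
RESOURCE_LOCAL_FIELDS = {"id", "package_id", "revision_id"}


def prepare_for_syndication(dataset, target_org):
    """Return a copy of a package_show dict without source-instance identifiers."""
    payload = {k: v for k, v in dataset.items() if k not in DATASET_LOCAL_FIELDS}
    # The dataset is owned by an organization that exists on the target portal.
    payload["owner_org"] = target_org
    payload["resources"] = [
        {k: v for k, v in resource.items() if k not in RESOURCE_LOCAL_FIELDS}
        for resource in dataset.get("resources", [])
    ]
    return payload
```

The input dict is left unmodified, so the same working-catalog record could be prepared for more than one upstream portal.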
>
> If the domain of data collection differs, such as in a field of research,
> then the same architecture can still apply. Multiple research schools of
> chemistry, for example, could publish working data locally then syndicate
> upward into a global repository that allows for meta analysis of all
> research outcomes over the entire domain's efforts. We're working on a
> project in just this manner that is referenced here:
> http://linkdigital.com.au/news/2016/09/building-mdbox-an-open-access-simulation-data-repository-on-ckan-and-aws
>
> Lastly, published open data is the result of effort which is put into a
> process of data collection and, usually, some analysis and clean up. The
> tools used to process data, to prepare, collect or visualise are all part of
> the value a dataset represents. To bridge data and code we've released a
> very simple resource view for GitHub repositories that can be found here:
> https://github.com/DataShades/ckanext-githubrepopreview
>
> Open Government initiatives are formed around principles of transparency,
> participation and collaboration. There is a desire to enable public-private
> collaboration over the long term and there is a role for Government to act
> as impresario to stimulate new markets and economic activity from
> publishing open data (ref:
> https://www.nesta.org.uk/sites/default/files/government_as_impresario.pdf).
> The reason we built the GitHub resource
> view is to encourage open source projects to emerge in connection to public
> datasets, via linking the opportunity for discovery of helpful code with
> the discovery of helpful datasets.
>
> Sorry for the belated and long reply on this thread! I could have more
> succinctly just said CKAN rocks, check out all the open source goodness
> surrounding it and jump in :)
>
> Cheers,
> Steven
>
> *STEVEN DE COSTA* |
> *EXECUTIVE DIRECTOR* www.linkdigital.com.au
>
>
>
> On 13 October 2016 at 13:48, Claire Herbert <Claire.Herbert at umanitoba.ca>
> wrote:
>
>> Sounds very interesting David. Would you be able to share a high level
>> diagram of the architecture? It sounds very useful in potentially helping
>> organizations like ours (University) plan a larger deployment.
>>
>>
>> Claire
>>
>>
>>
>>
>>
>>
>> ------------------------------
>> *From:* Fawcett, David (MNIT) <David.Fawcett at state.mn.us>
>> *Sent:* 06 October 2016 10:02
>> *To:* CKAN Development Discussions
>> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>>
>>
>> RMX,
>>
>>
>>
>> We currently don’t store many, if any, of the datasets in the database.
>> We put CKAN in front of an internal data distribution system, with our CKAN
>> instance essentially becoming just another node on the system.  When a
>> dataset is updated in the system, it gets pushed out to all designated
>> nodes, and we run a script nightly to read dataset metadata and push
>> new/updated records to CKAN via API.
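
A nightly push like the one described could look roughly like the sketch below. It uses CKAN's `package_show`, `package_create` and `package_update` actions over plain HTTP; the helper names, the way the API key is passed around and the shape of `record` are assumptions for illustration, not the actual Minnesota script.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def call_action(base_url, api_key, action, payload):
    """POST a JSON payload to one CKAN action and return its 'result'."""
    request = Request(
        f"{base_url}/api/3/action/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)["result"]


def upsert_dataset(record, call):
    """Create the dataset if CKAN does not know its name, else update it.

    `call` is any callable taking (action, payload), for example
    functools.partial(call_action, base_url, api_key).
    """
    try:
        call("package_show", {"id": record["name"]})
    except HTTPError as error:
        if error.code != 404:  # anything other than "not found" is a real failure
            raise
        return call("package_create", record)
    return call("package_update", record)
```

Note that `package_update` replaces the whole dataset, so `record` should carry the full metadata each night; `package_patch` may be a better fit if only a few fields change.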
>>
>>
>>
>> Here is an example dataset (we call them resources because they include
>> web apps, desktops, and data):
>>
>> https://gisdata.mn.gov/dataset/env-buffer-protection-mn
>> Buffer Protection Map, Minnesota - Resources - Minnesota Geospatial
>> Commons
>> These data represent public waters and public ditches that require
>> permanent vegetation buffers or alternative riparian water quality
>> practices. The buffer map data comprise two geographical...
>>
>>
>>
>> Most of the info on the page comes from the spatial metadata.  The
>> overview text comes from the metadata Abstract element, the tags come from
>> metadata key words, etc.
>>
>>
>>
>> Our state manages a lot of data in ESRI’s proprietary file geodatabase
>> format, but to make the data accessible, we automatically generate
>> shapefile and geopackage copies of the data and publish them as well.  This
>> allows people to access the data without expensive licenses and proprietary
>> software.
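
The thread does not say which tool generates the shapefile and GeoPackage copies. As one possible approach, GDAL's `ogr2ogr` command-line tool can read file geodatabases and write both formats. The sketch below only assembles and runs those commands; the function names, paths and layer name are hypothetical.

```python
import subprocess
from pathlib import Path


def export_commands(gdb_path, layer, out_dir):
    """Build ogr2ogr commands that copy one layer to GeoPackage and shapefile."""
    out = Path(out_dir)
    return [
        ["ogr2ogr", "-f", "GPKG", str(out / f"{layer}.gpkg"), str(gdb_path), layer],
        ["ogr2ogr", "-f", "ESRI Shapefile", str(out / f"{layer}.shp"), str(gdb_path), layer],
    ]


def export_layer(gdb_path, layer, out_dir):
    """Run the exports; raises CalledProcessError if ogr2ogr fails."""
    for command in export_commands(gdb_path, layer, out_dir):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it easy to log or dry-run the conversions before wiring them into an automated publishing job.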
>>
>>
>>
>> In this example, you can also see that there is a link to view the full
>> metadata record, and this resource has an associated Web map, so there is
>> a button to go there too.
>>
>>
>>
>> The file-based datasets and metadata documents are not stored on the same
>> server as our CKAN instance.  They are on a different FTP server.  E.g.
>> ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_dnr/env_buffer_protection_mn/shp_env_buffer_protection_mn.zip
>>
>>
>>
>> David.
>>
>>
>>
>>
>>
>>
>>
>> *From:* ckan-dev [mailto:ckan-dev-bounces at lists.okfn.org] *On Behalf Of *Ruima
>> E.
>> *Sent:* Thursday, October 06, 2016 1:08 AM
>> *To:* CKAN Development Discussions <ckan-dev at lists.okfn.org>
>> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>>
>>
>>
>> Thank you David,
>>
>>
>>
>> That is very good to know.
>>
>> Do all those datasets fit on one machine?
>>
>> Are you using PostgreSQL to store the datasets, or just the metadata?
>>
>>
>>
>> Best regards,
>>
>> RMX
>>
>>
>>
>> On Thu, Oct 6, 2016 at 3:07 AM, Fawcett, David (MNIT) <
>> David.Fawcett at state.mn.us> wrote:
>>
>> RMX,
>>
>> Our US state is running CKAN on Postgres.  We currently have about 600
>> datasets, and we are not anywhere close to being limited by the database.
>>
>> data.gov has about 190,000 datasets and performs fine.
>>
>> David.
>> ------------------------------
>>
>> *From:* ckan-dev [ckan-dev-bounces at lists.okfn.org] on behalf of Ruima E.
>> [ruimaximo at gmail.com]
>> *Sent:* Wednesday, October 05, 2016 2:40 PM
>> *To:* CKAN Development Discussions
>> *Subject:* Re: [ckan-dev] road map for horizontal scaling?
>>
>> Thank you Tim!
>>
>> I am asking these questions because I am considering installing CKAN as
>> a data hub for a city. It seems a very promising idea, but I am concerned
>> that if the number of datasets grows and we need to distribute it across
>> several machines, PostgreSQL might become a bottleneck and a headache.
>>
>> When I think about scale I have in mind the example of Hadoop. If
>> tomorrow the datasets cannot fit on one machine, you just add one more
>> node, edit a few text files and it works seamlessly. I am afraid that with
>> PostgreSQL that is not the case, or am I wrong?
>>
>>
>>
>> Best regards,
>>
>> RMX
>>
>>
>>
>>
>>
>> On Wed, Oct 5, 2016 at 8:52 PM, Timothy Giles <timothy.giles at slu.se>
>> wrote:
>>
>> Hi RMX.
>>
>>
>>
>> I wonder if you can give a concrete example of what you mean by scale?
>> Since this is a dev forum/mailing list, I think it would be helpful to
>> quantify your issue(s) / concern(s). There are instances of CKAN with
>> hundreds of thousands and millions of datasets, as well as individual
>> datasets that are extremely large (hundreds of GB).
>>
>>
>>
>> MvH Tim
>>
>>
>>
>>
>>
>>
>> ------------------------------
>>
>> *From:* ckan-dev <ckan-dev-bounces at lists.okfn.org> on behalf of Ruima E.
>> <ruimaximo at gmail.com>
>> *Sent:* 05 October 2016 02:40 PM
>> *To:* ckan-dev at lists.okfn.org
>> *Subject:* [ckan-dev] road map for horizontal scaling?
>>
>>
>>
>> Hi,
>>
>>
>>
>> At the moment CKAN relies on PostgreSQL as a data store. I was shocked
>> when I found that such a nice project relies on a data store that is not
>> suitable to scale. Open data in smart cities is expected to be Big Data and
>> is expected to scale, so this could jeopardize the success of the whole
>> initiative in the near future.
>>
>>
>>
>> Is scaling by using open source technologies part of the road map for
>> CKAN?
>>
>>
>>
>> Thank you,
>>
>> RMX
>>
>>
>> _______________________________________________
>> ckan-dev mailing list
>> ckan-dev at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/ckan-dev
>> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>