[ckan-dev] Search inside data files

John Jediny - XAAB john.jediny at gsa.gov
Fri Mar 11 12:58:13 UTC 2016


Missed that this uses Apache Tika; it looks like a great multi-format parser.
Agree with David that sharing one Postgres instance between the fulltext and
CKAN in production might be a bottleneck. But overall it looks like the best
extension for the job. It might be worth looking into adding a separate,
dedicated NoSQL/blob store to hold the fulltext output.
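
To make the idea concrete, here is a rough sketch of the flow (cached file,
parser, separate store, search) under some loud assumptions: sqlite3 from the
Python standard library stands in for whatever dedicated store gets picked,
and extract_text is a hypothetical placeholder for the Tika step that only
understands CSV. None of these function names come from ckanext-fulltext or
CKAN itself.

```python
import csv
import io
import sqlite3


def extract_text(raw: bytes) -> str:
    """Hypothetical stand-in for a multi-format parser such as Apache Tika.

    This sketch only handles UTF-8 CSV: every cell is flattened into one string.
    """
    rows = csv.reader(io.StringIO(raw.decode("utf-8")))
    return " ".join(cell for row in rows for cell in row)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a fulltext store kept apart from CKAN's own database (assumption)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fulltext ("
        "resource_id TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def index_resource(conn: sqlite3.Connection, resource_id: str, raw: bytes) -> None:
    """Extract text from a cached resource file and store it by resource id."""
    conn.execute(
        "INSERT OR REPLACE INTO fulltext VALUES (?, ?)",
        (resource_id, extract_text(raw)),
    )
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> list:
    """Return ids of resources whose extracted text contains the term.

    Uses a plain LIKE match, which is case-insensitive for ASCII in SQLite;
    % and _ in the term are treated as wildcards.
    """
    cur = conn.execute(
        "SELECT resource_id FROM fulltext WHERE body LIKE ? ORDER BY resource_id",
        ("%" + term + "%",),
    )
    return [row[0] for row in cur]
```

In a real deployment the LIKE query would presumably give way to a proper
search index (Solr, Elasticsearch, etc.); the point is only that the
extracted text lives in its own store, keyed by resource id.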

On Fri, Mar 11, 2016 at 6:40 AM, David Read <david.read at hackneyworkshop.com> wrote:

> It sounds like Matt's suggestion is the closest:
> https://github.com/transparenzportalhamburg/ckanext-fulltext
> I have reservations about storing the extracted text in postgres - I
> don't see that scaling well for large sites. But we might well give it
> a try.
>
> Dave
>
> On 11 March 2016 at 00:47, John Jediny - XAAB <john.jediny at gsa.gov> wrote:
> > Stab in the dark, but it sounds like you'd have to use ckan archiver to
> > cache a copy, use reclinedb or another multiformat parser, and have that
> > spit out to Elasticsearch (or Solr...) to expose those as search terms...
> > Has anyone done something like this... open source, of course?
> >
> > ...Proprietary plug-in vendors, please focus on a different community.
> >
> > On Mar 10, 2016 5:10 PM, "Natalia Queiroz" <queiroz.nati at gmail.com> wrote:
> >>
> >> Anyone?
> >>
> >> On Wed, Mar 9, 2016 at 6:27 PM, Natalia Queiroz <queiroz.nati at gmail.com> wrote:
> >>>
> >>> So, is CKAN able to search only metadata values?
> >>>
> >>> How is it possible to search values inside a CSV file, for example?
> >>>
> >>> My search doesn't return this kind of data, just the metadata. I'm
> >>> using DataStore and DataPusher.
> >>>
> >>> Any idea?
> >>>
> >>> On Tue, Mar 8, 2016 at 5:04 AM, Matthew Fullerton
> >>> <matt.fullerton at gmail.com> wrote:
> >>>>
> >>>> There is also the somewhat simpler approach from Hamburg:
> >>>>
> >>>> https://github.com/transparenzportalhamburg/ckanext-fulltext
> >>>>
> >>>> -Matt
> >>>>
> >>>> On 7 Mar 2016 3:40 p.m., "David Read" <david.read at hackneyworkshop.com> wrote:
> >>>>>
> >>>>> Vangelis kindly responded by mentioning the main technologies that
> >>>>> dataopen.eu is based on:
> >>>>>
> >>>>> Apache Tika, MariaDB, Sphinx Search & Redis
> >>>>>
> >>>>> The back-end glue is Scala, and is sadly closed-source. But it really
> >>>>> shows the possibilities for us all to evaluate. We'll certainly be
> >>>>> watching it and people's interest in it as a way to search CKAN in a
> >>>>> deeper way.
> >>>>>
> >>>>> In the meantime, I'd love to know if people have ideas in how this
> >>>>> could be done for CKAN.
> >>>>>
> >>>>> David
> >>>>>
> >>>>> On 7 March 2016 at 11:38, David Read <david.read at hackneyworkshop.com> wrote:
> >>>>> > I noticed the implementation of search *inside* of the PDF/XLS/CSV
> >>>>> > files listed in CKANs:
> >>>>> >
> >>>>> > http://www.epsiplatform.eu/content/searching-open-data-never-0
> >>>>> > http://dataopen.eu
> >>>>> >
> >>>>> > Since it's associated with Open Knowledge it would be great if the
> >>>>> > rest of the CKAN community can take advantage of the code - does
> >>>>> > anyone know? I've just sent an email to Vangelis Banos to ask more.
> >>>>> >
> >>>>> > I wonder if anyone else has attempted this sort of thing? I imagine
> >>>>> > it's a challenge to create such a big index. Our CKAN SOLR index is
> >>>>> > big enough already with just the metadata!
> >>>>> >
> >>>>> > David
> >>>>> _______________________________________________
> >>>>> ckan-dev mailing list
> >>>>> ckan-dev at lists.okfn.org
> >>>>> https://lists.okfn.org/mailman/listinfo/ckan-dev
> >>>>> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>> Natália Oliveira



-- 
Chief Data Engineer
202-341-0191
@Data.gov
@Office of Citizen Science and Innovative Technologies/18F
<http://www.gsa.gov/portal/category/25729>
General Services Administration

Work in the Open... ideate, innovate, iterate...
@github <https://github.com/JJediny> | @projectopendata
<https://github.com/project-open-data>