[ckan-dev] ckan-dev Digest, Vol 47, Issue 9

Alfredo Serafini seralf at gmail.com
Fri Sep 12 12:25:31 UTC 2014


Hi,
Solr already has components to index not only the metadata but also the
text of PDFs.
The indexing of the text is rather basic (all the text is indexed as it is
extracted by Tika), but it's a good start, and it's even possible to do
some manipulation of the lines during the update phase. Otherwise, if you
need to parse the text as structured text before indexing it, you have to
create an external component.


A good starting point is the Solr Cell documentation:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
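
To give an idea of what that page describes, here is a minimal Python
sketch that posts a PDF to Solr's extracting request handler
(/update/extract). The Solr base URL, the document id, the "text" target
field and the "ignored_" prefix are assumptions for illustration and depend
on your solrconfig.xml and schema; the helper names are hypothetical, not
part of CKAN or Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_url, doc_id, content_field="text"):
    """Build a URL for Solr's extracting request handler (Solr Cell)."""
    params = [
        ("literal.id", doc_id),           # unique key of the new Solr document
        ("fmap.content", content_field),  # map Tika's extracted body to this field
        ("uprefix", "ignored_"),          # prefix for fields the schema does not know
        ("commit", "true"),               # commit right away (fine for a quick test)
    ]
    return solr_url.rstrip("/") + "/update/extract?" + urlencode(params)


def index_pdf(solr_url, doc_id, pdf_path):
    """Send the raw PDF bytes to Solr and return the HTTP status code."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    req = Request(
        build_extract_url(solr_url, doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(req) as resp:
        return resp.status
```

Something like index_pdf("http://localhost:8983/solr", "report-1",
"report.pdf") would then send the file, assuming the handler is enabled in
your Solr configuration.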

I don't know much about how to plug a different configuration into the CKAN
side, but if you need help configuring Solr itself for this, I would be
happy to help.


2014-09-12 14:00 GMT+02:00 <ckan-dev-request at lists.okfn.org>:

> Today's Topics:
>
>    1. Full-text search of PDF files in CKAN? (Andrew White)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 12 Sep 2014 03:27:12 +0000
> From: Andrew White <WhiteA at landcareresearch.co.nz>
> To: "ckan-dev at lists.okfn.org" <ckan-dev at lists.okfn.org>
> Subject: [ckan-dev] Full-text search of PDF files in CKAN?
> Message-ID: <A57AEDECCE1FE24CA5F973D44E2D16E8122145DA at HERMES.landcare.ad.landcareresearch.co.nz>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Greetings from a new subscriber!
>
> Our institution has begun using CKAN for data archiving, and it has been
> suggested that we could also use it as a document repository. Documents
> would mainly be PDFs, but other file types (e.g. .doc, .txt) might be included.
>
> One of the features we would want in a document repository is the ability
> to search the full text of documents, including PDFs that include
> searchable text.
>
> Has anyone implemented such a search in CKAN? What would be required - a
> new extension if none exists already? Presumably solr could provide the
> search if the text field could be indexed somehow. Perhaps a metadata field
> containing the text, but automatically populated by parsing the document on
> addition to CKAN?
>
> Is there any limit to the size of the metadata fields, for indexing
> purposes?
>
> Regards
>
> Andrew White
> Information Systems Support Specialist
> Landcare Research New Zealand
> PO Box 69040
> Lincoln, Canterbury 7640
> New Zealand
>
> Phone: +64 3 321 9815
> Fax: + 64 3 321 9998
>
>