[ckan-dev] Full-text search of PDF files in CKAN?

Fri Sep 12 14:42:54 UTC 2014

On Thu, Sep 11, 2014 at 11:27 PM, Andrew White
<WhiteA at landcareresearch.co.nz> wrote:
> Our institution has begun using CKAN for data archiving, and it has been
> suggested that we could also use it as a document repository. Documents
> would mainly be PDF but other file types eg. .doc .txt might be included.
>
> One of the features we would want in a document repository is the ability to
> search the full text of documents, including PDFs that include searchable
> text.
>
> Has anyone implemented such a search in CKAN? What would be required – a new
> extension if none exists already? Presumably solr could provide the search
> if the text field could be indexed somehow. Perhaps a metadata field
> containing the text, but automatically populated by parsing the document on
> addition to CKAN?
>
> Is there any limit to the size of the metadata fields, for indexing
> purposes?

Storing the complete text in a metadata field means that the full
text will be retrieved every time the dataset is viewed, which could
be pretty slow.

I think a plugin that adds the text in
IPackageController.before_index and a new field in the Solr schema
would be a good approach. This seems like something that would be very
useful as a general-purpose extension. It's worth submitting a full
description of how you would want such a feature to behave to
https://github.com/ckan/ideas-and-roadmap/issues.
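
As a rough illustration of the indexing side, here is a minimal sketch
of the logic such a plugin might run at index time. CKAN's
IPackageController interface offers a before_index hook that receives
the dict about to be sent to Solr. Everything named here is an
assumption made for illustration: the "fulltext" field name, the
add_fulltext helper, the injected extract_text callable, and the
"res_url" key used to find resource URLs (check what your CKAN version
actually passes to the hook).

```python
# Sketch only: FULLTEXT_FIELD and add_fulltext are hypothetical names,
# not part of CKAN. In a real extension this would be called from a
# plugin's IPackageController.before_index hook.

FULLTEXT_FIELD = "fulltext"  # would also need declaring in the Solr schema


def add_fulltext(pkg_dict, extract_text):
    """Return a copy of the index dict with document text attached.

    extract_text is any callable mapping a resource URL to a string
    (for example a wrapper around a PDF text extractor). It is injected
    so the sketch does not depend on a particular parsing library.
    """
    indexed = dict(pkg_dict)  # leave the caller's dict untouched

    # Assumed key for the dataset's resource URLs.
    urls = indexed.get("res_url") or []
    if isinstance(urls, str):
        urls = [urls]

    texts = []
    for url in urls:
        try:
            text = extract_text(url)
        except Exception:
            # One unreadable file should not stop the dataset
            # from being indexed.
            continue
        if text:
            texts.append(text)

    if texts:
        # The text goes only into the search index, not into the
        # dataset's stored metadata, so viewing the dataset does not
        # load it.
        indexed[FULLTEXT_FIELD] = "\n".join(texts)
    return indexed
```

A hook body could then be as small as "return add_fulltext(pkg_dict,
my_extractor)", where my_extractor is whatever text extraction you
choose. Making the new field participate in queries (and how large a
field Solr will accept) is left out of this sketch.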

Ian