[ckan-dev] Full-text search of PDF files in CKAN?

Sun Sep 14 11:45:37 UTC 2014

Hi Andrew,

We run full-text analysis with TIKA during harvesting. TIKA can process
a wide range of formats.
We store the full text in an extra field. As Ian pointed out, this is
potentially inefficient. How much it matters, however, depends on the
number of documents and the size of the full text. In our case, we have
15000 documents in a 2 GB SOLR index. (The website went live this week:
suche.transparenz.hamburg.de. It is backed by two CKAN search servers,
a separate database server and an additional harvesting server. This
server layout, of course, is not only due to the full text ;-).)
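
As a rough illustration of the "extra field" approach described above, 
here is a minimal Python sketch. The extractor is passed in as a function 
so the example stays self-contained; in a real harvester it would wrap the 
TIKA call. The function names and the extra key "fulltext" are assumptions 
made for this sketch and are not taken from the setup described here.

```python
def set_fulltext_extra(package_dict, text, key="fulltext"):
    """Return a copy of a CKAN-style package dict with the text stored as an extra.

    The key name "fulltext" is an assumption chosen for this sketch.
    """
    # Drop any previous full-text extra so a re-harvest replaces it.
    extras = [e for e in package_dict.get("extras", []) if e.get("key") != key]
    extras.append({"key": key, "value": text})
    updated = dict(package_dict)
    updated["extras"] = extras
    return updated


def harvest_document(package_dict, path, extract_text):
    """Extract text from the file at `path` and attach it to the package dict.

    `extract_text(path) -> str` is a hypothetical stand-in for the TIKA call.
    """
    text = extract_text(path) or ""
    # Collapse runs of whitespace; extracted PDF text is often ragged.
    text = " ".join(text.split())
    return set_fulltext_extra(package_dict, text)
```

Because the text travels with the package as an ordinary extra, it is 
returned with every package read, which is the inefficiency mentioned 
above.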

To improve this situation, we are currently developing a CKAN extension
that still stores the full text temporarily in extra fields (so that the
existing programming scheme keeps working), but in the end stores
specific fields not in the "normal" CKAN package but in dedicated SOLR
fields. Thus, when a package is retrieved, the full text field is not
included in the data dict. However, the full text field can still make
use of SOLR indexing. This extension also entails a change to the CKAN
SOLR schema.
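
The indexing side of this idea could look roughly like the following 
Python sketch. It assumes that, by the time 
IPackageController.before_index is called, the indexer has flattened each 
extra into a top-level "extras_<key>" entry, so a plugin could call a 
helper like this from that hook. The helper name and the SOLR field name 
"fulltext" are assumptions, and the field would still have to be declared 
in schema.xml. Keeping the text out of the returned data dict needs 
further handling that is not shown here.

```python
def move_fulltext_to_solr_field(index_dict, extra_key="fulltext",
                                solr_field="fulltext"):
    """Move a flattened full-text extra into a dedicated SOLR field.

    `index_dict` is the dict about to be sent to SOLR. Both names are
    hypothetical; `solr_field` must exist in the SOLR schema.
    """
    result = dict(index_dict)  # leave the caller's dict untouched
    text = result.pop("extras_" + extra_key, None)
    if text:
        result[solr_field] = text
    return result
```

Called from a plugin's before_index method, the helper would return the 
dict that is handed on to SOLR.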

TIKA could also be used directly within SOLR by specifying TIKA in the
SOLR schema (schema.xml), as someone else also pointed out. This would
likewise require a change to the CKAN SOLR schema, which we had tried to
avoid in the past. Thus, we chose the harvesting integration described
above.

Best wishes,
Lothar

On 12.09.2014 16:42, Ian Ward wrote:
> On Thu, Sep 11, 2014 at 11:27 PM, Andrew White
> <WhiteA at landcareresearch.co.nz> wrote:
>> Our institution has begun using CKAN for data archiving, and it has been
>> suggested that we could also use it as a document repository. Documents
>> would mainly be PDF but other file types, e.g. .doc and .txt, might be included.
>>
>> One of the features we would want in a document repository is the ability to
>> search the full text of documents, including PDFs that include searchable
>> text.
>>
>> Has anyone implemented such a search in CKAN? What would be required – a new
>> extension if none exists already? Presumably solr could provide the search
>> if the text field could be indexed somehow. Perhaps a metadata field
>> containing the text, but automatically populated by parsing the document on
>> addition to CKAN?
>>
>> Is there any limit to the size of the metadata fields, for indexing
>> purposes?
> Storing the complete text in a metadata field means that the full
> text will be retrieved every time the dataset is viewed, which could
> be pretty slow.
>
> I think a plugin that adds the text in
> IPackageController.before_search and a new field in the solr schema
> would be a good approach. This seems like something that would be very
> useful as a general-purpose extension. It's worth submitting a full
> description of how you would want such a feature to behave to
> https://github.com/ckan/ideas-and-roadmap/issues.
>
> Ian