[ckan-dev] Full-text search of PDF files in CKAN?
hotz
hotz at informatik.uni-hamburg.de
Sun Sep 14 11:45:37 UTC 2014
Hi Andrew,
we run full-text analysis with TIKA during harvesting. TIKA can
process a wide range of formats.
We store the full text in an extra field. As Ian pointed out, this is
potentially inefficient; how much it matters depends on the number of
documents and the size of the full text. In our case, we have 15000
documents in a 2GB SOLR index. (The website has been open since this
week: suche.transparenz.hamburg.de. Behind it are two CKAN search
servers, a separate database server and a further harvesting server.
Of course, this server layout is not only due to the full text ;-).)
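As a rough illustration of the harvesting step described above (extract
the text with TIKA, then keep it in an extra field), here is a minimal
Python sketch. It assumes a TIKA server reachable over HTTP on its
default port; the URL, the extra key "fulltext" and all function names
are placeholders chosen for this example, not our actual harvester code.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika for the plain text of a document."""
    return urllib.request.Request(
        tika_url,
        data=document_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def extract_text(document_bytes, tika_url=TIKA_URL, timeout=60):
    """Send the document to Tika and return the extracted text."""
    request = build_tika_request(document_bytes, tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def add_fulltext_extra(package_dict, text, key="fulltext"):
    """Return a copy of a CKAN-style package dict with the text as an extra.

    The key name "fulltext" is hypothetical. Any existing extra with the
    same key is replaced so that re-harvesting does not create duplicates.
    """
    extras = [e for e in package_dict.get("extras", []) if e.get("key") != key]
    extras.append({"key": key, "value": text})
    return dict(package_dict, extras=extras)
```

A harvester would call extract_text() on the downloaded file and pass the
result to add_fulltext_extra() before creating or updating the package.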
To improve this situation, we are currently working on a CKAN extension
that still stores the full text temporarily in extra fields (so that we
can keep this programming scheme), but finally stores specific fields
in dedicated SOLR fields instead of in the "normal" CKAN package. Thus,
when a package is retrieved, the full-text field is not included in the
data dict, but it can still make use of SOLR indexing. This extension
also entails a change to the CKAN SOLR schema.
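The core of such an extension could look roughly like the sketch below.
It is not our extension, only an illustration under assumptions: that
the dict handed to the indexer carries each extra under a flattened
"extras_<key>" entry, that the function is called from a plugin's
IPackageController.before_index hook, and that the SOLR schema has a
matching field. The names "fulltext" and "fulltext_text" are made up.

```python
# Hypothetical names: the extra key used during harvesting and the
# dedicated Solr field that must be declared in the Solr schema.
FULLTEXT_EXTRA = "fulltext"
FULLTEXT_SOLR_FIELD = "fulltext_text"


def move_fulltext_to_solr_field(index_dict,
                                extra_key=FULLTEXT_EXTRA,
                                solr_field=FULLTEXT_SOLR_FIELD):
    """Return a copy of the index dict with the full text moved.

    Assumption: extras appear in the index dict as "extras_<key>".
    The text is removed from that entry and stored under the dedicated
    Solr field, so it is indexed but no longer treated as an extra.
    """
    result = dict(index_dict)
    text = result.pop("extras_" + extra_key, None)
    if text is not None:
        result[solr_field] = text
    return result
```

In a plugin, before_index would simply return
move_fulltext_to_solr_field(pkg_dict). Keeping the logic in a plain
function makes it easy to test without a running CKAN.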
TIKA could also be used directly within SOLR by specifying TIKA in the
SOLR schema (schema.xml), as someone else also pointed out. This would
also require a change to the CKAN SOLR schema, which we previously tried
to avoid. That is why we chose the harvesting integration described
above.
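For comparison, the SOLR-side alternative would mean sending the raw
document to SOLR and letting it call TIKA itself. The sketch below only
builds such a request. It assumes that SOLR's extracting request handler
is enabled under /update/extract; the core URL and the document id are
placeholders, and we have not used this setup ourselves.

```python
import urllib.parse
import urllib.request


def build_solr_extract_request(solr_url, doc_id, document_bytes,
                               content_type="application/pdf"):
    """Build a POST request that sends a raw document to Solr for extraction.

    Assumption: the extracting request handler is available at
    <solr_url>/update/extract. "literal.id" sets the id of the resulting
    Solr document; "commit=true" makes it searchable immediately.
    """
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update/extract?" + params,
        data=document_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

The request would be sent with urllib.request.urlopen(); the fields that
receive the extracted text still have to exist in the SOLR schema.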
Best wishes,
Lothar
On 12.09.2014 16:42, Ian Ward wrote:
> On Thu, Sep 11, 2014 at 11:27 PM, Andrew White
> <WhiteA at landcareresearch.co.nz> wrote:
>> Our institution has begun using CKAN for data archiving, and it has been
>> suggested that we could also use it as a document repository. Documents
>> would mainly be PDF, but other file types, e.g. .doc and .txt, might be included.
>>
>> One of the features we would want in a document repository is the ability to
>> search the full text of documents, including PDFs that include searchable
>> text.
>>
>> Has anyone implemented such a search in CKAN? What would be required – a new
>> extension if none exists already? Presumably solr could provide the search
>> if the text field could be indexed somehow. Perhaps a metadata field
>> containing the text, but automatically populated by parsing the document on
>> addition to CKAN?
>>
>> Is there any limit to the size of the metadata fields, for indexing
>> purposes?
> Storing the complete text in a metadata field means that the full
> text will be retrieved every time the dataset is viewed, which could
> be pretty slow.
>
> I think a plugin that adds the text in
> IPackageController.before_search and a new field in the solr schema
> would be a good approach. This seems like something that would be very
> useful as a general-purpose extension. It's worth submitting a full
> description of how you would want such a feature to behave to
> https://github.com/ckan/ideas-and-roadmap/issues.
>
> Ian