[ECODP-dev] Release 09.00 Ingestion Update

John Glover john.glover at okfn.org
Thu Jul 11 07:19:02 UTC 2013


Hello,

I have performed an initial analysis on the data that is currently in the
test server database. We ran the rdf2ckan import using the data and command
provided by Dimitrios. There are a number of issues that I have found so
far:

* The database size is now much larger as the number of resources has
increased dramatically. There are currently 2301487 active resources, which
looks like an error to me as there are fewer than 6000 datasets. Some
datasets have huge numbers of resources. For example, dataset ID
'fc8fb64a-3523-4b04-83e6-6e9e66bcceb9' has 5410 active resources. This will
definitely affect the speed of package_update calls, and some adjustments
may have to be made to either CKAN or the publisher workflow if this is the
expected number of resources.

Has the importer been run using the old rdf2ckan against the new CKAN
release? This would be useful in helping to track down the source of some
of these issues.


* We believe that the slowdown in package_update is due to the large number
of resources currently in the database, so our current focus will be on
finding out why so many extra resources are being created.


* Here is an excerpt from the Apache access logs showing the API calls made
by rdf2ckan:

127.0.0.1 - - [10/Jul/2013:14:20:01 +0200] "POST
/data/api/action/package_update HTTP/1.1" 200 427196 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:24:04 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:24:04 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:24:05 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:24:05 +0200] "POST
/data/api/action/package_update HTTP/1.1" 200 429040 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
/data/api/action/package_update HTTP/1.1" 200 431560 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
/data/api/action/package_update HTTP/1.1" 200 434181 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
/data/api/action/package_update HTTP/1.1" 200 437413 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:42:51 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:42:51 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"
127.0.0.1 - - [10/Jul/2013:14:42:51 +0200] "POST
/data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
"Apache-HttpClient/4.1.3 (java 1.5)"

As you can see, for each dataset that is updated, 3 separate calls to
term_translation_update_many are made. These 3 calls could be merged into a
single call, because term_translation_update_many accepts a list of terms.
This would reduce the load on the server during ingestion.
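
As a rough sketch of that batching (written in Python for brevity, although
the user agent suggests rdf2ckan is a Java client), the three translations
for a dataset could be collected into one list and posted once. The base
URL, the API key and the helper name build_batched_request are placeholders
of mine, not values from the test server. The list is passed under the
"data" key, which is the parameter term_translation_update_many reads. The
strings are copied as they appear in the CKAN log.

```python
import json
import urllib.request

# Placeholder values (assumptions), not taken from the test server.
CKAN_BASE_URL = "http://localhost/data"
API_KEY = "my-api-key"


def build_batched_request(translations):
    """Build one POST request that carries every translation in one list."""
    body = json.dumps({"data": translations}).encode("utf-8")
    return urllib.request.Request(
        CKAN_BASE_URL + "/api/action/term_translation_update_many",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": API_KEY},
        method="POST",
    )


# The three translations that are currently sent as three separate calls.
translations = [
    {"term": "More information on Eurostat Website",
     "term_translation": "Mehr Informationen auf der Eurostat Website",
     "lang_code": "de"},
    {"term": "Eurostat, the statistical office of the European Union",
     "term_translation": "Eurostat, the statistical office of the European Union",
     "lang_code": "en"},
    {"term": "More information on Eurostat Website",
     "term_translation": "Plus dinformation sur le site web dEurostat",
     "lang_code": "fr"},
]

request = build_batched_request(translations)
# urllib.request.urlopen(request) would send it; it is left out here so
# that the sketch has no side effects.
```

With this shape the access log would show one term_translation_update_many
request per dataset instead of three.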


* Here is an excerpt of the content of the term_translation_update_many
calls as received by CKAN:

[Wed Jul 10 14:15:11 2013] [{u'lang_code': u'de', u'term_translation':
u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:15:11 2013] [{u'lang_code': u'en', u'term_translation':
u'Eurostat, the statistical office of the European Union', u'term':
u'Eurostat, the statistical office of the European Union'}]
[Wed Jul 10 14:15:11 2013] [{u'lang_code': u'fr', u'term_translation':
u'Plus dinformation sur le site web dEurostat', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:20:01 2013] [{u'lang_code': u'de', u'term_translation':
u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:20:01 2013] [{u'lang_code': u'en', u'term_translation':
u'Eurostat, the statistical office of the European Union', u'term':
u'Eurostat, the statistical office of the European Union'}]
[Wed Jul 10 14:20:01 2013] [{u'lang_code': u'fr', u'term_translation':
u'Plus dinformation sur le site web dEurostat', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:24:04 2013] [{u'lang_code': u'de', u'term_translation':
u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:24:04 2013] [{u'lang_code': u'en', u'term_translation':
u'Eurostat, the statistical office of the European Union', u'term':
u'Eurostat, the statistical office of the European Union'}]
[Wed Jul 10 14:24:05 2013] [{u'lang_code': u'fr', u'term_translation':
u'Plus dinformation sur le site web dEurostat', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:27:49 2013] [{u'lang_code': u'de', u'term_translation':
u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:27:49 2013] [{u'lang_code': u'en', u'term_translation':
u'Eurostat, the statistical office of the European Union', u'term':
u'Eurostat, the statistical office of the European Union'}]
[Wed Jul 10 14:27:49 2013] [{u'lang_code': u'fr', u'term_translation':
u'Plus dinformation sur le site web dEurostat', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:30:50 2013] [{u'lang_code': u'de', u'term_translation':
u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
on Eurostat Website'}]
[Wed Jul 10 14:30:50 2013] [{u'lang_code': u'en', u'term_translation':
u'Eurostat, the statistical office of the European Union', u'term':
u'Eurostat, the statistical office of the European Union'}]
[Wed Jul 10 14:30:50 2013] [{u'lang_code': u'fr', u'term_translation':
u'Plus dinformation sur le site web dEurostat', u'term': u'More information
on Eurostat Website'}]

So, rdf2ckan seems to be sending the same translations repeatedly, and as
noted above, this is all done in separate API calls instead of a single
call.
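
One possible way to avoid resending identical translations, sketched in
Python under the assumption that the importer can keep a small in-memory
record for the duration of a run: remember each (term, lang_code,
term_translation) triple that has already been sent and skip it the next
time. The function name new_translations and the already_sent set are
hypothetical and are not part of rdf2ckan or CKAN.

```python
def new_translations(translations, already_sent):
    """Return only the translations not sent earlier in this import run.

    `already_sent` is a set of (term, lang_code, term_translation) tuples
    and is updated in place as a side effect.
    """
    fresh = []
    for item in translations:
        key = (item["term"], item["lang_code"], item["term_translation"])
        if key not in already_sent:
            already_sent.add(key)
            fresh.append(item)
    return fresh


already_sent = set()

# The same three translations arrive with every dataset in the log excerpt.
per_dataset = [
    {"term": "More information on Eurostat Website",
     "term_translation": "Mehr Informationen auf der Eurostat Website",
     "lang_code": "de"},
    {"term": "Eurostat, the statistical office of the European Union",
     "term_translation": "Eurostat, the statistical office of the European Union",
     "lang_code": "en"},
    {"term": "More information on Eurostat Website",
     "term_translation": "Plus dinformation sur le site web dEurostat",
     "lang_code": "fr"},
]

first = new_translations(per_dataset, already_sent)   # first dataset
second = new_translations(per_dataset, already_sent)  # any later dataset
```

Here the first dataset yields all three translations and every later
dataset yields an empty list, so no call would be needed for it. Whether
skipping is safe depends on the translations not being changed on the
server during the run, which I have not verified.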


Regards,
John