[ECODP-dev] Release 09.00 Ingestion Update

Bert Van Nuffelen bert.van.nuffelen at tenforce.com
Thu Jul 11 07:29:50 UTC 2013


Hi John,

Thanks for the update. Here is one update from our side.

Yesterday afternoon we took a complete step back from all the investigation
we had done so far.
As you have seen, we were uploading very large JSON payloads to CKAN. We
investigated, and it turns out that during the ingest of a package
we were adding the resources of dataset N to dataset N+1.
That explains the large JSON. We are now testing the processing times on our
dev server, and they have improved.
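
To illustrate the kind of defect described above, here is a minimal sketch in
Python. It is not the rdf2ckan code: the function names and the input shape
are hypothetical, and it assumes the resources accumulated from one dataset
to the next because a list was never reset.

```python
def build_packages_buggy(datasets):
    """Reproduce the defect: one resource list is shared by all datasets."""
    resources = []  # created once and never reset (the assumed bug)
    packages = []
    for name, dataset_resources in datasets:
        resources.extend(dataset_resources)
        # Copy the list so each package shows the size it had when sent.
        packages.append({"name": name, "resources": list(resources)})
    return packages


def build_packages_fixed(datasets):
    """Build each package from a fresh resource list."""
    packages = []
    for name, dataset_resources in datasets:
        resources = list(dataset_resources)  # new list for every dataset
        packages.append({"name": name, "resources": resources})
    return packages


# Hypothetical input: (dataset name, resources belonging to that dataset).
datasets = [("a", ["a1", "a2"]), ("b", ["b1"]), ("c", ["c1"])]
buggy_counts = [len(p["resources"]) for p in build_packages_buggy(datasets)]
fixed_counts = [len(p["resources"]) for p in build_packages_fixed(datasets)]
print(buggy_counts, fixed_counts)
```

In the buggy variant every package carries the resources of all datasets
before it, so the payloads keep growing as the ingest proceeds.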

Can you explain to us how this affects the overall performance and the
performance of a single operation?
Is it due to the number of resources per dataset, or to the total
number of resources?
What are the determining factors here?

As for the translations, the repetition is inherent to the ingestion
process, which makes it hard to combine the calls.
Each dataset is processed independently, and since every dataset contains the
same translations, the same requests are sent out each time.
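
One possible way around this, which we have not implemented, would be to
collect the translations across datasets first and send them once. The sketch
below is in Python and only builds the request body; the helper names and the
dataset structure are hypothetical. It assumes term_translation_update_many
accepts a "data" list of term translation dictionaries, as in the CKAN action
API.

```python
import json


def collect_translations(datasets):
    """Gather every translation once, keyed by (term, lang_code)."""
    unique = {}
    for dataset in datasets:
        for translation in dataset["translations"]:
            key = (translation["term"], translation["lang_code"])
            unique[key] = translation
    return list(unique.values())


def build_update_many_body(translations):
    """Build the JSON body for a single term_translation_update_many call."""
    return json.dumps({"data": translations})


# Translations taken from the log excerpt quoted below.
eurostat = [
    {"term": "More information on Eurostat Website",
     "term_translation": "Mehr Informationen auf der Eurostat Website",
     "lang_code": "de"},
    {"term": "Eurostat, the statistical office of the European Union",
     "term_translation": "Eurostat, the statistical office of the European Union",
     "lang_code": "en"},
    {"term": "More information on Eurostat Website",
     "term_translation": "Plus dinformation sur le site web dEurostat",
     "lang_code": "fr"},
]

# Hypothetical datasets that all carry the same three translations.
datasets = [
    {"name": "d1", "translations": eurostat},
    {"name": "d2", "translations": eurostat},
]
unique = collect_translations(datasets)
body = build_update_many_body(unique)
print(len(unique))
```

With this approach the two datasets above would lead to one request with
three entries instead of six separate requests.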

best,

Bert




2013/7/11 John Glover <john.glover at okfn.org>

> Hello,
>
> I have performed an initial analysis on the data that is currently in the
> test server database. We ran the rdf2ckan import using the data and command
> provided by Dimitrios. There are a number of issues that I have found so
> far:
>
> * The database size is now much larger as the number of resources has
> increased dramatically. There are currently 2301487 active resources, which
> clearly seems like an error to me as there are fewer than 6000 datasets.
> There are some datasets with huge numbers of resources. For example:
> dataset ID 'fc8fb64a-3523-4b04-83e6-6e9e66bcceb9' has 5410 active resources.
> This will definitely have an effect on the speed of package_update calls,
> and some adjustments may have to be made to either CKAN or the publisher
> workflow if this is the expected number of resources.
>
> Has the importer been run using the old rdf2ckan against the new CKAN
> release? This would be useful in helping to track down the source of some
> of these issues.
>
>
> * We believe that the slowdown in package_update is due to the problem
> with the large number of resources that are currently in the database, so
> our current focus will be on finding the reason why so many extra resources
> are being created.
>
>
> * Here is an excerpt from the Apache access logs showing the API calls
> made by rdf2ckan:
>
> 127.0.0.1 - - [10/Jul/2013:14:20:01 +0200] "POST
> /data/api/action/package_update HTTP/1.1" 200 427196 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:24:04 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:24:04 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:24:05 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:24:05 +0200] "POST
> /data/api/action/package_update HTTP/1.1" 200 429040 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:27:49 +0200] "POST
> /data/api/action/package_update HTTP/1.1" 200 431560 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:30:50 +0200] "POST
> /data/api/action/package_update HTTP/1.1" 200 434181 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:34:54 +0200] "POST
> /data/api/action/package_update HTTP/1.1" 200 437413 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:42:51 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:42:51 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
> 127.0.0.1 - - [10/Jul/2013:14:42:51 +0200] "POST
> /data/api/action/term_translation_update_many HTTP/1.1" 200 493 "-"
> "Apache-HttpClient/4.1.3 (java 1.5)"
>
> As you can see, for each dataset that is updated, 3 separate calls to
> term_translation_update_many are made. These 3 calls can all be made as a
> single call, as you can pass a list of terms to
> term_translation_update_many. This would reduce the load on the server
> during ingestion.
>
>
> * Here is an excerpt of the content of the term_translation_update_many
> calls that is being received by CKAN:
>
> [Wed Jul 10 14:15:11 2013] [{u'lang_code': u'de', u'term_translation':
> u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:15:11 2013] [{u'lang_code': u'en', u'term_translation':
> u'Eurostat, the statistical office of the European Union', u'term':
> u'Eurostat, the statistical office of the European Union'}]
> [Wed Jul 10 14:15:11 2013] [{u'lang_code': u'fr', u'term_translation':
> u'Plus dinformation sur le site web dEurostat', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:20:01 2013] [{u'lang_code': u'de', u'term_translation':
> u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:20:01 2013] [{u'lang_code': u'en', u'term_translation':
> u'Eurostat, the statistical office of the European Union', u'term':
> u'Eurostat, the statistical office of the European Union'}]
> [Wed Jul 10 14:20:01 2013] [{u'lang_code': u'fr', u'term_translation':
> u'Plus dinformation sur le site web dEurostat', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:24:04 2013] [{u'lang_code': u'de', u'term_translation':
> u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:24:04 2013] [{u'lang_code': u'en', u'term_translation':
> u'Eurostat, the statistical office of the European Union', u'term':
> u'Eurostat, the statistical office of the European Union'}]
> [Wed Jul 10 14:24:05 2013] [{u'lang_code': u'fr', u'term_translation':
> u'Plus dinformation sur le site web dEurostat', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:27:49 2013] [{u'lang_code': u'de', u'term_translation':
> u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:27:49 2013] [{u'lang_code': u'en', u'term_translation':
> u'Eurostat, the statistical office of the European Union', u'term':
> u'Eurostat, the statistical office of the European Union'}]
> [Wed Jul 10 14:27:49 2013] [{u'lang_code': u'fr', u'term_translation':
> u'Plus dinformation sur le site web dEurostat', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:30:50 2013] [{u'lang_code': u'de', u'term_translation':
> u'Mehr Informationen auf der Eurostat Website', u'term': u'More information
> on Eurostat Website'}]
> [Wed Jul 10 14:30:50 2013] [{u'lang_code': u'en', u'term_translation':
> u'Eurostat, the statistical office of the European Union', u'term':
> u'Eurostat, the statistical office of the European Union'}]
> [Wed Jul 10 14:30:50 2013] [{u'lang_code': u'fr', u'term_translation':
> u'Plus dinformation sur le site web dEurostat', u'term': u'More information
> on Eurostat Website'}]
>
> So, rdf2ckan seems to be sending the same translations repeatedly, and as
> noted above, this is all done in separate API calls instead of a single
> call.
>
>
> Regards,
> John
>
>
> _______________________________________________
> Ecodp-dev mailing list
> Ecodp-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ecodp-dev
>
>


-- 
Bert Van Nuffelen

Semantic Technologies Software Architect at TenForce
www.tenforce.be

Bert.Van.Nuffelen at tenforce.com
Office: +32 (0)16 31 48 60
Mobile:+32 479 06 24 26
skype: bert.van.nuffelen