[open-bibliography] Place of Publication data from the BL dataset

Jim Pitman pitman at stat.Berkeley.EDU
Thu Nov 25 15:35:03 UTC 2010


Ben O'Steen <bosteen at gmail.com> wrote:

> I've pulled out the place of publication(?) data (isbd:P1016) from the
> BL BNB dataset and compiled it into a sorted spreadsheet, focussed on
> the locations:
> http://bit.ly/g2l2tM

This is great to see. I had some difficulty, however, when I tried to harvest the spreadsheets for
closer inspection.  I have also been experimenting with Google Docs as a means of publishing machine-readable
spreadsheet data, and found that by clicking "publish to the web" I was able to expose CSV files such as this
one
https://spreadsheets.google.com/pub?key=0Ai_sd71RSo30dFoteW1MSS1KRWx6dVFQTW1JdnNVYWc&hl=en&output=csv
which can be accessed programmatically, e.g. using python urllib.urlopen(), without any complaint from Google.
I tried to guess the corresponding access URL for one of your spreadsheets, but could not make it work.
Could you please provide a list of 12 simple URLs like the one above from which the data can be pulled for closer inspection
by a simple python script?
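As a minimal sketch of the kind of script I mean (Python 2, using the published CSV URL above):

    import csv
    import urllib

    # "Publish to the web" CSV export of the Google spreadsheet above
    URL = ("https://spreadsheets.google.com/pub"
           "?key=0Ai_sd71RSo30dFoteW1MSS1KRWx6dVFQTW1JdnNVYWc"
           "&hl=en&output=csv")

    # Fetch and parse the CSV; no authentication is needed for published sheets
    rows = list(csv.reader(urllib.urlopen(URL)))
    print "%d rows; header: %s" % (len(rows), rows[0])

Given a list of such URLs, one could loop over them and pull all twelve sheets the same way.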

I am very interested in how best to clean data like this, especially the use of crowdsourcing
for such purposes, and the provision of efficient workflows and tools for experts to work on it.
Two promising resources for such refinement exercises are

http://code.google.com/p/google-refine/
Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. 

http://www.needlebase.com/
Needle is a revolutionary platform for acquiring, integrating, cleansing, analyzing and publishing data on the web.  Using Needle through a web browser, without programmers or DBAs, 
your data team can easily:
* acquire data from multiple sources:  A simple tagging process quickly imports structured data from complex websites, XML feeds, and spreadsheets into a unified database of your design.
* merge, deduplicate and cleanse: Needle uses intelligent semantics to help you find and merge variant forms of the same record.  Your merges, edits and deletions persist even after the original data is refreshed from its source.
* build and publish custom data views: Use Needle's visual UI and powerful query language to configure exactly your desired view of the data, whether as a list, table, grid, or map.  Then, with one click, publish the data for others to see, or export a feed of the clean data to your own local database.
Needle dramatically reduces the time, cost, and expertise needed to build and maintain comprehensive databases of practically anything. 

A general tool for crowdsourcing is https://www.mturk.com/mturk/

Some questions:

Does anyone on this list have experience using these tools? 
What other tools might be considered for this purpose?
How should the choice and application of such tools be managed for big jobs like cleaning the BL data?
How can such jobs be done collaboratively so we don't end up with multiple copies of the data, each improved in some respect over the original, but not easily merged to obtain improvement in all respects?
Who decides what is an improvement?
Who is willing to host and maintain the improved dataset?

--Jim


----------------------------------------------
Jim Pitman
Director, Bibliographic Knowledge Network Project
http://www.bibkn.org/

Professor of Statistics and Mathematics
University of California
367 Evans Hall # 3860
Berkeley, CA 94720-3860

ph: 510-642-9970  fax: 510-642-7892
e-mail: pitman at stat.berkeley.edu
URL: http://www.stat.berkeley.edu/users/pitman
