[open-bibliography] Place of Publication data from the BL dataset

Christopher Gutteridge cjg at ecs.soton.ac.uk
Thu Nov 25 22:47:10 UTC 2010


If you want a really cheap and cheerful, but on-the-fly, conversion for CSV:
http://graphite.ecs.soton.ac.uk/csv2rdf/

This is a little pet tool I've been working on. Source available on 
request. It would take about 10 hours of coding to make it really useful, 
but I don't have the slack.

While it's pretty noddy, it does do stuff on the fly!
Very hokey, but:
http://is.gd/hN0lw
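
If anyone wants to see roughly what such an on-the-fly CSV-to-RDF pass
boils down to, here is a minimal sketch in Python using rdflib. To be
clear, this is not the graphite tool itself; the namespace, the input
file name and the row URIs are all made up purely for illustration:

import csv
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary

g = Graph()
with open("places.csv", newline="") as f:    # hypothetical input file
    for i, row in enumerate(csv.DictReader(f)):
        subject = URIRef("http://example.org/row/%d" % i)
        for column, value in row.items():
            if value:
                # one triple per non-empty cell, predicate named after the column header
                g.add((subject, EX[column.strip().replace(" ", "_")], Literal(value)))

print(g.serialize(format="turtle"))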

Jim Pitman wrote:
> Ben O'Steen <bosteen at gmail.com> wrote:
>
>   
>> I've pulled out the place of publication(?) data (isbd:P1016) from the
>> BL BNB dataset and compiled it into a sorted spreadsheet, focussed on
>> the locations:
>> http://bit.ly/g2l2tM
>>     
>
> This is great to see. However, I had some difficulty when I tried to harvest the spreadsheets for
> closer inspection. I have also been experimenting with Google Docs as a means of publishing machine-readable
> spreadsheet data, and found that by clicking "publish to the web" I was able to expose CSV files such as this one:
> https://spreadsheets.google.com/pub?key=0Ai_sd71RSo30dFoteW1MSS1KRWx6dVFQTW1JdnNVYWc&hl=en&output=csv
> which can be accessed programmatically, e.g. using Python's urllib.urlopen(), without any complaint from Google.
> I tried to guess the corresponding access URL for one of your spreadsheets, but could not make it work.
> Please could you provide a list of 12 simple URLs like the above from which the data can be pulled for closer
> inspection by a simple Python script?
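
For what it's worth, pulling one of those published CSVs really is only a
few lines of Python. A minimal sketch (urllib.urlopen in Python 2; the
same call lives in urllib.request in Python 3), using the key from Jim's
example URL above:

import csv
import io
from urllib.request import urlopen  # Python 2: from urllib import urlopen

# "Publish to the web" URL with output=csv, as in Jim's example
url = ("https://spreadsheets.google.com/pub"
       "?key=0Ai_sd71RSo30dFoteW1MSS1KRWx6dVFQTW1JdnNVYWc&hl=en&output=csv")

with urlopen(url) as response:
    text = response.read().decode("utf-8")

for row in csv.reader(io.StringIO(text)):
    print(row)
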
>
> I am very interested in the issue of how best to go about cleaning data like this, especially the use of crowdsourcing
> for such purposes, and the provision of efficient workflows and tools for experts to work on it.
> Two promising resources for such refinement exercises are
>
> http://code.google.com/p/google-refine/
> Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. 
>
> http://www.needlebase.com/
> Needle is a revolutionary platform for acquiring, integrating, cleansing, analyzing and publishing data on the web.  Using Needle through a web browser, without programmers or DBAs, 
> your data team can easily:
> * acquire data from multiple sources:  A simple tagging process quickly imports structured data from complex websites, XML feeds, and spreadsheets into a unified database of your design.
> * merge, deduplicate and cleanse: Needle uses intelligent semantics to help you find and merge variant forms of the same record.  Your merges, edits and deletions persist even after the original data is refreshed from its source.
> * build and publish custom data views: Use Needle's visual UI and powerful query language to configure exactly your desired view of the data, whether as a list, table, grid, or map.  Then, with one click, publish the data for others to see, or export a feed of the clean data to your own local database.
> Needle dramatically reduces the time, cost, and expertise needed to build and maintain comprehensive databases of practically anything. 
>
> A general tool for crowdsourcing is https://www.mturk.com/mturk/
>
> Some questions:
>
> Does anyone on this list have experience using these tools? 
> What other tools might be considered for this purpose?
> How should we manage the choice and application of such tools for big jobs like cleaning the BL data?
> How can such jobs be done collaboratively, so we don't end up with multiple copies of the data, each improved in some respect from the original, but not easily merged to obtain improvement in all respects?
> Who decides what is an improvement?
> Who is willing to host and maintain the improved dataset?
>
> --Jim
>
>
> ----------------------------------------------
> Jim Pitman
> Director, Bibliographic Knowledge Network Project
> http://www.bibkn.org/
>
> Professor of Statistics and Mathematics
> University of California
> 367 Evans Hall # 3860
> Berkeley, CA 94720-3860
>
> ph: 510-642-9970  fax: 510-642-7892
> e-mail: pitman at stat.berkeley.edu
> URL: http://www.stat.berkeley.edu/users/pitman
>
> _______________________________________________
> open-bibliography mailing list
> open-bibliography at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-bibliography
>   

-- 
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/




