[pdb-discuss] pdw web interface (was british library screen scraping)
Rufus Pollock
rufus.pollock at okfn.org
Thu Jun 22 10:09:34 UTC 2006
Remember to cc pdb-discuss by doing reply-all ;)
Nathan Lewis wrote:
[snip]
>>> Could I pin you down on some points? What is the overall plan for
>>> providing this data to the public? It looks like you are planning
>>> on using django but I am not certain. What database will you be using?
>>
>>
>> The simple answer is that this is up to us and hasn't been decided.
>> There are several options:
>>
>> 1) Build a site using a web application framework such as django. A
>> /very/ simple prototype that just used the as-shipped admin interface
>> was written back in April (as I wrote to the list then)
>>
> I didn't see anything other than raw textual data. I don't know much
> about django and I am not a python programmer, but I am willing to
> learn it.
>
>
>> 2) Build something using an existing wiki system such as moinmoin.
>> This would work either by
>> a) using the virtual page plugin (mentioned previously), which would
>> let us show data from the db in the wiki very easily. However this
>> would be read-only, so a user could not edit the data; or
>> b) writing a script to auto-create wiki pages from our data. The
>> advantage is that it is easy to add data and it is a familiar
>> interface. The problem is that the data is then edited in the wiki
>> (so not saved back to a db) and it is hard to do data validation
>> (foreign keys, data types etc). Perhaps this could be addressed in
>> some manner, but it would be quite a bit of work and you end up
>> moving towards (1).
>
>
> I don't think the data is at all useful unless it is in a database
> where it can be queried.
Completely agree (see below).
> Therefore I would say make this a django application rather than
> trying to use a wiki. If it is possible to set up a little mini wiki
> within each django page for gathering comments and annotation that
> would be ideal, but a pure wiki would not give usable data in the
> long term.
+1. Entirely in agreement here. A big aim of this project is to provide
open reusable data and a wiki is *not* the way to go on that front.
>>
>> 3) ... suggestions for other ideas welcome
>>
>> IMO, in the long run I think we need to do something like (1).
>> However this is substantially more complex. As a first iteration and
>> in order to get something done by Saturday when Tom is going to demo
>> at iCommons I would suggest we work on 2b).
>
>
> Yes, I basically agree. But I think getting something going in django
> or ruby on rails might be easier than going the wiki route. If nothing
> else there is Catalyst, Perl's equivalent, which would allow rapid
> development of a site.
I think there is agreement on this: to do this properly we want a
proper db and an associated webapp providing a web interface. This
interface might have wiki-like features (versioning, easy commenting
etc.) but it would not be a wiki.
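Just to make that concrete, something like the following Django models
is roughly what I have in mind -- a minimal sketch only; the model and
field names here (Recording, Annotation etc.) are purely illustrative,
not a schema anyone has settled on:

    # Illustrative only -- model and field names are guesses, not a settled schema.
    from django.db import models

    class Recording(models.Model):
        """One catalogue entry scraped from the British Library data."""
        title = models.CharField(max_length=255)
        performer = models.CharField(max_length=255, blank=True)
        year = models.IntegerField()

        def __str__(self):
            return '%s (%s)' % (self.title, self.year)

    class Annotation(models.Model):
        """Free-text comments attached to a recording -- the 'wiki-like'
        commenting layer, kept separate from the structured data."""
        recording = models.ForeignKey(Recording, on_delete=models.CASCADE)
        author = models.CharField(max_length=100)
        text = models.TextField()
        created = models.DateTimeField(auto_now_add=True)

The point being that the structured data stays in real tables we can
query, while comments and annotation hang off it separately.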
However, in the interests of having a crude but usable prototype ready by
*this Saturday* for Tom to demo at iCommons, I suggest we do a 'hack'
whereby we write a script to dump the db into a set of wiki pages.
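Something along these lines would probably do for the hack -- a rough
sketch only, assuming the data lands in a sqlite file with a
'recordings' table; how the generated pages actually get into moinmoin
(file drop, xmlrpc, whatever) is left open:

    # Rough sketch: dump db rows out as wiki-markup pages, one file per record.
    # Table and column names (recordings, title, performer, year) are assumptions.
    import os
    import sqlite3

    def dump_to_wiki_pages(db_path, out_dir):
        os.makedirs(out_dir, exist_ok=True)
        conn = sqlite3.connect(db_path)
        rows = conn.execute('SELECT id, title, performer, year FROM recordings')
        for id_, title, performer, year in rows:
            # One wiki page per recording, named Recording<id>.
            text = '= %s =\n\n * Performer: %s\n * Year: %s\n' % (title, performer, year)
            with open(os.path.join(out_dir, 'Recording%d.txt' % id_), 'w') as f:
                f.write(text)
        conn.close()

    if __name__ == '__main__':
        dump_to_wiki_pages('pdw.db', 'wikipages')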
>>
>> Whatever the case I think we need to extract the data to a db. Once
>> there we can decide what we do next.
>> 1. Extract data from site (DONE -- thanks to your efforts)
>> 2. Parse data from web pages (DONE -- again thanks to you)
>> 3. Put data in a db (I've nearly finished coding this)
>
>
> Cool. I consider 3 to be the key to making any further progress.
This is just about done. I've just committed a bunch of work including a
python version of the html parser and a basic domain model/persistence layer:
http://project.knowledgeforge.net/pdw/trac/changeset/7
By later this morning I should have finished all the code to dump into
the db. I will then write a script to dump from the db to a wiki.
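For anyone following along, the 'dump into the db' step is nothing
fancy. A minimal sketch -- the record format and table layout here are
assumptions, not the actual code in the changeset:

    # Minimal sketch of loading parsed records into sqlite.
    # Assumes the parser yields dicts like {'title': ..., 'performer': ..., 'year': ...}.
    import sqlite3

    def load_records(records, db_path='pdw.db'):
        conn = sqlite3.connect(db_path)
        conn.execute('''CREATE TABLE IF NOT EXISTS recordings (
                            id INTEGER PRIMARY KEY,
                            title TEXT,
                            performer TEXT,
                            year INTEGER)''')
        conn.executemany(
            'INSERT INTO recordings (title, performer, year) VALUES (?, ?, ?)',
            [(r['title'], r['performer'], r['year']) for r in records])
        conn.commit()
        conn.close()

    # Example usage with a hand-made record:
    load_records([{'title': 'Example cylinder', 'performer': 'Unknown', 'year': 1900}])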
[snip]
>>> I was ready and willing to finish the job of making the british
>>> library screen scraper last April but I felt unable to move forward
>>> without a roadmap.
>>
>>
>> I thought you'd already finished that job :) -- all the scraping and
>> parsing code you had written was working perfectly, and all that
>> needed to be done in terms of data extraction was deciding on the
>> database structure so that we could put the data in. As you say, the
>> real problem was deciding what interface we wanted to provide to the
>> data, as that might influence the db structure. In any case my
>> apologies for not getting back to you properly in April.
>
>
> In my mind the bespoke screen scraper for the year 1900 was more a
> proof of concept. I don't think the year 1900 recordings are likely to
> be interesting to many if any artists. We need to scrape all the years
> in which stuff could be out of copyright.
In fact we might as well try scraping /all/ years :)
> Artists who are actually interested in a piece will probably be the
> ones most motivated to investigate the rights on it. Therefore I think
> we need to get it all into a database with a nice front end so that
> people can look at it and hopefully find something they want to use.
Completely agree. What I meant by my comments about having done the job
was that you had written a fully functioning scraper. Even if it is only
currently used for 1900 I think it can be trivially extended to scrape
everything (at least I hope so). The moment we have a functioning system
with 1900 we will start getting more data.
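Extending it should be little more than looping over a year range -- a
sketch only, with scrape_year() standing in for whatever entry point the
real scraper exposes:

    # Sketch: run the existing one-year scraper across a whole range of years.
    import time

    def scrape_year(year):
        # Placeholder: call into the real scraping/parsing code here.
        raise NotImplementedError

    def scrape_all(start=1890, end=1955):
        for year in range(start, end + 1):
            try:
                scrape_year(year)
            except Exception as err:
                print('year %d failed: %s' % (year, err))
            time.sleep(5)  # be polite to the British Library servers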
[snip]
Regards,
Rufus