[openbiblio-dev] query to bibliographica.org

Thu Apr 7 22:40:03 UTC 2011

William and all, my comments inline below.
--Jim

> * [2011-04-05 12:12:36 -0700] Jim Pitman <pitman at stat.Berkeley.EDU> wrote:
> ] We managed eventually to solve these problems, using solr for the search,
> ] and BibJSON http://www.bibkn.org/bibjson/index.html to help bridge from
> ] RDF and various ontologies (BIBO, FOAF ... ) to JSON. 

William Waites <ww at styx.org> wrote:
>
> So the question here is, do we use BibJSON directly or use its mapping
> to RDF? Since the storage is in terms of RDF I think we use its
> mapping.  But in that case, as consensus builds on how to represent
> RDF in JSON generally in a useable way - and there is active effort
> happening right now at the W3C for how to do this which I am involved
> with - why don't we just use that directly? In principle we gain
> compatibility and extensibility that way. 

If you think you can stay the course for Bibliographica with RDF storage and
adequate solr search over that rather than JSON storage like CouchDB, then
fine. Compatibility and extensibility are very desirable.

> Since bibliographica is not really about books but is about higher order 
> relationships between books and people, relationships that are not well defined 
> and will almost always be open to several interpretations, I think extensibility is 
> crucial.

I completely agree. One of the reasons I developed people.bibkn.org was to push on the
extensibility issue for both RDF and JSON in the biblio context. People records from
different sources are much more diverse than book or article records, and there is
little standardization of fields yet. The fields are typically dataset-specific, which
makes JSON inviting. You just have to deal at the end with mapping whatever fields you
have got to search indexes and displays. Of course, similar issues arise if you try to
match a bunch of book records, some MARC, some BibTeX, some ....

> ] I hope Bibliographica will be flexible enough to
> ] include datasets like those in http://people.bibkn.org/ and to provide
> ] comparable or better performance for data retrieval by webservice.
>
> Now this is pretty easy, in principle. One thing missing is content type
> negotiation:
>
>     curl -H Accept:application/rdf+xml http://people.bibkn.org/wsf/datasets/mathscinet_mrauth/363517
>
> should give me back RDF (there is an RDF/XML export widget on the web
> page) but gives me back an HTML page. If that is added, then individuals
> can trivially be added, and they can already be referenced from
> Bibliographica. This is if it is to be done piecemeal, like some sort
> of cross between authorclaim and what we now have, "add so and so that
> I found on people.bibkn.org as an author of this book" or "add the
> people.bibkn.org URI as an equivalent identifier for this author".

Sounds good.
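
For reference, the content negotiation check in William's curl example can be
sketched in Python roughly as follows. This is only an illustration: the helper
names build_rdf_request and negotiated_rdf are made up here, and the request is
built but not sent.

```python
from urllib.request import Request

RDF_XML = "application/rdf+xml"


def build_rdf_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks the server for RDF/XML."""
    return Request(uri, headers={"Accept": RDF_XML})


def negotiated_rdf(content_type_header: str) -> bool:
    """Return True if a Content-Type header value names RDF/XML.

    Parameters such as "; charset=utf-8" are ignored.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == RDF_XML


# Same URI as in the curl command quoted above.
req = build_rdf_request(
    "http://people.bibkn.org/wsf/datasets/mathscinet_mrauth/363517"
)
```

To actually test a server one would pass req to urllib.request.urlopen and feed
the response's Content-Type header to negotiated_rdf; a value of text/html would
mean the Accept header was ignored, which is the behaviour William reports.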

> Doing a bulk import would be easy enough as well, though I note that
> while our presentation of books is in reasonable shape, the presentation
> of authors still uses the old fresnel technique which is not so good...

Hopefully that is fairly easily fixed?

> I also notice that, looking at the JSON output of people.bibkn.org it
> looks a lot like the SPARQL results format or something closely related
> to some of the RDF-JSON variants and so has some of the same problems as
> the regular JSON output from SPARQL endpoints do

I am sure you are right. There is quite a bit of tension between RDF and JSON.
What drives the data model of a site like people.bibkn.org is that you have multiple
datasets about people, each dataset more or less a table with well-defined fields,
but with fields varying from dataset to dataset. The only things you can guarantee
for each person are a text representation of the name, some indication of the dataset
where this name was found, and an id in the dataset which can hopefully, but not
necessarily, be resolved into a URL with more information. There may be lots more,
and if there is more it is nice to be able to search over it in the hosted copy of
the dataset. That is a big benefit to dataset providers: adequate search over
their own data, which they may not have in its native state.
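
A minimal sketch of that data model in Python, for illustration only. The class
name PersonRecord, the function to_index_doc, and the "attr_" prefix for
dataset-specific fields are assumptions made up for this example; they are not
taken from people.bibkn.org or from BibJSON.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PersonRecord:
    # The three things every record is guaranteed to have.
    name: str       # text representation of the person's name
    dataset: str    # which dataset the name was found in
    local_id: str   # id within that dataset
    # May or may not exist: a URL with more information.
    url: Optional[str] = None
    # Everything else; field names vary from dataset to dataset.
    extra: Dict[str, Any] = field(default_factory=dict)


def to_index_doc(rec: PersonRecord) -> Dict[str, Any]:
    """Flatten a record into a flat document suitable for a search index."""
    doc: Dict[str, Any] = {
        "id": f"{rec.dataset}/{rec.local_id}",
        "name": rec.name,
        "dataset": rec.dataset,
    }
    if rec.url is not None:
        doc["url"] = rec.url
    # Prefix dataset-specific fields so they cannot clash with the core ones.
    for key, value in rec.extra.items():
        doc[f"attr_{key}"] = value
    return doc
```

The point of the split is that search and display code only has to rely on the
three guaranteed fields, while anything dataset-specific is carried along and
indexed without having to be agreed on in advance.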

> ] How to get from
> ] here to there? All I can suggest is that the Bibliographica developers might 
> ] take a good look at the functionality achieved by people.bibkn.org,
> ] consider how that functionality was achieved, and think about how to adapt/embed 
> ] that functionality in bibliographica.
>
> I would not like to bulk copy the data from people.bibkn.org because I
> think it is important that we think in terms of a distributed system
> where it is quite reasonable for data to live in multiple places. 

I completely agree. What I would rather suggest is providing enough functionality
in Bibliographica to replicate the value added by uploading diverse people
datasets like this from different sources. The dream was that some reward would come
just from getting the data into RDF. The reality is that this has not really happened,
except for what the solr search provides, which is great. But that is not
really RDF. You could just as well do the search over CouchDB.

> We can copy small amounts of data, e.g. people's names, as an
> optimisation but really we should not mint separate identifiers for
> them and allow them to live elsewhere. Ideally the people themselves
> might manage their own profile somewhere and we would like to use that
> as their identifier.  

I agree, profile management should be pushed out to the data providers. The point of the
aggregation is just to provide unified search/browse/webservice over the collection
of data from different sources.

> So I'd say that one thing we can do is work on a query protocol where
> we can search for people e.g. by name and get their identifiers back,
> not unlike the search mechanism I just wrote about in another mail in
> this thread.

Sounds OK.
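
As a rough illustration of "search by name, get identifiers back", here is a
small in-memory sketch in Python. It is not the protocol William has in mind,
which is not specified here; the class NameIndex, the function normalize, and
the matching rule (every query token must appear in the name) are assumptions
for this example, and any identifiers passed in are whatever URIs the data
providers already use.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Set


def normalize(name: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = "".join(c if c.isalnum() else " " for c in no_accents)
    return " ".join(no_punct.lower().split())


class NameIndex:
    """Maps name tokens to the identifiers of people whose name contains them."""

    def __init__(self) -> None:
        self._by_token: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, identifier: str) -> None:
        for token in normalize(name).split():
            self._by_token[token].add(identifier)

    def search(self, query: str) -> List[str]:
        """Return identifiers whose name contains every token of the query."""
        tokens = normalize(query).split()
        if not tokens:
            return []
        hits = set.intersection(*(self._by_token.get(t, set()) for t in tokens))
        return sorted(hits)
```

Because only identifiers come back, the caller is free to fetch the full record
from wherever it lives, which fits the distributed picture discussed above.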

> I would imagine this mechanism
> to be used by the web ui javascript and not particularly by the server
> side software, and they would be knitted together similarly to the way
> wikipedia gadgets work. I'm not sure how understandable that was, but there it is.

I don't think I follow the last paragraph, but everything earlier sounded fine.
To summarize my view: you have multiple agents (e.g. me on behalf of UC Berkeley or
the probability community) willing to provide and maintain people lists in JSON, like
all faculty at UC Berkeley, or all researchers in probability. BibJSON is developed
as needed to formalize this usage. Bibliographica provides a way to upload these lists,
or polls these lists from time to time, and allows aggregated search and retrieval of
items in these lists. It is critically important that each list may have different
attributes, so the system must allow flexibility and extensibility there; JSON
accommodates this immediately, RDF with effort. It is also critically important that
when an entry is returned, the original JSON is available. Users expect to provide and
receive data in JSON. The system may use RDF for data management and export to other
systems if that is considered useful or desirable. I remain skeptical of the rewards
from RDF for this use case, but maybe they can be demonstrated.
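
To make the summary concrete, here is a small Python sketch of the upload and
retrieve behaviour I have in mind. It is an illustration under stated
assumptions, not a design for Bibliographica: the class name Aggregator and the
"id" and "name" keys are hypothetical, and BibJSON may well name these fields
differently.

```python
import copy
import json
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]  # (dataset name, id within that dataset)


class Aggregator:
    """Holds people lists from several providers and returns entries unchanged."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Dict[str, Any]] = {}

    def upload(self, dataset: str, json_text: str) -> int:
        """Load a JSON array of entries; each needs "id" and "name" (assumed keys).

        Any other attributes are kept as they are, whatever they happen to be.
        Returns the number of entries loaded.
        """
        entries: List[Dict[str, Any]] = json.loads(json_text)
        for entry in entries:
            self._entries[(dataset, str(entry["id"]))] = entry
        return len(entries)

    def search(self, text: str) -> List[Key]:
        """Case-insensitive substring match on the name, across all datasets."""
        needle = text.lower()
        return sorted(
            key for key, entry in self._entries.items()
            if needle in str(entry["name"]).lower()
        )

    def get(self, dataset: str, entry_id: str) -> Dict[str, Any]:
        """Return a copy of the entry exactly as the provider supplied it."""
        return copy.deepcopy(self._entries[(dataset, entry_id)])
```

Note that this keeps the original fields and values of each entry, not the
provider's byte-for-byte formatting. Whatever is used internally for data
management, RDF or otherwise, would sit behind upload and get without the data
providers or users having to see it.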

--Jim