[open-bibliography] Fwd: FAST (Faceted Application of Subject Terminology)—Dataset

Jim Pitman pitman at stat.Berkeley.EDU
Thu Feb 16 17:58:05 UTC 2012


Etienne Posthumus <eposthumus at gmail.com> wrote:

> On 16 February 2012 17:47, Jim Pitman <pitman at stat.berkeley.edu> wrote:
> > Etienne, do you think you could do a parse to map these to some form of JSON/BibJSON and post e.g. to CKAN?
>
> Yes we could, but do we want to? I do not see the immediate value of
> converting a dump like this to JSON.

The value is that it lowers the barrier to using and processing the dataset, and hence increases the value of the dataset.
For example, I am interested in the structure of the math/stat section of FAST, and would be willing
to cross-link that to MSC2010 and other subject classification and navigation efforts I have made in math/stat.
To do that, I have to parse the FAST data as it is at present, and map it to my data format of choice, which
is JSON. I am not an expert in the formats FAST is currently in, and would do an inefficient job of mapping it,
and I don't want to have to deal with the whole dataset.

This problem is generic. I have the same problem with all the big library dump datasets. I am seriously interested
in the math/stat portions of all of them, and could add value to all of them by cross-linking and applying my domain
expertise. But I should not have to do the heavy lifting of getting the datasets from where they are now
to the point where I can extract what I care about from them. Every domain expert who is not well equipped with the tools to process large datasets
has this problem. This group should, I think, be doing all it can to engage those domain experts and enable
them to add value to these massive library dumps, which otherwise will just sit where they are and not be used
by the people who could add value to them.

> The FAST data is already available as a resource with URIs that can be queried or linked to from a running BibServer.

But the returns are not in BibJSON, right?
Writing a parser to BibJSON over these services would be very useful. I am not sure that would solve the problem of a user like me
who wants the entire subset of math/stat headings. I suppose I could spider over the current FAST interface to get what I wanted.
But it should be easier than that to get large chunks of the data defined in obvious ways, like all of math or all of stat.

> Why do we need to convert in bulk if we can link to a resource that is maintained?

I hope I have answered adequately above. It is possible, and perhaps best, for my use case to be met by a JSON wrapper around an existing
maintained API. I'm just not sure that is adequate. I know that if I had a complete dump in JSON I could just go through it and do it myself in half
an hour. Sure, I could do that in RDF too, but it would take me some days to learn what that was all about and parse the data for myself, and I know from
past experience that time spent learning RDF/SPARQL/... is time wasted. I respect that there are industrial agents for which (perhaps) RDF is the appropriate framework.
But I don't need the friction of dealing with RDF. Gimme the JSON in bulk!
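
To make the "half an hour" claim concrete, here is a minimal sketch of that kind of pass over a bulk JSON dump. It assumes a hypothetical layout of one JSON object per line with "id" and "label" fields; neither FAST nor BibJSON is confirmed to use these names, and the keyword list and sample identifiers are illustrative only.

```python
import json
from typing import Iterable, Iterator

# Hypothetical keyword stems; a real selection would follow FAST's own hierarchy.
KEYWORDS = ("mathematic", "statistic")


def math_stat_records(lines: Iterable[str], keywords=KEYWORDS) -> Iterator[dict]:
    """Yield records whose 'label' field contains any keyword (case-insensitive).

    Assumes one JSON object per line; the 'label' field name is an assumption.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        record = json.loads(line)
        label = record.get("label", "").lower()
        if any(keyword in label for keyword in keywords):
            yield record


# Made-up sample records standing in for lines read from a dump file.
sample = [
    '{"id": "fst-0001", "label": "Mathematical statistics"}',
    '{"id": "fst-0002", "label": "Cooking"}',
    '{"id": "fst-0003", "label": "Statistics--Graphic methods"}',
]
selected = list(math_stat_records(sample))
```

With a real dump, the same function could be fed an open file object instead of the sample list, since it only needs an iterable of lines.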

>  For example, (just pulling a random item from the air here) if we want to link this:
>   http://bnb.bibsoup.net/bibsoup/bnb/GB7409399
Nice. Better still is the JSON:
http://bnb.bibsoup.net/bibsoup/bnb/GB7409399.json
The page itself should show how to find that. (I just guessed and got lucky.)
> we could programmatically connect it to:
>   http://experimental.worldcat.org/fast/813004/
> and see where it leads from there.
Yes. But where is the JSON?
And how can I iterate this to get every record in BNB and in FAST that has anything to do with math or stat?
That is my use case, and I don't see how it is met either by the library dumps or by current APIs.
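
As a rough illustration of the per-record route, the sketch below derives the JSON address from a record page address by appending ".json", which is the guess that worked for the BNB record above. Whether that suffix holds for every BibSoup record, or for FAST at all, is an assumption; the function names json_url and fetch_record are hypothetical. It also shows why this does not meet the use case: it handles one known record at a time and offers no way to enumerate all math/stat records.

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def json_url(record_url: str) -> str:
    """Return the record URL with a '.json' suffix on its path.

    The suffix convention is only observed for one BNB record; treat it
    as an assumption elsewhere.
    """
    parts = urlsplit(record_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_record(record_url: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON form of a single record (network access required)."""
    with urlopen(json_url(record_url), timeout=timeout) as response:
        return json.load(response)


bnb_json = json_url("http://bnb.bibsoup.net/bibsoup/bnb/GB7409399")
```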

> And the same goes for any other subject heading datasets. (many of which do not have well-maintained, or linkable, or crippled online
> versions, so there I can see the value of doing conversions)

My comments apply to all such subject heading datasets too.

Many thanks for looking at this!

--Jim

----------------------------------------------
Jim Pitman
Professor of Statistics and Mathematics
University of California
367 Evans Hall # 3860
Berkeley, CA 94720-3860

ph: 510-642-9970  fax: 510-642-7892
e-mail: pitman at stat.berkeley.edu
URL: http://www.stat.berkeley.edu/users/pitman



