[Open-access] Trying to index the malaria literature for BOAI-Openness - what has to be done paper-by-paper?

Daniel Mietchen daniel.mietchen at googlemail.com
Tue Mar 13 20:39:48 UTC 2012


Just to sum up:

Google brings up over 80k hits for malaria on PMC:
https://www.google.com/search?q=malaria+site%3Awww.ncbi.nlm.nih.gov%2Fpmc ,

of which
216 PMC articles under CC0:
https://www.google.com/search?q=malaria+site%3Awww.ncbi.nlm.nih.gov%2Fpmc+%22Creative+Commons+Public+Domain%22

as well as over 9000 under CC BY:
https://www.google.com/search?q=malaria+site%3Awww.ncbi.nlm.nih.gov%2Fpmc+%22http%3A%2F%2Fcreativecommons.org%2Flicenses%2Fby%22+-%22by-sa%22++-%22by-nc%22++-%22by-nd%22
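
For anyone who wants to regenerate such queries for other search terms or licenses, here is a minimal Python sketch of how the two URLs above can be assembled. The function names build_query and search_url are my own, chosen for illustration, and are not part of any existing tool. The sketch only builds the URL; it does not fetch or parse any results.

```python
from urllib.parse import urlencode

# Values taken from the two search URLs above.
PMC_SITE = "www.ncbi.nlm.nih.gov/pmc"
CC0_PHRASE = "Creative Commons Public Domain"
CC_BY_PHRASE = "http://creativecommons.org/licenses/by"
# Variants to exclude so that only plain CC BY pages remain.
EXCLUDED_VARIANTS = ("by-sa", "by-nc", "by-nd")


def build_query(term, license_phrase, excluded=()):
    """Combine a search term, the site restriction and a quoted license phrase."""
    parts = [term, f"site:{PMC_SITE}", f'"{license_phrase}"']
    parts += [f'-"{variant}"' for variant in excluded]
    return " ".join(parts)


def search_url(query):
    """Percent-encode the query into a Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


cc0_url = search_url(build_query("malaria", CC0_PHRASE))
cc_by_url = search_url(build_query("malaria", CC_BY_PHRASE, EXCLUDED_VARIANTS))
print(cc0_url)
print(cc_by_url)
```

The CC0 URL produced this way is identical to the one given above; the CC BY one differs only in that it uses single rather than doubled spaces between the excluded terms.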

I do not know how Mark arrived at the 70k results for the set of
keywords, but the 80k for malaria contain phrases like "diseases like
malaria" (136x, cf.
https://www.google.com/search?q=%22diseases+like+malaria%22+site%3Awww.ncbi.nlm.nih.gov%2Fpmc
), which may not always be in the focus of Malaria World.
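
One way to handle the relevance question would be Peter's suggestion (quoted below) to intersect the sets. The following is only a sketch under assumptions that nothing in this thread confirms: that both the Malaria World references and the license-filtered hits can be exported as lists of PMC identifiers, and that these appear in forms like "PMC123" or a PMC article URL. The names normalize_id and overlap, and the identifiers in the example, are hypothetical.

```python
import re


def normalize_id(raw):
    """Reduce 'pmc 200' or a PMC article URL to the canonical form 'PMC200'."""
    match = re.search(r"PMC\s*(\d+)", raw, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no PMC identifier found in {raw!r}")
    return "PMC" + match.group(1)


def overlap(reference_ids, licensed_ids):
    """Split two identifier lists into their intersection and the two remainders."""
    refs = {normalize_id(r) for r in reference_ids}
    licensed = {normalize_id(i) for i in licensed_ids}
    return {
        "relevant_and_licensed": refs & licensed,
        "relevant_license_unknown": refs - licensed,
        "licensed_not_in_references": licensed - refs,
    }


# Hypothetical identifiers, for illustration only.
mw_references = [
    "PMC100",
    "pmc 200",
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC300/",
]
cc_by_hits = ["PMC200", "PMC300", "PMC400"]

result = overlap(mw_references, cc_by_hits)
for name, ids in result.items():
    print(name, sorted(ids))
```

The "licensed_not_in_references" remainder is where hits such as the "diseases like malaria" ones would be expected to end up, provided the reference set itself is well curated.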

Peter, Mark and Sam - is the code for your Crawler available
somewhere? Or is there an app that I could use to text mine, say,
arXiv?

The code for our tool is at
https://github.com/erlehmann/open-access-media-importer , and an outline of
the documentation is at
http://en.wikiversity.org/wiki/User:OpenScientist/Open_grant_writing/Wissenswert_2011/Documentation/Crawler
.

Daniel

On Tue, Mar 13, 2012 at 6:05 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>
> On Tue, Mar 13, 2012 at 4:44 PM, Tom Olijhoek <tom.olijhoek at gmail.com>
> wrote:
>>
>> Hi All,
>>
>> Do I understand it right that you did a search for just one keyword
>> "malaria"?
>> PMR says that MW gave >70K references
>> I thought that Mark extracted >70K references using the set of keywords
>> provided by us (MW)?
>
>
> That's what I meant
>
>>
>> This is of no consequence to the observation that google search seems to
>> do a good job.
>> We recently had a discussion (Mark, Tom , Bart, Serge [admin MW]) that the
>> MW ref database of 2010-2011 will be compared to the PMC extracted references
>> for these years to see how identical the sets are.
>> That will be the basis for an extraction over a long time period using the
>> chosen keywords.
>> If you can use Google to sort out the CC-BY and CC-0 articles, it would be fantastic!
>>
>> TOM
>>
>>
>> On Tue, Mar 13, 2012 at 4:46 PM, Daniel Mietchen
>> <daniel.mietchen at googlemail.com> wrote:
>>>
>>> Google does not give a simple list of results, and it currently yields
>>> over 80k hits for malaria on PMC:
>>> https://www.google.com/search?q=malaria+site%3Awww.ncbi.nlm.nih.gov%2Fpmc
>>> .
>>>
>>> However, the Crawler currently being coded as part of the Open Access
>>> Media Importer (cf.
>>>
>>> http://wir.okfn.org/2012/03/10/open-access-media-importer-apology-frontend-usage/
>>> ) does almost what you are looking for, and so it should not be too
>>> difficult to modify it accordingly.
>>>
>>> I have copied Nils and Raphael - my partners on this - into this mail.
>>>
>>> Cheers,
>>>
>>> Daniel
>>>
>>>
>>> On Tue, Mar 13, 2012 at 4:14 PM, Peter Murray-Rust <pm286 at cam.ac.uk>
>>> wrote:
>>> >
>>> >
>>> > On Tue, Mar 13, 2012 at 3:09 PM, Daniel Mietchen
>>> > <daniel.mietchen at googlemail.com> wrote:
>>> >>
>>> >> Just seen in
>>> >>
>>> >>
>>> >> http://blogs.ch.cam.ac.uk/pmr/2012/03/13/sparc2012-a-manifesto-in-absentia-for-open-data/
>>> >> :
>>> >> "Our recent @ccess group is trying to index the malaria literature for
>>> >> BOAI-Openness and it has to be done paper-by-paper"
>>> >>
>>> >> I am not sure what PMR had in mind here that "has to be done
>>> >> paper-by-paper", but although PMC / PMUK indeed do a bad job of
>>> >> identifying papers by their licenses,
>>> >
>>> >
>>> > That's exactly what I meant!
>>> >>
>>> >> Google does it fairly well:
>>> >> 216 PMC articles under CC0 are being brought up in search for
>>> >> "malaria"
>>> >>
>>> >>
>>> >> https://www.google.com/search?q=malaria+site%3Awww.ncbi.nlm.nih.gov%2Fpmc+%22Creative+Commons+Public+Domain%22
>>> >> as well as over 9000 under CC BY:
>>> >>
>>> >>
>>> >> https://www.google.com/search?q=malaria+site%3Awww.ncbi.nlm.nih.gov%2Fpmc+%22http%3A%2F%2Fcreativecommons.org%2Flicenses%2Fby%22+-%22by-sa%22++-%22by-nc%22++-%22by-nd%22
>>> >> (for further stats, see
>>> >>
>>> >>
>>> >> http://wir.okfn.org/2011/07/14/a-wiki-approach-to-open-access-and-open-science/#comment-44
>>> >> ).
>>> >>
>>> >> Of course, the relevance of these papers for Malaria World cannot be
>>> >> established that way.
>>> >>
>>> > Malaria World gave us 70K references and the question is how we
>>> > establish
>>> > the Openness of them. We can start with Google and maybe intersect the
>>> > sets.
>>> > Does google give a simple list of the 9000 papers?
>>> >
>>> >>
>>> >> _______________________________________________
>>> >> open-access mailing list
>>> >> open-access at lists.okfn.org
>>> >> http://lists.okfn.org/mailman/listinfo/open-access
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Peter Murray-Rust
>>> > Reader in Molecular Informatics
>>> > Unilever Centre, Dep. Of Chemistry
>>> > University of Cambridge
>>> > CB2 1EW, UK
>>> > +44-1223-763069
>>>
>>
>>
>
>
>



