[openbiblio-dev] References in Wikipedia and BibJSON

Tom Morris tfmorris at gmail.com
Wed Feb 15 18:22:14 UTC 2012


The curly braces may make the MediaWiki format look vaguely like JSON,
but there's no relation.  It's a homebrew syntax with no BNF or other
formal definition, unrelated to any other syntax known to man or
beast.  This is one of the reasons there are so few parsers for it.

The DBpedia extraction framework would be something worth investigating
as a starting point if you decide you do want to parse articles.  They
focus on infoboxes and I don't think they extract other article
elements, but since infoboxes and citations are both just types of
templates, it probably wouldn't be too hard to extend.  Also, they
should be extracting "Template:Infobox_book" information.
http://en.wikipedia.org/wiki/Template:Infobox_book
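
To see why extending it should be straightforward: in the raw
wikitext, an infobox and a citation are the same construct, a
template call.  A minimal sketch in Python (the regex is naive and
won't handle templates nested inside other templates):

    import re

    # Hypothetical snippet of article wikitext: an infobox and a
    # citation are both just "{{name | params}}" template calls.
    wikitext = """
    {{Infobox book
    | name   = Example
    | author = Jane Doe
    }}
    Some prose.<ref>{{cite book |title=Example |year=2001}}</ref>
    """

    # Grab each template call's name; naive, no nesting support.
    for m in re.finditer(r"\{\{\s*([^|}]+)", wikitext):
        print(m.group(1).strip())  # "Infobox book", then "cite book"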

Freebase WEX is a preprocessed version of the Wikipedia articles which
might be slightly easier to deal with if you decide to write your own
parser.  It's generated bi-weekly from English Wikipedia.
http://wiki.freebase.com/wiki/WEX
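
The dumps are large, so you'd stream them row by row rather than load
them whole.  A rough sketch of that pattern, assuming WEX's
tab-separated layout; the filename and column indexes below are
placeholders, so check the WEX docs for the actual schema:

    import csv
    import sys

    # Column indexes are hypothetical; see the WEX schema docs.
    TITLE_COLUMN, ARTICLE_COLUMN = 1, 3

    csv.field_size_limit(sys.maxsize)  # article fields can be huge
    # Filename is a placeholder for one of the WEX dump files.
    with open("freebase-wex-articles.tsv", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            title, article = row[TITLE_COLUMN], row[ARTICLE_COLUMN]
            # hand each article off to your citation extractor here
            print(title)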

WEX is produced using one of Magnus Manske's parsers, but the
MediaWiki parser landscape has changed quite a bit since Metaweb made
that selection years ago.  There's a list of alternative MediaWiki
parsers at:
http://www.mediawiki.org/wiki/Alternative_parsers
Using something that tracks the direction core MediaWiki development
is going would ensure more longevity.

The MediaWiki API provides access to template calls.  I've used this
in the past to extract infobox data, and the same API could be used to
extract Template:Citation calls instead of Template:Infobox calls (the
Template:Cite in Peter's example is a synonym for Template:Citation).
http://en.wikipedia.org/wiki/Template:Cite#Books
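
As a concrete example, something along these lines pulls one
article's wikitext through the API and picks out the citation
template calls.  It's only a sketch; the regex won't cope with
templates nested inside a citation's parameters:

    import json
    import re
    import urllib.request

    # Fetch the raw wikitext of one article via the MediaWiki API.
    url = ("https://en.wikipedia.org/w/api.php"
           "?action=parse&page=Vega_(rocket)&prop=wikitext&format=json")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    wikitext = data["parse"]["wikitext"]["*"]

    # Pull out {{cite ...}} and {{Citation ...}} calls; naive, no
    # support for templates nested inside the citation.
    for cite in re.findall(r"\{\{\s*[Cc]it[ae][^}]*\}\}", wikitext):
        print(cite)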

In terms of freshness, DBpedia Live is updated much more frequently
than the standard DBpedia (basically within seconds).  You can also
run your own extractors using their code at whatever frequency makes
economic sense for you.
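
And if the extracted data already covers what you need, you can skip
parsing entirely and query the public SPARQL endpoint.  A sketch; the
class and property names (dbo:Book, dbo:author, dbo:isbn) are
assumptions, so browse the DBpedia ontology to confirm what's
actually populated:

    import json
    import urllib.parse
    import urllib.request

    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?book ?author ?isbn WHERE {
      ?book a dbo:Book ; dbo:author ?author .
      OPTIONAL { ?book dbo:isbn ?isbn }
    } LIMIT 10
    """
    url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
        {"query": query,
         "format": "application/sparql-results+json"})
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)
    for b in results["results"]["bindings"]:
        print(b["book"]["value"], b["author"]["value"])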

Tom

On Wed, Feb 15, 2012 at 7:22 AM, Chris Maloney <voldrani at gmail.com> wrote:
> Hi,
> I just started lurking here recently, and I'm not sure exactly what you are
> trying to do, but have you checked to see if this information isn't already
> extracted and available through DBPedia ( http://dbpedia.org/About )?
> It seems you're trying to extract and resolve bibliographic information from
> Wikipedia articles, right?  I'm not an expert on DBPedia (I've only just
> read a little bit) but this seems to be exactly in line with DBPedia's core
> mission, and I'd be very surprised if they don't have this data available
> already.  At the very least, take a look at their extraction framework:
>  http://wiki.dbpedia.org/Documentation , which could perhaps provide the
> tool to mine for this data yourselves.
>
>
> On Wed, Feb 15, 2012 at 4:06 AM, Etienne Posthumus
> <etienne.posthumus at okfn.org> wrote:
>>
>> On 14 February 2012 17:35, Jim Pitman <pitman at stat.berkeley.edu> wrote:
>> > There is an API call which I have used before. The return is not very
>> > structured, but it's fairly
>> > clean and typically better than trying to scrape the raw html, which
>> > they discourage anyway. It should be easy
>> > to rip out the bibitems from the API call.  I'll look in my files for
>> > how the API call works and respond
>> > again on this.
>>
>> Just adding Jim's answer (from a separate mail) to this thread. To see
>> raw Wikipedia output, do:
>> http://en.wikipedia.org/w/index.php?action=raw&title=Vega_(rocket)