[okfn-be] presentation

Laurent Peuch psycojoker at gmail.com
Thu Dec 15 23:49:11 UTC 2011

Thanks everyone for the welcome :)

Oh, I have a question for this list: what would be the legal
implications of releasing structured data extracted from the official
websites of the Belgian Government (which is what I'm doing) on the
licence of the resulting data? And would I/we risk any legal trouble?

While this won't stop me, I'd prefer to be aware of it.

Just to clarify, my plan for the moment is to build a
{lachambre,dekamer}2json tool with a (probably RESTful) API for
everyone to build on top of. It will probably have an interface like
Parltrack, with a stream of events.

And I intend to parse, in more or less the following order:
- Deputies (the information on their pages) [already done, in fact]
- Commissions [mostly done]
- Documents linked to deputies (the grey box on their page) [I have
  the list of everything except the agenda]
- Find every stream of information and build RSS/other feeds on top of
  it, to see what the hell is happening right now.
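To give an idea, a record from the 2json tool could look roughly like
this (hypothetical field names, purely illustrative; nothing is
committed yet):

```json
{
  "full_name": "Jane Doe",
  "party": "Example Party",
  "commissions": ["Justice", "Finance"],
  "url": "http://www.lachambre.be/..."
}
```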

Oh, and for those who don't know what Parltrack/Memopol are:
- Parltrack: http://parltrack.euwiki.org
- memopol: https://memopol.lqdn.fr

Pieter Colpaert wrote:
> Would you be able to attend our next meeting?

Sadly no :/

I already have something planned and I'm not very motivated to cancel
it. Otherwise, I live in Brussels, so count me in for the next occasions.

Jan Vangrinsven wrote:
> For the moment I'm evaluating a semantic search engine, and as you can
> guess I decided to focus on data from 'dekamer.be' and also the
> senaat.be.
> So far I scraped the pdf's and transformed them to text, and next
> uploaded them into our system. Maybe we could compare the work we're
> doing.

Oh, that's great news! Those PDFs seem horrible to parse; did you
manage to get some structured information out of them? If so, would
you consider doing a set of small CLI tools, each one dedicated to a
kind of document? That would be really awesome.

Also, which tools are you using? I was looking at pdftotext -layout to
reconstruct the documents.
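In case it helps, here's a rough sketch of the kind of post-processing
I have in mind for `pdftotext -layout` output (the column-splitting
heuristic and the sample lines are invented, purely illustrative):

```python
import re


def split_columns(layout_text):
    """Split `pdftotext -layout` output into rows of columns.

    Assumes columns are separated by runs of two or more spaces,
    which is how -layout tends to preserve the page geometry.
    """
    rows = []
    for line in layout_text.splitlines():
        cells = [c.strip() for c in re.split(r"\s{2,}", line.strip())
                 if c.strip()]
        if cells:
            rows.append(cells)
    return rows


sample = (
    "Doc 53-1234     Proposition de loi\n"
    "Doc 53-1235     Projet de loi\n"
)
print(split_columns(sample))
# → [['Doc 53-1234', 'Proposition de loi'], ['Doc 53-1235', 'Projet de loi']]
```

Whether that heuristic survives contact with the real documents is
another question, of course.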

On Thu, Dec 15, 2011 at 11:32:01AM +0100, Lieven Janssen wrote:
> how do you exactly parse the website? Do you use a framework? Which
> programming language?

In short: Python / django-nonrel (to use NoSQL as a DB) / MongoDB /
BeautifulSoup. I may use lxml with its BeautifulSoup interface, but for
the moment I don't feel the need. These are basically the technologies
I'm most effective with, except MongoDB, which is new to me; I chose it
after Stf's experience with it on Parltrack.
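To give an idea of the kind of extraction involved, here is a
stripped-down sketch using only the stdlib html.parser (the real code
uses BeautifulSoup, and the CSS class here is invented; lachambre.be
obviously uses its own markup):

```python
from html.parser import HTMLParser


class DeputyParser(HTMLParser):
    """Toy extractor: collects the text of every <td class="deputy-name">.

    The "deputy-name" class is a made-up example, not the real markup.
    """

    def __init__(self):
        super().__init__()
        self._in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "deputy-name") in attrs:
            self._in_name = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name and data.strip():
            self.names.append(data.strip())


parser = DeputyParser()
parser.feed('<table><tr><td class="deputy-name">Jane Doe</td></tr></table>')
print(parser.names)
# → ['Jane Doe']
```

With BeautifulSoup the same thing is a one-liner
(`soup.find_all("td", class_="deputy-name")`), which is why I use it.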

I'll probably publish my code under the AGPLv3+ pretty soon; I don't
know exactly when yet.

> Within the iRail NPO we developed The DataTank a platform which can be used
> to easily scrape a website and turn it into a RESTful API returning JSON,
> XML, ...

I've taken a look at it, but it's very far from my usual way of
working, and mostly uses technologies I'm not familiar with. Normally
that wouldn't be a problem, but here I'm more focused on getting a
result than on learning something new.


Laurent Peuch -- Bram