[Open-Legislation] state of the union
stef
stefan.marsiske at gmail.com
Sat Jan 22 14:33:18 UTC 2011
On Sat, Jan 22, 2011 at 09:25:45AM +0100, Stefan Sels wrote:
> i just wanna know what is the current "plan"....
i'm focusing on pippi to be honest, i'm planning to do some work on memopol2,
and i hope that i can heavily contribute to something that is similar to
tratten. further my goal is to integrate these tools. tracking documents
during the legislative pipeline using rcs and possibly xml is on my radar.
> Like are you programming folks working on the right XML format ?
no. we should maybe agree what such a format should encompass first.
what kind of metadata we want to include, what structure, what objects, are
we going only for acts or also for bills, maybe amendments, etc.
this is very nice (from Francis' mail), but only a final view on Acts not
bills: http://www.opsi.gov.uk/legislation-api/developer/formats/xml
in the eu there is a monster called compromise-amendment. afaik it will be very
hard to be able to track the 'author' in such cases, except if we have a
transcription of the com
> Because I would love to start crunching some PDF files, even if it
> is not the right format, just to see what I can get out of those
> files.
regarding pdfs, i have just completed a prototype that is able on one test
document to identify the footnotes, and move them from the bottom of the pages
to the end of the document. this makes parsing the text much easier, as it is
not interrupted by the footnotes, headers and footers on each page.
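
a rough sketch of the footnote step described above, not the actual prototype.
it assumes the pdf was already dumped to plain text with a form feed between
pages (pdftotext does this), that the footnotes form the last
blank-line-separated block on a page, and that this block starts with a small
number. the names split_page and move_footnotes and the "footnotes:" heading
are made up for illustration.

```python
import re

# assumption: a footnote starts with a 1-3 digit marker, e.g. "1 See ..." or "12) ..."
FOOTNOTE_START = re.compile(r"^\s*\d{1,3}[.)]?\s+\S")


def split_page(lines):
    """Split one page (a list of text lines) into (body, footnotes)."""
    # ignore trailing blank lines at the bottom of the page
    end = len(lines)
    while end > 0 and not lines[end - 1].strip():
        end -= 1
    # walk upwards to the start of the last blank-line-separated block
    start = end
    while start > 0 and lines[start - 1].strip():
        start -= 1
    block = lines[start:end]
    # treat that block as footnotes only if it begins with a marker
    if block and FOOTNOTE_START.match(block[0]):
        return lines[:start], block
    return lines[:end], []


def move_footnotes(text):
    """Collect the footnote block of every page and append them at the end."""
    bodies, notes = [], []
    for page in text.split("\f"):  # form feed = page break in the text dump
        body, foot = split_page(page.splitlines())
        bodies.extend(body)
        notes.extend(foot)
    if notes:
        bodies = bodies + ["", "footnotes:"] + notes
    return "\n".join(bodies)
```

this heuristic will misfire on a page whose last paragraph happens to start
with a number (a numbered article, for example), so it is only a starting
point.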
i have some ideas how to use neural nets to classify textblocks into
footnote,heading,para,listing categories. so we can dump the pdf as text and
reverse-engineer the raw text back into some marked-up format.
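
to make the classification idea a bit more concrete, here is a toy sketch. it
swaps in a single-layer multiclass perceptron as the simplest possible stand-in
for a neural net, and the four features and the hand-labelled samples are pure
assumptions for illustration. a real classifier would need layout features
(font size, position on the page) and far more training data.

```python
import re

LABELS = ["footnote", "heading", "para", "listing"]


def features(block):
    """Turn a text block into a small numeric feature vector (hypothetical features)."""
    text = block.strip()
    first = text.split("\n", 1)[0]
    return [
        1.0,                                                          # bias term
        1.0 if re.match(r"\d{1,3}[.)]?\s+\S", first) else 0.0,        # starts with a number marker
        1.0 if re.match(r"[-*\u2022]\s+", first) else 0.0,            # starts with a bullet
        1.0 if len(text) < 60 and "\n" not in text else 0.0,          # short single line
        1.0 if text.endswith((".", ";", ":")) else 0.0,               # ends with punctuation
    ]


def predict(weights, x):
    """Return the label whose weight vector scores highest on x."""
    return max(LABELS, key=lambda l: sum(w * v for w, v in zip(weights[l], x)))


def train(samples, epochs=20):
    """Multiclass perceptron: on a mistake, push the right label up and the guess down."""
    weights = {label: [0.0] * 5 for label in LABELS}
    for _ in range(epochs):
        for block, label in samples:
            x = features(block)
            guess = predict(weights, x)
            if guess != label:
                for i, v in enumerate(x):
                    weights[label][i] += v
                    weights[guess][i] -= v
    return weights


# hand-labelled toy samples, invented for this sketch
SAMPLES = [
    ("1 See Directive 95/46/EC.", "footnote"),
    ("General provisions", "heading"),
    ("The member states shall ensure that personal data are processed "
     "fairly and lawfully.", "para"),
    ("- personal data", "listing"),
]

weights = train(SAMPLES)
for block, label in SAMPLES:
    print(label, "->", predict(weights, features(block)))
```

with labelled blocks from a few real documents the same loop could be reused
unchanged, only features() would have to grow.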
cheers,s
--
gpg: https://www.ctrlc.hu/~stef/stef.gpg
gpg fp: F617 AC77 6E86 5830 08B8 BB96 E7A4 C6CF A84A 7140