[open-science-dev] Data-Transcriber Follow up

Rufus Pollock rufus.pollock at okfn.org
Thu Jul 7 14:29:09 UTC 2011


cc'ing open-science-dev list <
http://lists.okfn.org/mailman/listinfo/open-science-dev>

I suggest we use that as the coordination list for this work going forward --
at least initially.

On 6 July 2011 22:08, Lucas Ferreira Mation <lucasmation at gmail.com> wrote:

> Dear all,
>
> It was great to see a demo materialize in the Hackfest. This email is to
> follow up on the work: introduce people, see who is interested, and discuss
> how to push the development of the tool forward.
>
> In Brazil, besides me, we have 3 developers based at the UFCG computer
> science department, in the Northeast of Brazil, who intend to work on this:
> Nazareno, a professor, Nigini, a PhD student, and an undergrad assistant. We
> had a meeting today and we are willing to push this version into a fully
> functional demo over the next month. Or at least try. But in order to do
> that it would be good for us to discuss the broad options and where we want
> to go.
>

Sounds great.


> The broad idea we discussed is to introduce Bossa
> <http://boinc.berkeley.edu/trac/wiki/BossaIntro> in the demo. Bossa will
> manage the users and job assignments (a "job" consists of a pair: "table
> image" + "unique Google spreadsheet"). This will mean coding in PHP and
> translating the current code to PHP. For the moment we intend to stick
> with Google Docs for the table interface on the right-hand side.
>

Sounds good -- though I obviously prefer python (my PHP is non-existent) :-)
However I think we have an existing platform in BOSSA and more PHP people so
that sounds like a good call. We can fork the code right now in
DataDigitizer (or will this be part of core Bossa code base?)

One question though: BOINC seems to have bindings in Python. Is there
anything like this in BOSSA? That way if one wanted to integrate with Python
one could. Or to put it even more simply: does BOSSA have an API we can use?
This would mean we could even embed stuff with a JavaScript app into
anything ...


> Let me know if you guys are ok with this path. We could still change these
> things later, but the idea is to have something working soon so that we can
> test it.
>
> If you can, let us know how each person can contribute (below is a more
> detailed list).
>
> Besides the Brazilians, Daniel, François, Jenny, Guo and Rufus (who were
> at the hackfest), I'm also including Javier Ruiz in this email. Javier has
> pointed me to http://scripto.org/, a similar open source tool for
> transcribing text that can even be used to generate tables (although the
> interface is not good: it is the same one used to create tables in a
> Wikipedia article). Also, the crowdsourcing is wiki-like: more fluid, with
> version control but with no explicit job assignment or volunteer
> management. So it can be a good source for code, but I would still use
> Bossa. Javier's group is involved in creating a platform for volunteer
> table transcription of genealogy records that have a more fixed template.
>

I think what we really want is a clean, basic underlying system that can be
easily extended / plugged into. I don't think any of this is very hard and I
think we could move very quickly here :-)


> We were completing the Task list:
>

Looks good. I'd suggest adding these as issues on the DataDigitizer GitHub
issue tracker -- it will be much easier to discuss and comment there:

http://github.com/okfn/datadigitizer/issues

Rufus


> 1) Image Preparation  (DONE)
> 1.5) Rename images according to some index, e.g. "Book1page3.tif"
>
> 2) Job Management
> 2.1) Integrate BOSSA account manager. (Credit can be attributed because each
> user is only sent to one page at a time)
>     2.1.1) Automatically create a job when a user identifies a table in a
> document
>         a) associate that image (and others of the same page) to that job
>
>         b) create a unique Google Docs spreadsheet for that page (automate it
> using API)
>
> 3) User interface
> 3.1) Define action that creates jobs ("is there a table here"?)
> 3.3) Visualize Google spreadsheet on the right of the page (DONE)
>     3.3.1) take the headers off as much as possible (preferably only table)
> 3.4) Visualize the form extracting
> 3.5) Create HTML form for metadata and embed in page
>
> 3.6) add mark rectangle tool  (not a priority now)
>     3.6.1) associate this info to OCR (done at the server) and return the
> result. User decides whether to use it or not.
>
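
To make items 1.5 and 2.1.1 a bit more concrete, here is a rough Python
sketch of how page images might be grouped into jobs. The filename pattern is
only inferred from the "Book1page3.tif" example; the Job class,
parse_image_name, build_jobs and the make_sheet callback (a stand-in for
whatever ends up creating the spreadsheet) are all hypothetical names, not
part of Bossa or the current demo.

```python
import re
from dataclasses import dataclass, field

# Assumed naming scheme, based on the "Book1page3.tif" example above.
IMAGE_NAME = re.compile(r"^Book(\d+)page(\d+)\.tif$", re.IGNORECASE)


def parse_image_name(name):
    """Return (book, page) for a name like 'Book1page3.tif', or None."""
    match = IMAGE_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


@dataclass
class Job:
    """Hypothetical job record: the images of one page plus one spreadsheet."""
    book: int
    page: int
    spreadsheet_id: str
    images: list = field(default_factory=list)


def build_jobs(image_names, make_sheet):
    """Group images by (book, page) and create one job per page.

    make_sheet is a placeholder callback that returns a spreadsheet
    identifier; a real version would call the spreadsheet API instead.
    """
    jobs = {}
    for name in image_names:
        key = parse_image_name(name)
        if key is None:
            continue  # skip files that do not follow the naming scheme
        if key not in jobs:
            jobs[key] = Job(key[0], key[1], make_sheet(*key))
        jobs[key].images.append(name)
    return jobs
```

Keeping spreadsheet creation behind a callback means the grouping logic can be
tried out without touching any external service.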