[open-science-dev] Data-Transcriber Follow up

Nigini Abilio nigini at lsd.ufcg.edu.br
Fri Jul 8 03:01:22 UTC 2011

Hi everyone.

First of all, I'm following Rufus's idea of using the email list. This way,
we simplify sending messages and start a communication archive.

I'm Nigini from Brazil. I'm starting my PhD studies in the broad area of
collective intelligence, and I wish (with my advisor Nazareno) to be helpful
to the data-digitizer project. As Lucas said, we are planning to make
another "month-step" focusing on two points:

   1. gain more experience with Bossa (and the other code produced) so we
   can build/suggest another prototype;
   2. think about the workflow design related to the system usage by the

With step 1, we'll try to bring more insight to the technical people, so
we can all decide on the technology and architecture to be used. But the main
idea is to maximize reuse (minimizing re-coding) of these open source tools.
So, as Rufus asked about BOINC/BOSSA integration, it would be really nice to
have a way to integrate our Python code, as we all have more experience with
this technology.

Again, I agree with Rufus about documenting things on GitHub,
as it has code, task and text integration tools, and user friendly


On Thu, Jul 7, 2011 at 11:29 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> cc'ing open-science-dev list <
> http://lists.okfn.org/mailman/listinfo/open-science-dev>
> I suggest we use that as coordination list for this work going forward --
> at least initially.
> On 6 July 2011 22:08, Lucas Ferreira Mation <lucasmation at gmail.com> wrote:
>> Dear all,
>> It was great to see a demo materialize in the Hackfest. This email is to
>> follow up on the work: introduce people and see who is interested, discuss
>> how to push the development of the tool forward.
>> In Brazil, besides me, we have 3 developers based at the UFCG computer science
>> department, in the Northeast of Brazil, who intend to work on this:
>> Nazareno, a professor, Nigini, a PhD student, and an undergrad assistant. We
>> had a meeting today and we are willing to push this version into a fully
>> functional demo over the next month. Or at least try. But in order to do
>> that it would be good for us to discuss the broad options and where we want to
>> go.
> Sounds great.
>> The broad idea we discussed is to introduce Bossa <http://boinc.berkeley.edu/trac/wiki/BossaIntro> in the demo. Bossa will manage the users and job assignments (a "job"
>> consists of a pair: "table image" + "unique Google spreadsheet"). This will
>> mean coding in PHP and translating the current code to PHP. For the moment
>> we intend to stick with Google Docs for the table interface on the right-hand
>> side.
> Sounds good -- though I obviously prefer python (my PHP is non-existent)
> :-) However I think we have an existing platform in BOSSA and more PHP
> people so that sounds like a good call. We can fork the code right now in
> DataDigitizer (or will this be part of core Bossa code base?)
> One question though: BOINC seems to have bindings in Python. Is there
> anything like this in BOSSA? That way if one wanted to integrate with Python
> one could. Or to put it even more simply: does BOSSA have an API we can use?
> This would mean we could even embed stuff with a JavaScript app into
> anything ...
>> Let me know if you guys are ok with this path. We could still change these
>> things later, but the idea is to have something working soon so that we can
>> test it.
>> If you can, let us know how each person can contribute (a more
>> detailed list is below).
>> Besides the Brazilians, Daniel, François, Jenny, Guo and Rufus (who were
>> at the hackfest), I'm also including Javier Ruiz in this email. Javier has
>> pointed me to http://scripto.org/ a similar open source tool for
>> transcribing text that can even be used to generate tables (although the
>> interface is not good; it is the same as the one for generating tables in a
>> Wikipedia article). Also the crowdsourcing is wiki-like, more fluid, with version
>> control but with no explicit job assignment or volunteer management. So
>> it can be a good source for code, but I would still use Bossa. Javier's group
>> is involved in creating a platform for volunteer table transcription of
>> genealogy records that have a more fixed template.
> I think what we really want is a clean, basic underlying system that can be
> easily extended / plugged into. I don't think any of this is very hard and I
> think we could move very quickly here :-)
>> We were completing the Task list:
> Looks good. I'd suggest adding these as issues on DataDigitizer github
> issue tracker -- will be much easier to discuss and comment:
> http://github.com/okfn/datadigitizer/issues
> Rufus
>> 1) Image Preparation  (DONE)
>> 1.5) Rename images according to some index, e.g. "Book1page3.tif"
>> 2) Job Management
>> 2.1) Integrate BOSSA account manager. (Credit can be attributed because each
>> user is only sent to one page at a time)
>>     2.1.1) Automatically create a job when a user identifies a table in the document
>>         a) associate that image (and others of the same page) with that job
>>         b) create a unique Google Docs spreadsheet for that page (automate
>> it using the API)
>> 3) User interface
>> 3.1) Define action that creates jobs ("is there a table here"?)
>> 3.3) Visualize Google spreadsheet on the right of the page (DONE)
>>     3.3.1) take the headers off as much as possible (preferably only
>> table)
>> 3.4) Visualize the form extraction
>> 3.5) Create HTML form for metadata and embed in page
>> 3.6) add mark rectangle tool  (not a priority now)
>>     3.6.1) associate this info with OCR (done at the server) and return the
>> result. User decides whether to use it or not.
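
A minimal sketch of task-list items 1.5 and 2.1.1 above, in Python, since that is the language the group says it knows best. It assumes the "Book1page3.tif" example from item 1.5 generalizes to Book<book>page<page>.<ext>. The names image_name, parse_image_name, Job and make_job are hypothetical, and the spreadsheet ID is a placeholder string, not something obtained from the Google or Bossa APIs.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Assumed naming pattern, generalized from the "Book1page3.tif" example.
_NAME_RE = re.compile(r"^Book(\d+)page(\d+)\.(\w+)$")


def image_name(book: int, page: int, ext: str = "tif") -> str:
    """Build an indexed file name such as 'Book1page3.tif'."""
    return f"Book{book}page{page}.{ext}"


def parse_image_name(name: str) -> Tuple[int, int]:
    """Recover (book, page) from a file name built by image_name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an indexed image name: {name!r}")
    return int(match.group(1)), int(match.group(2))


@dataclass(frozen=True)
class Job:
    """A job as described in the thread: one table image plus one spreadsheet."""
    image: str
    spreadsheet_id: str  # hypothetical placeholder, not a real API value


def make_job(book: int, page: int, spreadsheet_id: str) -> Job:
    """Pair the page image with the spreadsheet created for that page."""
    return Job(image=image_name(book, page), spreadsheet_id=spreadsheet_id)
```

Keeping the book and page numbers recoverable from the file name means a job can be traced back to its source page without a separate lookup table.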