[open-science-dev] Data-Transcriber Follow up
Daniel Lombraña González
teleyinex at gmail.com
Fri Jul 8 08:39:21 UTC 2011
I also support the idea of using the mailing list for this project.
Regarding the questions about BOSSA: BOSSA is a separate tool from BOINC, but
it shares most of BOINC's code. BOINC has some Python
and there isn't an API for BOSSA, but maybe we can create one (BOINC has some
web RPC methods for creating users, getting credit, etc.). However, its
workflow is pretty simple: as a developer or researcher you only have to write
some scripts with specific callback functions for showing the jobs to
the volunteers, submitting the results, etc. (here you have the
The only problem for me right now is that neither BOSSA nor BOINC supports HTML
templates, so the HTML and PHP code are completely mixed. Even though this is
not perfect, it is very easy to deploy an application in a very short
time. You can have a look at the tutorials
and if you have problems, you can ask me directly ;)
Therefore, should we move everything to BOSSA right now? I'm OK with this
"fork", as BOSSA has some features that will make it easier to manage the
volunteers, jobs, etc.
PS: Does this list have rules about e-mail format? I mean, are
HTML e-mails forbidden or allowed?
On Fri, Jul 8, 2011 at 05:01, Nigini Abilio <nigini at lsd.ufcg.edu.br> wrote:
> Hi everyone.
> First of all, I'm following Rufus's idea of using the email list. This
> way, we simplify sending messages and start a communication archive.
> I'm Nigini from Brazil. I'm starting my PhD studies in the broad area of
> collective intelligence, and I hope (with my advisor Nazareno) to be helpful
> to the data-digitizer project. As Lucas said, we are planning to do
> another "month-step" focusing on two points:
> 1. gain more experience with Bossa (and the other code produced) so we
> can build/suggest another prototype;
> 2. think about the workflow design related to the system usage by the
> With step 1, we'll try to bring more insight to the technical people, so
> we can all decide on the technology and architecture to use. The main
> idea is to maximize reuse (minimizing re-coding) of these open-source tools.
> So, as Rufus asked about BOINC/BOSSA integration, it would be really nice to
> have a way to integrate our Python code, since we all know that
> technology better.
> Again, I agree with Rufus about documenting things on GitHub,
> as it has code, tasks and text integration tools, and user friendly
> On Thu, Jul 7, 2011 at 11:29 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>> cc'ing open-science-dev list <
>> I suggest we use that as coordination list for this work going forward --
>> at least initially.
>> On 6 July 2011 22:08, Lucas Ferreira Mation <lucasmation at gmail.com> wrote:
>>> Dear all,
>>> It was great to see a demo materialize at the Hackfest. This email is to
>>> follow up on the work: introduce people, see who is interested, and discuss
>>> how to push the development of the tool forward.
>>> In Brazil, besides me, we have 3 developers based at the UFCG computer science
>>> department, in the Northeast of Brazil, who intend to work on this:
>>> Nazareno, a professor, Nigini, a PhD student, and an undergrad assistant. We
>>> had a meeting today and we are willing to push this version into a fully
>>> functional demo over the next month. Or at least try. But in order to do
>>> that, it would be good for us to discuss the broad options and where we want to
>> Sounds great.
>>> The broad idea we discussed is to introduce Bossa <http://boinc.berkeley.edu/trac/wiki/BossaIntro> in the demo. Bossa will manage the users and job assignments (a "job"
>>> consists of a pair: "table image" + "unique Google spreadsheet"). This will
>>> mean coding in PHP and translating the current code to PHP. For the moment
>>> we intend to stick with Google Docs for the table interface on the right
>> Sounds good -- though I obviously prefer Python (my PHP is non-existent)
>> :-) However, I think we have an existing platform in BOSSA and more PHP
>> people, so that sounds like a good call. We can fork the code right now in
>> DataDigitizer (or will this be part of the core Bossa code base?)
>> One question though: BOINC seems to have bindings in Python. Is there
>> anything like this in BOSSA? That way, if one wanted to integrate with Python,
>> one could. Or to put it even more simply: does BOSSA have an API we can use?
>> anything ...
>>> Let me know if you guys are OK with this path. We could still change
>>> these things later, but the idea is to have something working soon so that we can
>>> test it.
>>> If you can, let us know how each person can contribute (a more
>>> detailed list is below).
>>> Besides the Brazilians, Daniel, François, Jenny, Guo and Rufus (who
>>> were at the hackfest), I'm also including Javier Ruiz in this email. Javier
>>> has pointed me to http://scripto.org/, a similar open-source tool for
>>> transcribing text that can even be used to generate tables (although the
>>> interface is not good; it is the same one used to create tables in a Wikipedia
>>> article). Also, the crowdsourcing is wiki-like: more fluid, with version
>>> control, but with no explicit job assignment or volunteer management.
>>> It can be a good source of code, but I would still use Bossa. Javier's group
>>> is involved in creating a platform for volunteer table transcription of
>>> genealogy records that have a more fixed template.
>> I think what we really want is a clean, basic underlying system that can be
>> easily extended / plugged into. I don't think any of this is very hard and I
>> think we could move very quickly here :-)
>>> We were completing the Task list:
>> Looks good. I'd suggest adding these as issues on the DataDigitizer GitHub
>> issue tracker -- it will be much easier to discuss and comment:
>>> 1) Image Preparation (DONE)
>>> 1.5) Rename images according to some index, e.g. "Book1page3.tif"
>>> 2) Job Management
>>> 2.1) Integrate the BOSSA account manager. (Credit can be attributed because
>>> each user is only sent to one page at a time)
>>> 2.1.1) Automatically create a job when a user identifies a table in a document
>>> a) associate that image (and others of the same page) with that job
>>> b) create a unique Google Docs spreadsheet for that page (automate
>>> it using the API)
>>> 3) User interface
>>> 3.1) Define the action that creates jobs ("is there a table here?")
>>> 3.3) Visualize the Google spreadsheet on the right of the page (DONE)
>>> 3.3.1) take the headers off as much as possible (preferably only
>>> 3.4) Visualize the form extracting
>>> 3.5) Create an HTML form for metadata and embed it in the page
>>> 3.6) Add a mark-rectangle tool (not a priority now)
>>> 3.6.1) associate this info with OCR (done at the server) and return the
>>> result. The user decides whether to use it or not.
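
The job model in the task list above (a page image named by book/page index, paired with its own spreadsheet) could be sketched roughly as follows in Python. This is only an illustration, not Bossa's API or the DataDigitizer code: the names `image_name`, `parse_image_name`, `Job`, `create_job` and the `new_spreadsheet` callback are all hypothetical, and the spreadsheet creation is left as a caller-supplied function rather than a real Google Docs call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed naming scheme, taken from the "Book1page3.tif" example in item 1.5.
_IMAGE_NAME = re.compile(r"^Book(\d+)page(\d+)\.tif$")


def image_name(book: int, page: int) -> str:
    """Build an indexed file name such as 'Book1page3.tif'."""
    return f"Book{book}page{page}.tif"


def parse_image_name(name: str) -> Tuple[int, int]:
    """Recover (book, page) from an indexed file name."""
    match = _IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected image name: {name!r}")
    return int(match.group(1)), int(match.group(2))


@dataclass
class Job:
    """A job pairs one table image with one unique spreadsheet."""
    image: str
    spreadsheet_id: str


def create_job(book: int, page: int,
               new_spreadsheet: Callable[[str], str]) -> Job:
    """Create the job for a page where a user has identified a table.

    `new_spreadsheet` is a hypothetical callback that takes a title and
    returns the identifier of a freshly created spreadsheet.
    """
    title = f"Book{book}page{page}"
    return Job(image=image_name(book, page),
               spreadsheet_id=new_spreadsheet(title))
```

Passing the spreadsheet creation in as a callback keeps the sketch independent of any particular spreadsheet API, so the same job model could sit behind Bossa or behind the existing Python code.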
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use HTML, RTF, TXT, CSV,
or any other format that does not require a program from a specific
vendor to process the information it contains.