[Open-access] Open Data Access Point in R

Christian Tzurcanu christian.tzurcanu at gmail.com
Mon Dec 23 17:13:45 UTC 2013


Dear members,

My proposal is to build a multilingual navigation thesaurus for an open
data catalog, so that I can plug it into
http://subject.ro/index.php?uri=uat.rdf (as a URI).
That way we can index data and messages/comments about the data, and offer
metadata back into R.

1. Why bring open data to R as a priority?
Because R has a very extensive library of algorithms and behavior that
complements any data.

2. Why subject.ro? What is subject.ro, and what does it plan to be?
We want to make it a general ontology based on the SKOS thesaurus format,
so that it is very easy for humans to categorize with and is also
machine-readable.
We plan it as a gateway to linked and open data, indexed by subject, on the
web as well as in R.
Our demo uses the astronomy ontology, based on its draft thesaurus (not yet
completed with all URIs), so expect bugs from this side.

3. We would begin with subject.ro data and behavior as a "portal" to
linked data in R, because that will bring qualitative dimensions into R
(through controlled vocabularies). R is presently very good for
quantitative data but lacks the ability to compute semantics.

4. subject.ro will keep its data available for replication and
synchronization (for now in MySQL, but with plans for CouchDB, which is
very good for distributed databases). We will have mobile and desktop apps
for interfacing with this data, in addition to R and the website.
We have extensive experience in programming for all platforms: web, mobile,
and desktop, on all operating systems. But we will need more volunteer
programmers for faster returns :)


Now I would like to talk about what we have already done, so you can see
the link to our plans for subject.ro:
in http://sliced.ro/docs/docs/Science.html we have demoed some things that
we would like in R:

For each thesaurus we propose:
-have all terms in the singular and in the masculine gender (for the
languages where this applies)
-have the fewest possible words per term (prefer hyphenated and compound
words)
-have only the eponyms capitalized
-prefer the same number of words per term as the English term
-prefer to include part of the inheritance in the term (a term should
uniquely define the reality without the need to know its ancestry in the
graph), or have 2 versions of the thesaurus: one with intrinsic identity
and one with possible extrinsic identity
-there should be just one preferred term for each language
-we also have to know for each term whether it has single or multiple
inheritance
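
Some of the rules above can be checked mechanically. Below is a minimal
sketch in Python (the same idea would carry over to R). The Term record,
the function check_word_count, the "ex:" URIs and the sample labels are all
hypothetical, invented only for illustration. Storing one string per
language code enforces the "one preferred term per language" rule by
construction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Term:
    """Hypothetical record for one thesaurus concept."""
    uri: str
    # language code -> the single preferred term in that language
    pref_labels: Dict[str, str]
    # URIs of the broader (parent) concepts
    broader: List[str] = field(default_factory=list)

    def has_multiple_inheritance(self) -> bool:
        """True when the concept has more than one parent."""
        return len(self.broader) > 1


def check_word_count(term: Term, lang: str, reference_lang: str = "en") -> bool:
    """True if the label in `lang` has as many words as the reference label.

    Words are split on whitespace, so a hyphenated word counts as one word.
    """
    words = len(term.pref_labels[lang].split())
    reference_words = len(term.pref_labels[reference_lang].split())
    return words == reference_words


# Made-up sample concept with two parents
dwarf_planet = Term(
    uri="ex:dwarf-planet",
    pref_labels={"en": "dwarf planet", "ro": "planetă pitică"},
    broader=["ex:planet", "ex:minor-body"],
)
print(check_word_count(dwarf_planet, "ro"))
print(dwarf_planet.has_multiple_inheritance())
```

The grammatical rules (singular, masculine, eponym capitalization) are
language-specific and are not covered by this sketch.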

We should talk about each rule and I will tell you why I have reached these
conclusions. They are not the only possible solution.

Each language should have a function that can form a term's plural and
feminine forms.
There should be a function that takes a text and a language code and
compiles a list of the terms the text contains.
There should be a function that takes a text, a source language and a
target language. It will return an exact translation for controlled terms
and an approximate translation of the rest using Google Translate.
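
To make the last two functions more concrete, here is a minimal sketch in
Python (an R version would be analogous). The vocabulary VOCAB, its concept
ids and the function names find_terms and translate_controlled are
assumptions made up for this example. The sketch matches preferred terms
exactly, so plural and feminine forms are not recognized, and the
approximate translation of the remaining text is left out.

```python
import re

# Hypothetical controlled vocabulary: concept id -> {language: preferred term}
VOCAB = {
    "star": {"en": "star", "ro": "stea"},
    "neutron-star": {"en": "neutron star", "ro": "stea neutronică"},
    "galaxy": {"en": "galaxy", "ro": "galaxie"},
}


def find_terms(text, lang, vocab=VOCAB):
    """Return the ids of controlled terms found in `text`.

    Longer labels are matched first and removed from the text, so that
    "neutron star" is not also reported as "star".
    """
    candidates = sorted(
        ((entry[lang].lower(), cid) for cid, entry in vocab.items() if lang in entry),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    remaining = text.lower()
    found = []
    for label, cid in candidates:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, remaining):
            found.append(cid)
            remaining = re.sub(pattern, " ", remaining)
    return found


def translate_controlled(text, source, target, vocab=VOCAB):
    """Replace controlled terms by their exact term in the target language.

    All other text is returned unchanged; a machine-translation step
    would handle it in a full implementation.
    """
    mapping = {
        entry[source].lower(): entry[target]
        for entry in vocab.values()
        if source in entry and target in entry
    }
    if not mapping:
        return text
    # One pass over the text, trying longer labels first at each position
    alternation = "|".join(
        re.escape(label) for label in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)


print(find_terms("A neutron star in a distant galaxy", "en"))
print(translate_controlled("A neutron star and a galaxy", "en", "ro"))
```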

As for Semantic Web processing, for any text and language:
There should be a function that returns the greater common term: the term
that contains all the other mentioned terms.
There should be a function that returns the smallest distinctors (an
invented idea): the terms that are the most detailed (the leaves in the
thesaurus graph).
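
Both ideas can be sketched on a small graph of broader-term links. The
following Python sketch (again, an R version would be analogous) uses a
made-up thesaurus fragment BROADER and hypothetical function names. It
rests on two assumptions of mine: the greater common term is read as the
most specific term that contains all mentioned terms, and the smallest
distinctors are read as the mentioned terms that have no mentioned narrower
term. Both functions return sets, because with multiple inheritance there
may be more than one answer.

```python
# Hypothetical thesaurus fragment: term -> list of broader (parent) terms
BROADER = {
    "celestial object": [],
    "star": ["celestial object"],
    "neutron star": ["star"],
    "pulsar": ["neutron star"],
    "galaxy": ["celestial object"],
}


def ancestors_or_self(term, broader=BROADER):
    """Return the term plus every term reachable through broader links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader[current])
    return seen


def most_specific(candidates, broader=BROADER):
    """Drop every candidate that is an ancestor of another candidate."""
    return {
        c for c in candidates
        if not any(c != other and c in ancestors_or_self(other, broader)
                   for other in candidates)
    }


def greater_common_term(terms, broader=BROADER):
    """Most specific term(s) that contain all the mentioned terms."""
    terms = list(terms)
    if not terms:
        return set()
    common = set.intersection(*(ancestors_or_self(t, broader) for t in terms))
    return most_specific(common, broader)


def smallest_distinctors(terms, broader=BROADER):
    """Mentioned terms that have no mentioned narrower term."""
    return most_specific(set(terms), broader)


print(greater_common_term(["pulsar", "galaxy"]))
print(smallest_distinctors(["star", "pulsar", "galaxy"]))
```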

Thesauri data and all these functions should be available in R (in the
subject.ro package).


We need scientific guidance on where this technology should lead and what
use cases can be derived from it. Please send feedback.

Christian Tzurcanu, subject.ro