[open-linguistics] JSON-NLP and code for NLP Microservices

Damir Cavar dcavar at me.com
Fri May 17 14:34:19 UTC 2019


Dear colleagues,

please forgive me for posting a shameless advertisement for various
technologies and software packages in the NLP domain that I have been
working on over the last year or two. Although the papers and
proposals for demos or presentations documenting this work have all
been rejected by NAACL, ACL, and other conferences, my students and I
still believe that it is of interest to the broader NLP community, in
particular when it comes to the real deployment of NLP technologies
in scalable and big-data-related environments. We will put the
papers, a whitepaper, and the JSON-NLP and API documentation on arXiv
and publish the tutorials online, linked from http://nlp-lab.org/.

Here is the shameless ad:

First, we have specified a JSON-NLP Schema at:
https://github.com/dcavar/JSON-NLP

This schema provides (a|our) standard for NLP-pipeline outputs. It
goes far beyond the alternatives with respect to coverage, NLP
annotation data, and so on. Currently it covers only the syntactic
specification; "syntactic" here refers to the format in which we wrap
NLP outputs in JSON, not to a focus on syntax alone. ☺ At the next
level, we will expand the standard to cover the semantics (or at
least parts of it) of labels and content.

We have generated JSON validators for this schema in Java and Python,
with various other languages to follow soon. We are working on
optimizing the JSON format so that data objects can be instantiated
efficiently in Java and Python in real-world applications. The
JSON-NLP standard is closely related to JSON-LD in the sense that we
are building mappers from JSON-NLP to JSON-LD-based entity
representations, knowledge graphs, and general knowledge
representations.
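
For illustration, validating a JSON-NLP document in Python could look
roughly like the following sketch. It uses the generic jsonschema
package and assumes the schema file from the GitHub repository has
been saved locally as json-nlp-schema.json; the file names here are
assumptions for illustration, not the documented entry points of our
validators.

    import json
    from jsonschema import validate, ValidationError

    # Load the JSON-NLP schema (assumed to be saved locally from the
    # GitHub repository).
    with open("json-nlp-schema.json", encoding="utf-8") as schema_file:
        schema = json.load(schema_file)

    # Load some NLP pipeline output that claims to be JSON-NLP.
    with open("pipeline_output.json", encoding="utf-8") as output_file:
        document = json.load(output_file)

    try:
        validate(instance=document, schema=schema)
        print("Document conforms to the JSON-NLP schema.")
    except ValidationError as error:
        print("Validation failed:", error.message)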

We see this JSON format as an NLP information exchange format, in
particular for programming-language-independent, scalable, and
distributed RESTful NLP microservices that use state-of-the-art NLP
models and technologies.

For numerous excellent and useful NLP components and pipelines we
have written wrappers that emit JSON-NLP from the pipeline output (a
minimal sketch follows the list):
- Flair
  (https://github.com/dcavar/Flair-JSON-NLP and
   https://pypi.org/project/flairjsonnlp/)
- spaCy
  (https://github.com/dcavar/spaCy-JSON-NLP and
   https://pypi.org/project/spacyjsonnlp/)
- Stanford CoreNLP
- OpenNLP
- LingPipe
- Xrenner
- Polyglot
- various modules and components of NLTK (with broad coverage)
- Foma (using a Java JNI interface)
- ...
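
To give a feel for what such a wrapper does, here is a minimal,
simplified sketch built on the plain spaCy API. The dictionary keys
are an illustrative approximation only, not the exact JSON-NLP
schema; for the real format, see the GitHub repository and the
spacyjsonnlp package.

    import json
    import spacy

    nlp = spacy.load("en_core_web_sm")  # standard small English model

    def to_jsonnlp_like(text):
        """Wrap spaCy output in a simplified, JSON-NLP-like structure.
        The keys below are illustrative; the real schema is defined in
        the JSON-NLP repository."""
        doc = nlp(text)
        tokens = []
        for token in doc:
            tokens.append({
                "id": token.i,
                "text": token.text,
                "lemma": token.lemma_,
                "upos": token.pos_,
                "head": token.head.i,
                "deprel": token.dep_,
            })
        return {"meta": {"generator": "spaCy"}, "tokenList": tokens}

    print(json.dumps(to_jsonnlp_like("JSON-NLP wraps NLP output."),
                     indent=2))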

All these wrappers are written in the native language of the
respective modules, Python and Java in particular; C(++) technologies
are wrapped via the Java Native Interface (JNI). The wrappers provide
specific extensions and caching strategies that improve the
performance of the NLP components and pipelines in a scalable
production environment.

The Python modules are published on PyPI and available on GitHub; see
the links on http://nlp-lab.org/ for more details. The Java modules
will soon be available in your favorite Maven repository. These
modules include a class that consumes JSON-NLP and provides an API
with functions for detailed access to various kinds of linguistic
annotation, including the translation of phrase structure trees into
scope relations over terminals or phrases, the segmentation of
sentences into clauses, and the extraction of clause-level
information, hierarchies, and subtrees (dependency or constituency),
and so on.

Another advantage of the JSON-NLP standard is that we now have a
unifier that can merge the outputs of multiple NLP pipelines into one
JSON-NLP document. This allows us to combine pipeline outputs that
cover different levels of annotation, as well as to validate the
outputs against each other by detecting mismatches, and so on.
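
As a rough illustration of the idea (not the actual unifier
implementation), merging two pipeline outputs that annotate the same
tokens at different levels could be sketched as follows, reusing the
simplified token structure from the wrapper sketch above:

    def merge_annotations(base_doc, other_doc):
        """Naive merge sketch: copy annotation fields from other_doc
        onto base_doc for tokens with the same id and collect
        conflicting values. Illustrative only, not the JSON-NLP
        unifier."""
        other_by_id = {tok["id"]: tok for tok in other_doc["tokenList"]}
        conflicts = []
        for tok in base_doc["tokenList"]:
            other = other_by_id.get(tok["id"])
            if other is None:
                continue
            for key, value in other.items():
                if key == "id":
                    continue
                if key in tok and tok[key] != value:
                    conflicts.append((tok["id"], key, tok[key], value))
                else:
                    tok[key] = value
        return base_doc, conflicts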

In addition to these core JSON-NLP wrappers and translators, we
provide wrappers for the NLP components and pipelines that enable an
ensemble of RESTful microservices. In Python, each pipeline can be
launched as a Flask-based local microservice or wrapped in a
server-based WSGI environment. In Java (11 or newer) we use JAX-RS
and a WildFly-based infrastructure for deployment.
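
A minimal sketch of the Flask variant follows; the route name and
payload shape are assumptions for illustration, and the actual
wrappers define their own endpoints and configuration:

    from flask import Flask, jsonify, request
    import spacy

    app = Flask(__name__)
    nlp = spacy.load("en_core_web_sm")

    @app.route("/process", methods=["POST"])
    def process():
        # Expect a JSON payload such as {"text": "..."} and return a
        # JSON-NLP-like result.
        text = request.get_json(force=True).get("text", "")
        doc = nlp(text)
        tokens = [{"id": t.i, "text": t.text, "upos": t.pos_,
                   "deprel": t.dep_} for t in doc]
        return jsonify({"meta": {"generator": "spaCy"},
                        "tokenList": tokens})

    if __name__ == "__main__":
        # Local test server; use a proper WSGI server in production.
        app.run(host="127.0.0.1", port=5000)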

The NLP services can be used from any common RESTful environment,
that is, from essentially any language that supports HTTP
communication and some form of JSON processing.
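
For example, calling such a service from Python with the requests
library might look like the following; the endpoint path and
credentials are placeholders, not the actual service configuration:

    import requests

    # Placeholder URL and credentials; adjust them to the service you
    # are actually calling.
    response = requests.post(
        "https://api.linguistic.technology/process",
        json={"text": "The JSON-NLP services are language independent."},
        auth=("username", "password"),
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())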

Many more technologies in this context will appear soon on my lab
page, including wrappers and interfaces for common knowledge graphs,
graph extraction from text, various neural and symbolic similarity
algorithms, and so on.

I have set up a basic testing environment at
https://api.linguistic.technology/ (which requires a login and
password). It is performant and responsive, but it still runs on a
simple testing machine, and I consider all running services beta at
best. In the next few days we may provide access to a machine at
Indiana University with better bandwidth. If you are interested in
investing some time and giving me feedback on your potential use case
for the JSON-NLP standard, on the features and properties you are
interested in, or on issues you see, I would be very grateful. Please
contact me via email to discuss this. Any other feedback on the
Python code, the Java modules, or the JSON-NLP schema is very
welcome.

Sincerely

DC


-- 
Dr. Damir Cavar
Associate Professor
Indiana University
http://damir.cavar.me/
https://nlp-lab.org/


