[okfn-help] ORDF ticket #53: work with vanilla rdflib, remove 4store dependencies.

Graham Higgins gjh at bel-epa.com
Wed Jun 16 05:15:06 UTC 2010


I'll just take this opportunity to flag what might lie ahead and, for  
general interest, to fold in a few other discoveries that I made...

On 15 Jun 2010, at 18:52, William Waites wrote:
> I'm not convinced this is an issue.


FWIW, I took my lead from the members of the rdflib development team,
who seem reasonably convinced that it is an issue, e.g. this from May
12th:

http://groups.google.com/group/rdflib-dev/msg/9286c516853ec539

"The windows build was always problematic. That's why the C sparql  
parser was swapped for a pure python one later.
Unfortunately the pure python parser was never 100% polished, and was  
never released."


> At least now we are in a state where our code works but some queries
> that are possible to do (which are not the ones that our UI does in
> the normal course of events) may take unacceptably long to run.

Ah, I was viewing ORDF primarily as an independent library and I  
assessed the situation in terms of whether it could compromise the  
functioning of ORDF as a library. I have to admit, I didn't even think  
of considering the ticket from the more pragmatic perspective of ORDF  
being primarily a component of Bibliographica.

I guess that my other main concern was that both of the rdflib SPARQL  
parsers are acknowledged to have ragged edges which, in some  
circumstances, cause them to behave in an apparently mysterious  
fashion. There are a number of rdflib SPARQL tickets that have been  
marked 'wontfix' with an accompanying comment referencing a pure  
Python SPARQL in rdflib 2.5. This one, for example, observes that
OPTIONAL silently fails to function (except if you've chosen to use
MySQL for the back-end store, in which case it fails noisily
<http://code.google.com/p/rdflib/issues/detail?id=86>):

> What steps will reproduce the problem?
>
> >>> from rdflib.graph import ConjunctiveGraph
> >>> from rdflib.term import URIRef
> >>> x = ConjunctiveGraph()
> >>> x.add((URIRef("http://a"), URIRef("http://b"), URIRef("http://c")))
> >>> list(x.query("SELECT ?s  WHERE { ?s ?p ?o  } "))
> [rdflib.term.URIRef('http://a')]
> >>> # Expect to get the same result with OPTIONAL:
> >>> list(x.query("SELECT ?s  WHERE { OPTIONAL { ?s ?p ?o }  } "))
> []


<http://code.google.com/p/rdflib/issues/detail?id=92>

(Note, this issue persists for rdflib 2.4.2, 2.5.0 and 3.0.0)

From my perspective, composing SPARQL queries is a demanding enough
task in itself without complicating matters further for end-users
by subjecting them to silently failing bugs or missing features, or
even noisily failing ones with potentially misleading error messages,
such as this keyword case-sensitivity issue:

> What steps will reproduce the problem?
> 1. Try a sparql query using "construct" instead of "CONSTRUCT"
>
> What is the expected output? What do you see instead?
> Example error...
> SyntaxError: lexical error at line 1, column 58: no action found for
> 'construct { ?s swb:label ?o } where { ?s swb:label ?o . }'
> [ ... ]
> The probable cause is line 51 of lexical analyzer:
>        ./src/bison/SPARQLLiteralLexerPatterns.bgen.frag
>
> <pattern expression='CONSTRUCT'>
>     <token>CONSTRUCT</token>
>   </pattern>
>
> should be <pattern expression='construct|CONSTRUCT'>
>
> Probably need to also fix DESCRIBE and ASK

(That's not limited to just DESCRIBE and ASK. According to the SPARQL  
spec: "Keywords are matched in a case-insensitive manner" (A.8 Grammar  
<http://www.w3.org/TR/rdf-sparql-query/>) and inspection of the  
referenced bison source file reveals a number of other incorrectly  
case-sensitive keyword expressions.)
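
A rough sketch of one possible stop-gap: upper-casing keywords before
the query text reaches a case-sensitive parser. This is an illustration
only, not rdflib code. The name normalise_keywords, the keyword list
(a SPARQL 1.0 subset) and the token pattern are all assumptions made
for the example, and the pattern is far simpler than the real grammar.

```python
import re

# Subset of SPARQL 1.0 keywords (an assumption for this sketch). The keyword
# 'a' is left out on purpose: the spec treats it as case-sensitive.
KEYWORDS = {
    "BASE", "PREFIX", "SELECT", "CONSTRUCT", "DESCRIBE", "ASK", "DISTINCT",
    "REDUCED", "FROM", "NAMED", "WHERE", "ORDER", "BY", "ASC", "DESC",
    "LIMIT", "OFFSET", "OPTIONAL", "GRAPH", "UNION", "FILTER",
}

# Order matters: literals, IRIs, variables and prefixed names are matched
# first so that their contents are never rewritten.
TOKEN = re.compile(r"""
    "(?:[^"\\]|\\.)*"                        # double-quoted literal
  | '(?:[^'\\]|\\.)*'                        # single-quoted literal
  | <[^<>\s]*>                               # IRI reference
  | [?$]\w+                                  # variable
  | \w*:\w*                                  # prefixed name
  | (?<![\w:])(?P<word>[A-Za-z]+)(?![\w:])   # bare word
""", re.VERBOSE)


def normalise_keywords(query):
    """Upper-case bare words that are SPARQL keywords; leave the rest alone."""
    def fix(match):
        word = match.group("word")
        if word and word.upper() in KEYWORDS:
            return word.upper()
        return match.group(0)
    return TOKEN.sub(fix, query)


print(normalise_keywords(
    "construct { ?s swb:label ?o } where { ?s swb:label ?o . }"))
```

Text inside quoted literals, IRIs, variables and prefixed names is left
untouched, so a literal such as "select from where" survives unchanged.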

I think there's a real danger that this kind of thing will quickly  
drive end-users to distraction if they step outside the friendly  
confines of the UI --- which, IMO, is likely to happen sooner rather  
than later. In common with the OpenRDF/Sesame2 UI, the Bibliographica  
UI provides structured support for the addition of statements to the  
store, for browsing and navigating the store but retrieval of chunks  
of stuff from the store remains unstructured --- a textarea box and  
one's own SPARQL-fu. I suspect that this is very likely to remain the  
case until distinct domain-specific query patterns begin to emerge and  
explicit support for them can be added to the UI.

Even so, there are (to me) fascinatingly complex IA (Information
Architecture) issues about to present themselves. I found this
rdflib-dev discussion particularly instructive:
<http://www.mail-archive.com/dev@rdflib.net/msg00222.html>.
david "whit" morriss asks about an apparently simple query:

> I'm trying to do a match like so:
>
> select distinct ?person
> where {
>      ?person foaf:knows ?a .
>      ?person foaf:knows ?b .
> }
>
> In plain terms, I want all people who know both ?a and ?b. My query
> works if I include just a single pattern (ala "?person foaf:knows ?a"
> or "?person foaf:knows ?b") but fails if both patterns are present,
> despite [the existence of] triples representing persons knowing
> [both] ?a and ?b.
>
> is this a bug or is my syntax wrong (and if so, how does one do this?)

which leads on to Chimezie explaining the intricacies of  
DAWG_DATASET_COMPLIANCE and its impact on the querying of graphs:

"Some explanation of this switch (DAWG_DATASET_COMPLIANCE) might shed  
some light.  If you follow the SPARQL specification verbatim, the  
'default' graph is always the first one that is matched.  The  
'default' graph is a graph without an identifier - or at least, its  
identifier cannot be matched.  This breaks with RDFLib where *all*  
Graphs have an identifier which can be matched (note the Graph.quads  
method).  In the absence of a default graph specified explicitly as  
the dataset for the query (via FROM <..>), the default graph is an  
empty graph, so any pattern without a GRAPH directive will always  
match nothing! In order to comply with SPARQL, the assumption had to  
be made (when the compliance flag is set to True) that the default  
context for a ConjunctiveGraph is *the* default graph (as defined by  
SPARQL)."

(For info, the compliance flag is currently set to False in rdflib  
2.4.2 and 2.5.0)
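
To make the difference concrete, here is a toy model of the two
behaviours in plain Python. It is a sketch under simplifying
assumptions, not how rdflib implements the switch: the dataset is just
a dict of triple sets, and DEFAULT_GRAPH and match() are names invented
for the example.

```python
# Toy model only: a "dataset" maps a graph identifier to a set of triples.
# DEFAULT_GRAPH is a made-up key standing in for SPARQL's unnamed default graph.
DEFAULT_GRAPH = None


def match(dataset, pattern, compliant):
    """Return the triples matching `pattern`, where None acts as a wildcard."""
    if compliant:
        # Spec behaviour: with no FROM or GRAPH, only the default graph is searched.
        graphs = [dataset.get(DEFAULT_GRAPH, set())]
    else:
        # Non-compliant behaviour: every graph in the store is searched.
        graphs = list(dataset.values())
    return sorted({
        triple
        for graph in graphs
        for triple in graph
        if all(p is None or p == t for p, t in zip(pattern, triple))
    })


dataset = {
    "http://example.org/graph1": {("alice", "foaf:knows", "bob")},
    DEFAULT_GRAPH: set(),  # nothing was ever added to the default graph
}
pattern = (None, "foaf:knows", None)

print(match(dataset, pattern, compliant=True))   # default graph only
print(match(dataset, pattern, compliant=False))  # all graphs
```

With all the data sitting in a named graph, the compliant lookup finds
nothing while the non-compliant one finds the triple, which is the
situation whit ran into.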

After some discussion, whit ventures...

> > soooooo... if I add a FROM <my-graph-uri> my problem should be
> > solved?

and Chimezie responds:

> FROM <my-graph-uri> will cause the RDF dereferenced from that URI to
> comprise the default graph.  Since the default graph is matched by
> default, any SPARQL patterns in the query will be matched against that
> graph (as long as GRAPH is not used)


I'm already dreading having to explain this sort of thing to end-users.

What I find interesting about DAWG_DATASET_COMPLIANCE is that it's an  
official switch of SPARQL's graph-matching behaviour and as such, it  
is a perfectly valid requirement to expect to be supported. A user,  
independently discovering its existence and purpose in the SPARQL  
spec, is likely to expect that an arbitrary SPARQL endpoint provides  
the features consonant with the advertised SPARQL version for that  
endpoint, including access to switchable compliance.

It seems likely to impact users directly, depending on the nature of  
the SPARQL queries they need to pose --- this is drewpca on the subject:

> I have it turned off in my biggest project, so I'm in favor of  
> making that mode available. If I understand it correctly,  
> DAWG_DATASET_COMPLIANCE=False (search all graphs) is also the  
> default behavior for the sesame http sparql server. Proposal: move  
> this setting to a kw arg on the query method. I suspect most people  
> with projects of any size wrap graph.query with their own call  
> anyway, to pass the same initNs in each time. A  
> DAWG_DATASET_COMPLIANCE flag would not be a burden to that group.  
> Other people would either ignore the setting (and use default or  
> explicit graphs in all queries) or set it per-query if they have to.
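
The kind of wrapper drewpca mentions might look something like the
sketch below. The only thing taken from the discussion is that the
query method accepts an initNs keyword; make_query and FakeGraph are
invented for the example, and FakeGraph merely records its arguments so
that the snippet runs without a store. No per-query compliance keyword
is assumed to exist; extra keywords are simply passed through.

```python
def make_query(graph, init_ns):
    """Return a function that runs queries with the same prefix bindings."""
    def run(text, **kwargs):
        # Extra keyword arguments are passed straight through, so a per-query
        # setting could be forwarded if the query method ever grew one.
        return graph.query(text, initNs=init_ns, **kwargs)
    return run


class FakeGraph:
    """Stand-in for a real graph object; it only records what it was given."""

    def query(self, text, initNs=None, **kwargs):
        return {"text": text, "initNs": initNs, "extra": kwargs}


query = make_query(FakeGraph(), {"foaf": "http://xmlns.com/foaf/0.1/"})
result = query("SELECT ?p WHERE { ?p foaf:knows ?a }")
print(result["initNs"])
```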


Again, I'm not looking forward to explaining this to users.

At the risk of overdoing it... I took Whit's dataset and the three  
SPARQL queries that he describes and I tried them out in three  
different versions of rdflib: 2.4.2, 2.5.0 and 3.0.0 [code attached].  
It mostly worked as Chimezie described... except that in each  
successful trial, the DISTINCT keyword is apparently ignored and two  
identical results are returned:

Query:
======
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX annotea: <http://www.w3.org/2000/10/annotation-ns#>
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> SELECT DISTINCT ?entry_uri
> WHERE {
>
>       {
>         ?annotation annotea:related ?entry_uri .
>         ?annotation annotea:body ?term_uri .
>         ?term_uri rdfs:label ?term.filter(?term = 'test2')
>       }
>        UNION
>       {
>         ?annotation annotea:related ?entry_uri .
>         ?annotation annotea:body ?term_uri .
>         ?term_uri rdfs:label ?term.filter(?term = 'test3')
>       }
> }

Result:
=======
> <sparql:sparql xmlns:sparql="http://www.w3.org/2005/sparql-results#">
>   <sparql:head>
>     <sparql:variable name="entry_uri"/>
>     <sparql:variable name="term"/>
>     <sparql:variable name="annotation"/>
>     <sparql:variable name="term_uri"/>
>   </sparql:head>
>   <sparql:results ordered="false" distinct="true">
>     <sparql:result>
>       <sparql:binding name="entry_uri">
>         <sparql:uri>http://annotation.openplans.org/entry/urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</sparql:uri>
>       </sparql:binding>
>     </sparql:result>
>     <sparql:result>
>       <sparql:binding name="entry_uri">
>         <sparql:uri>http://annotation.openplans.org/entry/urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</sparql:uri>
>       </sparql:binding>
>     </sparql:result>
>   </sparql:results>
> </sparql:sparql>
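
Until DISTINCT is honoured, duplicate rows like the two above could be
dropped on the client side. A minimal sketch, assuming each result row
can be turned into a tuple of hashable values (distinct_rows is a name
made up for the example, and plain strings stand in for URI terms):

```python
def distinct_rows(rows):
    """Drop repeated rows while keeping the first occurrence of each, in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ("http://example.org/entry/1",),
    ("http://example.org/entry/1",),
    ("http://example.org/entry/2",),
]
print(distinct_rows(rows))
```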



The same dataset and queries produced the expected singleton result
when tested on a sesame2 server.

Having pointed out an example of apparently mysterious SPARQL  
behaviour in rdflib, I should also acknowledge that SPARQL itself  
takes few prisoners. Whit observes that the UNION-style approach is  
the only option, as a naive query (shown below) simply returns no  
results:

> WHERE  {
>       ?annotation annotea:related ?entry_uri .
>       ?annotation annotea:body ?term_uri .
>       ?term_uri rdfs:label ?term.filter(?term = 'test2')
>       ?term_uri rdfs:label ?term.filter(?term = 'test3')
> }


Unfortunately, the user is left wondering whether there really /are/  
no results or whether there's a flaw in the logic of the construction  
of the SPARQL query. It's in these circumstances that oddities in  
SPARQL behaviour can be extremely misleading and very exasperating,  
leading to poor levels of user experience - as even seasoned developer  
whit observes: "ok... makes me feel a little less nuts."
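
The naive form comes back empty because both FILTERs constrain the same
?term within one group, so they are combined with AND. The same logic
in plain Python, using made-up data in place of the store:

```python
# Made-up labels standing in for the rdfs:label triples in the store.
labels = {
    "term:1": "test2",
    "term:2": "test3",
    "term:3": "other",
}

# Naive query: both FILTERs apply to the same ?term, so they are ANDed,
# and no single label can equal 'test2' and 'test3' at once.
naive = [uri for uri, term in labels.items()
         if term == "test2" and term == "test3"]

# UNION query: each branch applies one FILTER, then the results are combined.
union = ([uri for uri, term in labels.items() if term == "test2"]
         + [uri for uri, term in labels.items() if term == "test3"])

print(naive)
print(union)
```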

(BTW, the problem that whit experienced with 3 UNIONs has been fixed)

On the whole, I thought there was just too much downside to embracing  
rdflib in its current state but then again I'm not in the best  
position to weigh up the pros and cons vs how frequently people are  
encountering show-stopping 4store compile/install problems.

-- 
Cheers,

Graham

http://www.linkedin.com/in/ghiggins


-------------- next part --------------
A non-text attachment was scrubbed...
Name: whitquery.py
Type: text/x-python-script
Size: 7158 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/okfn-help/attachments/20100616/96afcbbf/attachment-0001.bin>