[openbiblio-dev] Opaque URIs - to encode or not encode?

Mon Nov 1 15:15:34 UTC 2010

In giving URIs for authors, publishers, and so on - a reference-able URI
in place of a simple literal value - we have a number of criteria:

1 - It has to be a unique URI: specific to a given literal instance,
and also unique within a record.

2 - It has to be as globally unique as possible, as the URI provides a
hook on which services like sameas.org can function.

3 - The conjured ID should also be a valid URI - which places
limitations on character composition so that it is 'url safe' and so on.
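
To make criterion 3 concrete, here is a minimal Python sketch of
percent-encoding a literal so that it can sit in the path of a URI. The
helper name literal_to_uri is hypothetical, and the prefix
http://host/literals/ is only the placeholder used in the examples
further down; this illustrates the 'url safe' constraint rather than
proposing a scheme.

```python
from urllib.parse import quote


def literal_to_uri(prefix, literal):
    """Append a percent-encoded literal to a URI prefix.

    safe="" means that "/" is escaped too, so a literal can never
    introduce an extra path segment. Non-ASCII characters are encoded
    as UTF-8 before being percent-escaped.
    """
    return prefix + quote(literal, safe="")


if __name__ == "__main__":
    # Hypothetical prefix, matching the placeholder host used below.
    print(literal_to_uri("http://host/literals/", "John S Smythe"))
```

The result is a valid URI, but note that it is no longer opaque: the
literal is readable (and editable) in the identifier itself.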

------------

Possibilities then:

- Claim a URI prefix and have an auto-incrementing id generated for any
literal value or node you wish to reference. Eg:

_:Book a bibo:Book ;
   dcterms:creator  [ rdf:value "John S Smythe" ] .

-- to ->

@prefix myprefix: <http://host/literals/> .

_:Book a bibo:Book ;
   dcterms:creator myprefix:102 .

myprefix:102 a foaf:Person ;
   foaf:name "John S Smythe" .

Pros:
 Nicer looking URIs
 Easier URIs to copy and paste

Cons:
- Every source that does this has to mint its own prefix, as well as
making sure that the URIs dereference in some manner. Using an
info:-based prefix might avoid this if resources are tight.

- Possibility of collisions unless the handing out of ids is handled
properly, for example using Redis's INCR command, or Flickr's notable
use of a MySQL-specific SQL command [*]

- Unfounded attribution of significance to the literal number.

- Very difficult to re-run the process over a portion of data and get
the same numbers that were generated on a first pass.

* -
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

Summary - Good for a single pass over a dataset, fast and as reliable as
your chosen ticket/id server.
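
A rough Python sketch of the auto-incrementing approach, to show both
the collision point and the re-run problem. The TicketServer class is an
in-process stand-in that I am assuming for illustration; a real
deployment would put the counter in something shared, such as Redis INCR
or a Flickr-style ticket table. The names TicketServer and assign_ids,
and the second author name, are hypothetical.

```python
import threading


class TicketServer:
    """In-process stand-in for an atomic counter (an assumption here;
    a shared service would be needed across processes or machines)."""

    def __init__(self, prefix, start=1):
        self._prefix = prefix
        self._next = start
        self._lock = threading.Lock()

    def mint(self):
        # The lock makes read-and-increment atomic, which is what
        # prevents two callers from being handed the same id.
        with self._lock:
            n = self._next
            self._next += 1
        return f"{self._prefix}{n}"


def assign_ids(literals, server):
    """Mint one URI per literal, in the order the literals arrive."""
    return {literal: server.mint() for literal in literals}


if __name__ == "__main__":
    prefix = "http://host/literals/"  # placeholder prefix from the example
    first = assign_ids(["John S Smythe", "Jane Doe"], TicketServer(prefix))
    # Re-running over the same data in a different order gives
    # different numbers for the same literals.
    second = assign_ids(["Jane Doe", "John S Smythe"], TicketServer(prefix))
    print(first["John S Smythe"], second["John S Smythe"])
```

The ids depend entirely on processing order and on the counter's state,
which is why a second pass over part of the data will not reproduce
them unless the literal-to-id mapping is stored somewhere.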

UUID(4 or 5) ids:

previous example -- to -->

_:Book a bibo:Book ;
   dcterms:creator <urn:uuid:63e8fa91..........> .

<urn:uuid:63e8fa91..........> a foaf:Person ;
   foaf:name "John S Smythe" .

Pros:
  Easy to generate
  Given a sufficient source of entropy, the UUIDs can be generated
independently and are very, very unlikely to collide.
  UUIDs from distributed systems can be aggregated with little fear of
collision - a collision would imply that the URIs are actually from the
same original source.

Cons:
  Very large size - the same length that lets them avoid random
collision also makes them difficult to use (from a human perspective)
  Etc etc
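
For comparison, a small Python sketch of the two UUID flavours using the
standard uuid module. Version 4 is random; version 5 is a SHA-1 hash of
a namespace plus a name, so the same input always gives the same UUID.
The namespace derivation and the idea of hashing a record id together
with the literal are my assumptions for illustration, as are the helper
names random_uri and name_based_uri and the record id "record-1".

```python
import uuid

# Hypothetical namespace, derived from the placeholder prefix used earlier.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://host/literals/")


def random_uri():
    """UUID4: independent of the data, different on every call."""
    return uuid.uuid4().urn


def name_based_uri(record_id, literal):
    """UUID5: deterministic for a given (record_id, literal) pair.

    Including the record id keeps the same literal in two different
    records from sharing one URI (an assumption about what is wanted).
    """
    return uuid.uuid5(NAMESPACE, f"{record_id}|{literal}").urn


if __name__ == "__main__":
    print(random_uri())
    print(name_based_uri("record-1", "John S Smythe"))
```

With UUID5 a re-run over a portion of the data reproduces the same URIs,
which the auto-incrementing ids cannot do without stored state; with
UUID4 it does not, but no shared namespace or naming rule is needed.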