[openbiblio-dev] Opaque URIs - to encode or not encode?
Ben O'Steen
bosteen at gmail.com
Mon Nov 1 15:15:34 UTC 2010
In giving URIs for authors, publishers, and so on - a reference-able URI
in place of a simple literal value - we have a number of criteria:
1 - It has to be a unique URI, specific not only to a given literal
instance, but within a record.
2 - It has to be as globally unique as possible, as the URI provides a
hook on which services like sameas.org can function.
3 - The conjured ID should also be a valid URI - which places
limitations on character composition so that it is 'url safe' and so on.
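To make criterion 3 concrete, here is a minimal sketch of the "encode" option: percent-encoding a literal so it can sit in the path of a URI. It assumes Python's standard urllib.parse and reuses the placeholder prefix http://host/literals/ from the example further down; the function name literal_to_uri is made up for illustration.

```python
from urllib.parse import quote, unquote

# Placeholder prefix, borrowed from the example later in this mail.
PREFIX = "http://host/literals/"


def literal_to_uri(literal: str) -> str:
    """Percent-encode a literal (as UTF-8) and append it to the prefix.

    safe="" means "/" is escaped too, so the literal cannot
    introduce extra path segments.
    """
    return PREFIX + quote(literal, safe="")


uri = literal_to_uri("John S Smythe")
print(uri)                         # http://host/literals/John%20S%20Smythe
print(unquote(uri[len(PREFIX):]))  # John S Smythe
```

Note that an encoded literal on its own only satisfies criterion 3: two different people with the same name would get the same URI, so criteria 1 and 2 still need something extra.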
------------
Possibilities then:
- Claim a URI prefix and have an auto-incrementing id generated for any
literal value or node you wish to reference. Eg:
_:Book a bibo:Book ;
    dcterms:creator [ rdf:value "John S Smythe" ] .
-- to ->
@prefix myprefix: <http://host/literals/> .
_:Book a bibo:Book ;
    dcterms:creator myprefix:102 .
myprefix:102 a foaf:Person ;
    foaf:name "John S Smythe" .
Pros:
Nicer looking URIs
Easier URIs to copy and paste
Cons:
- Every source that does this has to mint its own prefix and make sure
that the URIs dereference in some manner. Using an info:-based prefix
might avoid this if resources are tight.
- Possibility of collisions unless the handing out of ids is handled
properly, for example using Redis's INCR command, or Flickr's notable
use of a MySQL-specific SQL command [*]
- Unfounded attribution of significance to the literal number.
- Very difficult to re-run the process over a portion of data and get
the same numbers that were generated on a first pass.
[*] http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
Summary - Good for a single pass over a dataset, fast and as reliable as
your chosen ticket/id server.
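As a rough illustration of the ticket approach and of the re-run problem, here is a small in-process sketch. It is not Redis or Flickr's ticket server: the TicketMinter class, the assign helper and the second name "Jane Doe" are all invented for the example, and a lock around a counter stands in for whatever atomic increment your real id server provides.

```python
import itertools
import threading

# Placeholder prefix from the example above.
PREFIX = "http://host/literals/"


class TicketMinter:
    """Hands out sequential ids under a prefix (hypothetical helper)."""

    def __init__(self, prefix: str, start: int = 1):
        self._prefix = prefix
        self._counter = itertools.count(start)
        self._lock = threading.Lock()  # stand-in for an atomic increment

    def mint(self) -> str:
        with self._lock:
            return f"{self._prefix}{next(self._counter)}"


def assign(names, minter):
    """Give every name the next free ticket, in input order."""
    return {name: minter.mint() for name in names}


# First pass over the whole dataset.
first_pass = assign(["John S Smythe", "Jane Doe"], TicketMinter(PREFIX))
# Re-running over just a portion starts counting again.
rerun = assign(["Jane Doe"], TicketMinter(PREFIX))

print(first_pass["Jane Doe"])  # http://host/literals/2
print(rerun["Jane Doe"])       # http://host/literals/1
```

The number depends only on the order in which literals reach the counter, which is why a partial re-run will not reproduce the first pass unless the original assignments are stored somewhere and looked up.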
- UUID (version 4 or 5) ids:
previous example -- to -->
_:Book a bibo:Book ;
    dcterms:creator <urn:uuid:63e8fa91..........> .
<urn:uuid:63e8fa91..........> a foaf:Person ;
    foaf:name "John S Smythe" .
Pros:
Easy to generate
Given a sufficient source of entropy, the UUIDs can be generated
independently and are very, very unlikely to collide.
UUIDs from distributed systems can be aggregated with little fear of
collision - a collision would imply that the URIs are actually from the
same original source.
Cons:
Very large size - the same size that lets them avoid random collision
also makes them difficult to use (from a human perspective)
Etc etc
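For what it's worth, both UUID flavours are a couple of lines with Python's standard uuid module. The sketch below is only an illustration: the namespace derived from http://host/literals/, the record id "rec-1" and the "record|literal" naming scheme are assumptions of mine, not something agreed on this list.

```python
import uuid

# Hypothetical namespace: a version 5 UUID derived from the placeholder prefix.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://host/literals/")


def random_uri() -> str:
    """Version 4: random, so every call gives a different URN."""
    return uuid.uuid4().urn


def name_based_uri(record_id: str, literal: str) -> str:
    """Version 5: SHA-1 of namespace + name, so the same input
    always gives the same URN."""
    return uuid.uuid5(NAMESPACE, f"{record_id}|{literal}").urn


a = name_based_uri("rec-1", "John S Smythe")
b = name_based_uri("rec-1", "John S Smythe")
print(a == b)                        # True
print(random_uri() == random_uri())  # False (barring a collision)
print(a.startswith("urn:uuid:"))     # True
```

The version 5 form is deterministic, so re-running it over a portion of the data gives the same ids as the first pass, as long as the namespace and the name string are built the same way each time. Including the record id in the name is one way to keep the id specific to a literal within a record.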