What makes good linked data?

Readers of this blog probably take it for granted that publishing more linked data is a Good Thing. Linked data should follow the established design principles, but that still leaves many ways to represent your data as RDF. Which is the ‘best’ way? How do we decide if our approach is good or bad? How do we improve?

So here’s my first attempt at quality criteria for linked data.

1. Valid

Step 1 is obvious: the syntax has got to be correct. There are some handy services to check this for you, for example http://www.w3.org/RDF/Validator/ for RDF/XML and http://validator.w3.org/ for HTML+RDFa.

2. Accurate

The RDF representation of the information should say what the author meant to say. This is clearly important, but difficult to assess automatically. Common mistakes include mixing up a URI for something (France, say) and the URI for a document about that thing (e.g. the Wikipedia page on France); or using an inverse functional property (like foaf:homepage) inappropriately, causing reasoners to conclude falsely that two distinct things are the same.
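To make the second mistake concrete, here is a minimal sketch in plain Python (no RDF library), with triples held as tuples. The ex: identifiers and the shared homepage are invented for illustration, and the function mimics only the single inference a reasoner draws from an inverse functional property; it is not a real reasoner.

```python
from collections import defaultdict

# Triples as (subject, predicate, object) tuples. All ex: names are made up.
# Alice and Bob are different people who both (wrongly) list their team's
# page as their own foaf:homepage.
triples = {
    ("ex:alice", "foaf:homepage", "http://example.org/team"),
    ("ex:bob", "foaf:homepage", "http://example.org/team"),
    ("ex:carol", "foaf:homepage", "http://example.org/carol"),
}

# foaf:homepage is declared inverse functional in FOAF:
# one homepage value identifies exactly one subject.
INVERSE_FUNCTIONAL = {"foaf:homepage"}


def inferred_same_as(triples, ifps=INVERSE_FUNCTIONAL):
    """Return pairs of subjects that share a value for an inverse functional property."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)

    pairs = set()
    for subjects in subjects_by_value.values():
        ordered = sorted(subjects)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


merged = inferred_same_as(triples)
print(merged)
```

The only pair reported is Alice and Bob, which is exactly the false "same thing" conclusion described above: the data is syntactically valid, but it does not say what the author meant.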

3. Highly linked

The clue is in the name! Linked data should involve lots of links. Linking your information to related external information allows it to be used by search engines and data browsers. It produces a more highly connected overall graph, which can then be queried in more ways.
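One rough way to see how linked a dataset is: count how many triples point at URIs on your own host versus someone else's. The sketch below is only an illustration in plain Python. The example.org URIs and the capitalOf property are invented, and treating any string that starts with http:// or https:// as a URI (rather than a literal) is a simplifying assumption.

```python
from urllib.parse import urlparse

# A tiny made-up dataset: (subject, predicate, object) tuples.
triples = [
    ("http://example.org/id/paris",
     "http://example.org/def/capitalOf",
     "http://example.org/id/france"),
    ("http://example.org/id/france",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/France"),
    ("http://example.org/id/france",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "France"),
]


def is_http_uri(value):
    # Simplification: anything that looks like an HTTP(S) URI is a link,
    # everything else is treated as a literal.
    return value.startswith(("http://", "https://"))


def link_counts(triples, own_host):
    """Return (internal, external) counts of object links relative to own_host."""
    internal = external = 0
    for _s, _p, o in triples:
        if not is_http_uri(o):
            continue  # literal value, not a link
        if urlparse(o).netloc == own_host:
            internal += 1
        else:
            external += 1
    return internal, external


counts = link_counts(triples, "example.org")
print(counts)
```

A dataset with an external count of zero is an island: perfectly good RDF, perhaps, but not yet linked data in any useful sense.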

4. Use existing resource identifiers where possible

Related to the recommendation to link to external data, you should use existing identifiers for resources, where suitable ones exist. If you use an existing identifier for France, http://dbpedia.org/resource/France for example, then your information on France can easily be linked to other related data. If you make up your own identifier then machines don’t know that it refers to the same thing. You can assert that using owl:sameAs, but that still adds another layer of complication. Stefano Mazzocchi explains the importance of relational density in RDF graphs.
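The difference can be sketched in a few lines of plain Python, again with triples as tuples. The ex: properties and the http://example.org/id/france identifier are invented for illustration, and the alias lookup follows owl:sameAs for one hop only, which is far less than a real reasoner would do.

```python
FRANCE = "http://dbpedia.org/resource/France"
SAME_AS = "owl:sameAs"

# Somebody else's data about France.
theirs = {(FRANCE, "ex:capital", "Paris")}

# Option A: our data re-uses the existing identifier.
ours_shared = {(FRANCE, "ex:wineRegion", "Bordeaux")}

# Option B: our data mints its own identifier and links it with owl:sameAs.
OUR_FRANCE = "http://example.org/id/france"
ours_minted = {
    (OUR_FRANCE, "ex:wineRegion", "Bordeaux"),
    (OUR_FRANCE, SAME_AS, FRANCE),
}


def describe(graph, subject):
    """Everything said directly about subject."""
    return {(p, o) for s, p, o in graph if s == subject}


def describe_with_same_as(graph, subject):
    """Like describe(), but also follows owl:sameAs one hop in either direction."""
    aliases = {subject}
    for s, p, o in graph:
        if p == SAME_AS:
            if s == subject:
                aliases.add(o)
            if o == subject:
                aliases.add(s)
    return {(p, o) for s, p, o in graph if s in aliases and p != SAME_AS}


merged_shared = theirs | ours_shared   # merging is just set union
merged_minted = theirs | ours_minted

print(describe(merged_shared, FRANCE))
print(describe(merged_minted, FRANCE))
print(describe_with_same_as(merged_minted, FRANCE))
```

With a shared identifier, a plain lookup on the merged data already finds both facts. With a minted identifier, the same lookup finds only the external fact until the consumer does the extra work of resolving owl:sameAs, which is the additional layer of complication mentioned above.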

5. Use existing vocabularies where possible

Using existing ontologies or vocabularies for the classes and properties in your data makes it easier for applications to process and easier to design queries that use it. Of course, we’re still in the early stages of the semantic web and for many purposes, no suitable ontology exists. If you have to create your own ontology you should publish it as OWL and also as HTML, documenting how it should be used. FOAF is a good example of this.

6. Document your data

OK so your data is fully readable by machines, but the people who direct those machines need to know what to point them at. The usefulness of many data sets could be greatly enhanced with a small amount of documentation for us humans, explaining what the data is all about, its provenance and context, the main classes involved, and providing some example SPARQL queries to get people started with exploring it.

Can anyone suggest improvements or additions to this list?

This entry was posted on Tue, 20 Oct 2009 09:15:00 GMT.


Comments


  1. Richard Cyganiak about 3 hours later:

    Bill, nice piece. I would add that internal links are just as important as external ones, so you should make those connections between your resources explicit.

    I disagree on the advice about re-using resource identifiers where possible. That’s one of those romantic old Semantic Web ideas that turns out to be hopelessly naïve in practice. Identifiers are the bedrock of your dataset. For any nontrivial dataset, you have to use heuristic matching algorithms to find the external identifiers that connect to your data. Those heuristics will sometimes be wrong, picking the wrong identifier. Now, if you re-use the external identifiers, then all information about those mismatched resources is wrong. If you manage your own identifier scheme, and use an owl:sameAs link, then only the link will be wrong, while your own dataset is still internally consistent. Also, all external identifier schemes change and morph and evolve, and you’ll keep chasing updates because external identifiers have changed. In my experience, keeping “publishing” and “external interlinking” as two separate steps avoids a ton of hassle.

    (Stefano’s piece is very insightful about the need for increased density in the LOD cloud, but misses the point that third parties can take existing RDF datasets, consolidate them using what he calls a-priori reconciliation, and re-publish the result as linked data. Think a hundred domain-specific Freebases that import “rough” linked data and re-publish consolidated linked data. )

  2. Bill Roberts about 3 hours later:

    Just come across this article by Ian Davis, posted earlier today, relevant to this discussion:

    More than the minimum

  3. Kingsley Idehen about 6 hours later:

    Bill,

    Nice post!

    I concur with Richard’s comments above re. Identifiers. This is basically what we are doing via the Proxy/Wrapper Identifiers (generic HTTP URIs) that we generate via the Virtuoso Sponger Middleware that’s integrated into our SPARQL engine and also offered to the public via the URIBurner service [1][2].

    Note the use of owl:sameAs and seeAlso to expose what we’ve found elsewhere, and if you get a little deeper into inference rules, you’ll notice how we unveil the power of some of the lesser-understood aspects of the Linked Data and OWL technology intersections.

    Links to tools that demonstrate the points above:

    1. http://uriburner.com
    2. http://uriburner.com/fct - via navigation options you can enable inference-rules and see data reconciliation in action based on sponged/crawl data etc.
    3. http://ode.openlinksw.com (which provides bookmarklets for talking to Sponger Instances associated with any Virtuoso instance).

    Kingsley

  4. Simon Grant 17 days later:

    Rather than owl:sameAs, have people considered skos:closeMatch and skos:exactMatch? (see http://www.w3.org/TR/skos-reference/#mapping ). Would those not be more flexible and appropriate, or am I missing something here?

    Simon

  5. Martin Brousseau 2 months later:

    Interesting post Bill. I am wondering why it’s so hard to find such best practices on the W3C and the LOD sites.

    I agree with Richard regarding URI persistence. This feature of the Semantic Web is called the Nonunique Naming Assumption. We have to assume that some Web resource might be referred to using different names by different people.

    Also, I think that URI aliases (owl:sameAs) should link to a resource that uses an RDF representation. HTML and other representation types should be stated using rdfs:seeAlso.

    In response to Simon, even if SKOS does not enforce a domain/range for the object properties you refer to, I think they should be applied/reserved to resources of type skos:Concept. sameAs is defined in the OWL schema and can be applied to any URI.
