What makes good linked data?
Readers of this blog probably take it for granted that publishing more linked data is a Good Thing. Linked data should follow the established design principles, but that still leaves many ways to represent your data as RDF. Which is the ‘best’ way? How do we decide if our approach is good or bad? How do we improve?
So here’s my first attempt at quality criteria for linked data.
1. Correct syntax
Step 1 is obvious: the syntax has to be correct. There are some handy services to check this for you, for example http://www.w3.org/RDF/Validator/ for RDF/XML and http://validator.w3.org/ for HTML+RDFa.
2. Correct semantics
The RDF representation of the information should say what the author meant to say. This is clearly important, but difficult to assess automatically. Common mistakes include mixing up the URI for a thing (France, say) with the URI for a document about that thing (e.g. the Wikipedia page on France), or using an inverse functional property (like foaf:homepage) inappropriately, causing reasoners to conclude falsely that two distinct things are the same.
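To make the inverse functional property mistake concrete, here is a minimal Python sketch of the inference a reasoner would draw. It assumes triples are plain (subject, predicate, object) tuples of strings; the example.org people and the `infer_same_as` helper are made up for illustration and are not part of any RDF library.

```python
from collections import defaultdict

FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"

# Hypothetical data: Alice and Bob were both given their employer's
# site as a foaf:homepage, which is the mistake described above.
triples = [
    ("http://example.org/people/alice", FOAF_HOMEPAGE, "http://example.org/company"),
    ("http://example.org/people/bob", FOAF_HOMEPAGE, "http://example.org/company"),
    ("http://example.org/people/carol", FOAF_HOMEPAGE, "http://example.org/~carol"),
]


def infer_same_as(triples, inverse_functional):
    """Return pairs of subjects a reasoner would treat as the same thing.

    For an inverse functional property, two subjects that share a value
    must denote the same resource.
    """
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_value[(p, o)].add(s)

    pairs = set()
    for subjects in subjects_by_value.values():
        ordered = sorted(subjects)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


merged = infer_same_as(triples, {FOAF_HOMEPAGE})
```

With this data, `merged` contains the single pair (alice, bob): two distinct people have been declared identical, while Carol, who has her own homepage, is unaffected.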
3. Highly linked
The clue is in the name! Linked data should involve lots of links. Linking your information to related external information allows it to be used by search engines and data browsers. It also produces a more highly connected overall graph, which can then be queried in more ways.
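One rough way to put a number on "highly linked" is to count how many triples point outside your own site. The sketch below is only an illustration: it assumes triples are (subject, predicate, object) tuples of strings, treats any object that parses with a host name as a URI, and uses a made-up example.org dataset. The `external_link_ratio` function is hypothetical, not an established metric.

```python
from urllib.parse import urlparse


def external_link_ratio(triples, own_host):
    """Fraction of triples whose object is a URI on a different host."""
    if not triples:
        return 0.0
    external = 0
    for _s, _p, o in triples:
        host = urlparse(o).netloc
        # Plain literals such as "Paris" have no host and are skipped.
        if host and host != own_host:
            external += 1
    return external / len(triples)


# Hypothetical dataset published at example.org
triples = [
    ("http://example.org/id/paris", "http://www.w3.org/2000/01/rdf-schema#label", "Paris"),
    ("http://example.org/id/paris", "http://example.org/def/capitalOf", "http://dbpedia.org/resource/France"),
    ("http://example.org/id/paris", "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/Paris"),
    ("http://example.org/id/paris", "http://example.org/def/partOf", "http://example.org/id/ile-de-france"),
]

ratio = external_link_ratio(triples, "example.org")
```

Here two of the four triples link out to DBpedia, so `ratio` is 0.5. A real measure would need to tell URIs from literals properly, which this simplification does not.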
4. Use existing resource identifiers where possible
Related to the recommendation to link to external data, you should use existing identifiers for resources, where suitable ones exist. If you use an existing identifier for France, http://dbpedia.org/resource/France for example, then your information on France can easily be linked to other related data. If you make up your own identifier then machines don’t know that it refers to the same thing. You can assert the equivalence using owl:sameAs, but that adds another layer of complication. Stefano Mazzocchi explains the importance of relational density in RDF graphs.
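The difference shows up as soon as two datasets are merged. In this sketch, graphs are just Python sets of (subject, predicate, object) tuples; the example.org and example.com URIs, the property names, and both helper functions are invented for illustration. The alias lookup follows only a single owl:sameAs hop, which is a deliberate simplification.

```python
FRANCE = "http://dbpedia.org/resource/France"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Two hypothetical publishers that reuse the DBpedia identifier...
dataset_a = {(FRANCE, "http://example.org/def/report", "http://example.org/reports/1")}
dataset_c = {(FRANCE, "http://example.net/def/survey", "http://example.net/surveys/7")}

# ...and one that minted its own identifier for the same country.
OWN_FRANCE = "http://example.com/place/fr"
dataset_b = {(OWN_FRANCE, "http://example.com/def/map", "http://example.com/maps/fr")}


def describe(graph, resource):
    """All (predicate, object) pairs stated directly about a resource."""
    return {(p, o) for s, p, o in graph if s == resource}


def describe_with_same_as(graph, resource):
    """Like describe(), but also follows one hop of owl:sameAs links."""
    aliases = {resource}
    for s, p, o in graph:
        if p == OWL_SAME_AS:
            if s == resource:
                aliases.add(o)
            if o == resource:
                aliases.add(s)
    return {(p, o) for s, p, o in graph if s in aliases and p != OWL_SAME_AS}


# Shared identifier: a plain union is enough to combine the facts.
shared = describe(dataset_a | dataset_c, FRANCE)

# Private identifier: the union alone misses dataset_b's facts...
missed = describe(dataset_a | dataset_b, FRANCE)

# ...until someone asserts the link and the consumer resolves it.
link = {(OWN_FRANCE, OWL_SAME_AS, FRANCE)}
resolved = describe_with_same_as(dataset_a | dataset_b | link, FRANCE)
```

`shared` holds two facts with no extra work, `missed` holds only one, and `resolved` recovers the second fact only because both the owl:sameAs triple and the extra lookup step are in place. That extra step is the layer of complication.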
5. Use existing vocabularies where possible
Using existing ontologies or vocabularies for the classes and properties in your data makes the data easier for applications to process and makes queries against it easier to design. Of course, we’re still in the early stages of the semantic web and for many purposes, no suitable ontology exists. If you have to create your own ontology you should publish it as OWL and also as HTML, documenting how it should be used. FOAF is a good example of this.
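A quick self-check is to tally which vocabularies your predicates come from. This is a rough sketch, assuming triples are (subject, predicate, object) tuples of strings. The short list of namespaces is illustrative only (FOAF, Dublin Core Terms and RDFS), and the `vocabulary_usage` helper and the example.org data are made up.

```python
from collections import Counter

# A few widely used namespaces; extend the list to suit your domain.
KNOWN_VOCABULARIES = {
    "http://xmlns.com/foaf/0.1/": "FOAF",
    "http://purl.org/dc/terms/": "Dublin Core Terms",
    "http://www.w3.org/2000/01/rdf-schema#": "RDFS",
}


def vocabulary_usage(triples):
    """Count predicates per known vocabulary; the rest go under 'other'."""
    counts = Counter()
    for _s, p, _o in triples:
        for namespace, name in KNOWN_VOCABULARIES.items():
            if p.startswith(namespace):
                counts[name] += 1
                break
        else:
            # No known namespace matched this predicate.
            counts["other"] += 1
    return counts


# Hypothetical data mixing shared and home-made properties
triples = [
    ("http://example.org/people/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/docs/1", "http://purl.org/dc/terms/title", "A report"),
    ("http://example.org/docs/1", "http://example.org/def/reviewedBy", "http://example.org/people/alice"),
]

usage = vocabulary_usage(triples)
```

A large 'other' count is not necessarily wrong, since sometimes no suitable vocabulary exists, but it is a prompt to check whether an existing one would do.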
6. Document your data
OK, so your data is fully readable by machines, but the people who direct those machines need to know what to point them at. Many data sets would be far more useful with a small amount of documentation for us humans: an explanation of what the data is about, its provenance and context, and the main classes involved, plus some example SPARQL queries to get people started with exploring it.
Can anyone suggest improvements or additions to this list?