RDF datasets and graphs
I think we need to establish a common practice to link RDF datasets and named graphs.
I’ve been working recently on discovery of linked data: how people can find out easily what is available and how they can use it, either directly or by building it into new applications.
The dataset is clearly an important concept in this, albeit a rather vague one – essentially it’s a bunch of data that belong together somehow.
As most of you will already know, an ever-increasing list of UK government datasets is catalogued at http://data.gov.uk/data, each with some supporting information on what it’s about, where it came from and where to go to access all the details.
Some of that data has been made available as Linked Data, with dereferenceable URIs and SPARQL endpoints; the information on schools is one example. The Linked Data approach can be very powerful, but for a ‘data consumer’ it can also be difficult to know where to start. One of the things I’m currently working on is to publish some simple additional information that will hopefully make it easier to exploit the Linked Data part of data.gov.uk.
Which brings me back to describing Linked Data datasets. The de facto standard for this (and the only show in town) is voiD, which defines a vocabulary and recommends some good practices for describing a dataset.
Via the void:dataDump property, you can point to a location where you can get a copy of all the data in the dataset. And using dcterms:isPartOf, you can link the description of a resource back to the dataset that it’s part of.
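To make those two properties concrete, here is a minimal Python sketch that writes out such a description as Turtle. The example.org URIs and the helper name describe_dataset are made up for illustration; only void:Dataset, void:dataDump and dcterms:isPartOf come from the vocabularies themselves.

```python
# Sketch only: all example.org URIs below are hypothetical placeholders.
PREFIXES = (
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
)


def describe_dataset(dataset_uri, dump_url, resource_uris):
    """Return a Turtle string describing a dataset, its dump and its members."""
    lines = [
        PREFIXES,
        f"<{dataset_uri}> a void:Dataset ;",
        f"    void:dataDump <{dump_url}> .",
        "",
    ]
    # Link each resource back to the dataset it belongs to.
    for uri in resource_uris:
        lines.append(f"<{uri}> dcterms:isPartOf <{dataset_uri}> .")
    return "\n".join(lines) + "\n"


turtle = describe_dataset(
    "http://example.org/dataset/schools",
    "http://example.org/dumps/schools.nt",
    ["http://example.org/id/school/1"],
)
print(turtle)
```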
OK, so far so good. However, one important thing that seems to be missing in this picture is how to restrict a SPARQL query to a particular dataset or shortlist of datasets.
The typical approach with statistical data has been to use SCOVO, soon to be usurped by the RDF Data Cube vocabulary from Dave Reynolds et al. (still a work in progress but hopefully soon to be released). Those approaches link individual observations back to the dataset they belong to, which makes it easy to limit a query to a particular dataset. But not all data fits that kind of pattern.
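As a rough sketch of what limiting a query by that observation-to-dataset link might look like, the Python below builds a SPARQL query string. It assumes the link is expressed with qb:dataSet, the property name used in the Data Cube vocabulary; the dataset URI and the function name observations_query are hypothetical.

```python
# Sketch only: the dataset URI passed in below is a hypothetical placeholder.
def observations_query(dataset_uri, limit=10):
    """Build a SPARQL query for observations attached to one dataset."""
    return (
        "PREFIX qb: <http://purl.org/linked-data/cube#>\n"
        "SELECT ?obs ?p ?o\n"
        "WHERE {\n"
        # Only observations that point back at the chosen dataset match.
        f"  ?obs qb:dataSet <{dataset_uri}> ;\n"
        "       ?p ?o .\n"
        "}\n"
        f"LIMIT {int(limit)}\n"
    )


query = observations_query("http://example.org/dataset/spending")
print(query)
```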
However, this is exactly the problem the named graph approach was created to solve: it is a convenient way of grouping a bunch of triples together and letting us talk about them, and it’s already supported by most RDF databases/quad stores.
VoiD lets you link a dataset to a SPARQL endpoint, but the voiD guide says “Note: It is assumed that the default graph of the SPARQL endpoint is the dataset itself”. This seems unnecessarily restrictive, as it implies only one dataset per endpoint, and a lot of the value of SPARQL and Linked Data in general is the ability to connect stuff across multiple datasets.
We can link named graphs to voiD datasets (by using dcterms:isPartOf, for example). If the data available through SPARQL endpoints is grouped into those named graphs, that link provides an easy mechanism for adding metadata and finding aids for the steadily increasing list of linked data datasets.
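Here is a minimal sketch of how a query could then be restricted to a shortlist of datasets. It assumes the endpoint keeps the "graph dcterms:isPartOf dataset" triples in its default graph and supports the SPARQL 1.1 VALUES keyword; the dataset URIs and the function name dataset_scoped_query are hypothetical.

```python
# Sketch only: assumes "graph isPartOf dataset" triples live in the default
# graph; the dataset URIs used below are hypothetical placeholders.
def dataset_scoped_query(dataset_uris, pattern="?s ?p ?o", limit=10):
    """Build a SPARQL query matching `pattern` only inside named graphs
    that are declared to be part of one of the given datasets."""
    values = " ".join(f"<{uri}>" for uri in dataset_uris)
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT *\n"
        "WHERE {\n"
        f"  VALUES ?dataset {{ {values} }}\n"
        # Find the named graphs that belong to the shortlisted datasets...
        "  ?graph dcterms:isPartOf ?dataset .\n"
        # ...and evaluate the pattern only inside those graphs.
        f"  GRAPH ?graph {{ {pattern} }}\n"
        "}\n"
        f"LIMIT {int(limit)}\n"
    )


query = dataset_scoped_query(
    ["http://example.org/dataset/schools", "http://example.org/dataset/spending"]
)
print(query)
```

The same endpoint can then serve many datasets, and a consumer only needs the dataset URI from the voiD description to scope a query.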