Tetherless World group translate data.gov datasets to RDF
“The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.” That’s a laudable objective, and TW’s work in converting some of the available data to RDF has the potential to add significantly to the pool of Linked Data.
Making the whole process of publishing linked data easier is something that I’m very interested in and actively working on – so I spent a bit of time looking at how TW had gone about this process to see what I could learn.
The first thing that struck me is that there’s a lot of data there! As of July 2009, 16 RDF datasets have been created, comprising about 3 billion triples. To get a handle on how the RDF was designed, I looked in more detail at one of the smallest datasets, “Worldwide M1+ Earthquakes, past 7 days”, with a mere 9430 triples. (Of course the size of the dataset varies from day to day, depending on how busy the world’s earthquake zones have been, but it’s always small enough to easily load the whole thing in your browser.)
The Tetherless team explain the principles they have followed in the translation work:
- Keep the translation minimal
- Let the translation meet the web
- Make the translation extensible
- Preserve knowledge provenance
Most of these make sense to me: preserving provenance is clearly essential and I like the idea of making the translation extensible – which is achieved by providing a wiki page for each property where users can add documentation.
However, I think the first principle, “Keep the translation minimal”, will limit the value of what can be achieved. The approach is a commonly used one: map each row in a table to a URI, which becomes the subject of a set of RDF triples, with each column as the property and the cell contents as the object. The property names are generated automatically from the column labels. The main reason for this seems to be to keep the workload manageable: “mapping to existing semantic properties could take non-trivial time”. That’s quite understandable, particularly given the very large quantities of data they are dealing with, but combined with the simplistic structuring of the data it means that the translation to RDF is more syntactic than semantic.
Taking the earthquake data set as an example, here’s the RDF for one row in the table, relating to one earthquake:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dgtwc="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#"
         xmlns="http://data-gov.tw.rpi.edu/vocab/p/34/"
         xml:base="http://data-gov.tw.rpi.edu/raw/34/data-34.rdf">
  <rdf:Description rdf:about="#entry00002">
    <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
    <src>nc</src>
    <eqid>71260916</eqid>
    <version>0</version>
    <datetime>Thursday, August 6, 2009 09:43:26 UTC</datetime>
    <lat>39.4313</lat>
    <lon>-123.1092</lon>
    <magnitude>2.0</magnitude>
    <depth>2.60</depth>
    <nst>12</nst>
    <region>Northern California</region>
  </rdf:Description>
</rdf:RDF>
In addition there is an index file associated with the dataset, which relates each property to its wiki page, but there are no RDF links to external vocabularies or resources. (By looking at the website of the original data provider, the US Geological Survey, I was able to find out that the depth of the quake is measured in kilometres and that “nst” refers to the number of observing stations that reported the quake, but this information is not included in the RDF.)
Think about what kind of questions this dataset might help to answer: for example, what earthquakes happened in California near location X during time period Y? If the data could be structured to say that the subject of these triples is an earthquake, and to use “standard” (or at least widely used) vocabularies for time, latitude, longitude and place names, then the job of a search engine or mashup application would be much easier.
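To make that concrete, here is a sketch (my own illustration, not anything TW have published) of how the same entry might look if it used the W3C WGS84 geo vocabulary for the position, Dublin Core terms for the date and the region, and a link out to DBpedia; the Earthquake class is just a placeholder, since I’m not aware of an agreed vocabulary term for it:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xml:base="http://data-gov.tw.rpi.edu/raw/34/data-34.rdf">
  <rdf:Description rdf:about="#entry00002">
    <!-- "Earthquake" is a placeholder class, not a term from any established vocabulary -->
    <rdf:type rdf:resource="http://example.org/vocab#Earthquake"/>
    <!-- WGS84 properties make the coordinates recognizable to generic mapping tools -->
    <geo:lat rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">39.4313</geo:lat>
    <geo:long rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">-123.1092</geo:long>
    <!-- an xsd:dateTime instead of a free-text date string -->
    <dcterms:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-08-06T09:43:26Z</dcterms:date>
    <!-- a link to an external resource rather than a plain literal -->
    <dcterms:spatial rdf:resource="http://dbpedia.org/resource/Northern_California"/>
  </rdf:Description>
</rdf:RDF>

Marked up like this, a query for quakes within a bounding box and a date range becomes a simple filter over geo:lat, geo:long and dcterms:date, and the dcterms:spatial link ties the record into the wider web of data.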
I don’t mean to be overly critical, as TW (and of course the data.gov initiative itself) are to be praised for getting some large public datasets up in semantic web formats. But to really get the value from Linked Data, I think we need to put the emphasis on the links. It will take more work, but perhaps more value can be created by a smaller number of datasets that have been intensively marked up.
I’m sure the TW team understand this very well, but have balanced the need to process a lot of data in a reasonable time against the amount of work they can put into each dataset.
Update 7 August:
TW have produced some very nice interactive visualizations of some of the data.gov datasets, including the earthquake data discussed above. This shows one of the benefits of converting to RDF, regardless of how much external linking you do: it lets you use powerful standard tools like SPARQL to work with your data.
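For example (a sketch of my own, using the property names from the earthquake dataset above rather than a query taken from the TW site), something like the following would pick out the week’s larger quakes:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX quake: <http://data-gov.tw.rpi.edu/vocab/p/34/>

SELECT ?entry ?region ?magnitude
WHERE {
  ?entry quake:region ?region ;
         quake:magnitude ?magnitude .
  # the magnitudes are plain literals, so cast them before comparing
  FILTER (xsd:decimal(?magnitude) >= 4.0)
}
ORDER BY DESC(xsd:decimal(?magnitude))

That works even with the minimal translation; the external links discussed above would come into their own once you want to join this data with other sources.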