Data as marketing
Via Nathan Yau’s Flowing Data blog, I came across this fascinating visualisation of Netflix data, published in the New York Times. It shows the top 50 most rented Netflix movies, zip code by zip code for several US cities.
It’s very nicely done and it’s very engaging. It made me think of Scott Brinker’s ideas on data web marketing:
“Data web marketing is about growing customer relationships, increasing the visibility of your firm, and building brand equity through the production and delivery of data.”
By giving their data to the New York Times (and maybe helping with creating the visualisation, I don’t know what the deal was), Netflix have created a huge amount of attention for their company and their brand: not only lots of NYT readers playing with the maps, but lots of blog posts like Nathan’s, Stephen Baker’s and this one talking about it. I think Scott is right that data web marketing is soon going to be a big deal.
Use RDFa, get more traffic
Martin Hepp’s work in creating, promoting and implementing the GoodRelations ontology for e-commerce reached another milestone this week, with the first concrete indications that including RDFa metadata in your page will improve its ranking on Google.
Martin and others have been talking about the potential SEO benefits of RDFa for a while, but it looks like that is now being realised by BestBuy through their use of GoodRelations. BestBuy has been carrying out a trial of adding GoodRelations mark-up to its stores and products pages (see this earlier post on the topic), and Jay Myers of BestBuy presented some early results of this trial at the Search Engine Strategies 2009 conference this week. (Jay says on Twitter that his slides should be up on Slideshare soon.)
Martin’s post on the Business of Linked Data mailing list reports:
- “GoodRelations + RDFa improved the rank of the respective pages in Google tremendously”
- “Jay also reported a 30 % percent (!) increase in traffic on the BestBuy stores pages”
- “Yahoo observes a 15% increase in the Click-through-Rate”.
It’s early days of course, but if these results prove to be repeatable elsewhere, this is likely to be a major shake-up of the SEO world. Better ranking on Google is not the only, or even the main reason to start publishing semantic data, but it’s a specific benefit that many people will quickly understand.
Delivering Linked Data quickly
I’ve been involved in a couple of discussions in the last few days about speed of access to Linked Data and whether that will limit what can be done with it.
This was kicked off by some questions on Twitter from Greg Boutin. Greg was particularly interested in queries on distributed data, especially data that changes so frequently that caching it locally may be difficult. He’s worried that such queries will be too slow to be practical.
Michael Hausenblas posted his investigations into the use of HTTP caching by Linked Data publishers and found that only a small proportion use HTTP headers like ‘Last-Modified’ and ‘ETag’. (For an explanation of how that stuff works, read this: “Things caches do”.)
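To make the mechanism a bit more concrete, here is a minimal sketch of how a client can revalidate a cached document using those two headers. It is only an illustration: the `ValidatingCache` class, the `Fetcher` signature and the `fake_server` function are all made up for this example, and the fetcher stands in for a real HTTP client so that the code runs without a network. Header names are matched exactly here, which a real client should not rely on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# A fetcher takes (url, request_headers) and returns
# (status, response_headers, body). Hypothetical stand-in for an HTTP client.
Fetcher = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


@dataclass
class CachedEntry:
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


class ValidatingCache:
    """Hypothetical helper that revalidates cached documents before reuse."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._entries: Dict[str, CachedEntry] = {}

    def get(self, url: str) -> bytes:
        headers: Dict[str, str] = {}
        entry = self._entries.get(url)
        if entry is not None:
            # Ask the server to send the body only if it has changed.
            if entry.etag:
                headers["If-None-Match"] = entry.etag
            if entry.last_modified:
                headers["If-Modified-Since"] = entry.last_modified
        status, response_headers, body = self._fetch(url, headers)
        if status == 304 and entry is not None:
            # Not Modified: the local copy is still good.
            return entry.body
        # Full response: store the body together with its validators.
        self._entries[url] = CachedEntry(
            body=body,
            etag=response_headers.get("ETag"),
            last_modified=response_headers.get("Last-Modified"),
        )
        return body


def fake_server(url: str, headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Made-up publisher that supports ETag validation."""
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, b""
    return 200, {"ETag": '"v1"'}, b"<rdf/>"
```

If the publisher sends neither header, the cache has nothing to put in the conditional request, so every call falls through to a full download. That is the practical cost of the gap Michael found.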
Another angle on caching is to keep your own copy of external datasets, but in that case you also need to update your local copy if the external data changes. There’s been an interesting thread on this on the LOD mailing list, started by George Kobilarov.
My feeling is that the approach of Linked Data to piggy-back on the proven-to-be-scalable architecture of the web is definitely the right way to go, though there is clearly work to do to make the best use of the tools available to us. Some kinds of applications are more suited to LD than others. But the web already offers a sound caching mechanism – as a first step we should be using that to better advantage.
Need for better linked data practices
A few weeks ago I asked on this blog “What makes good linked data?”. A recent blog post by Mike Bergman and Frédérick Giasson really helps to get to the heart of this question.
If you are interested in how to use linked data effectively you should definitely read Mike and Fred’s article. To summarise briefly, they look at two specific high profile examples of linked data: the Rensselaer Polytechnic Tetherless World group’s work on converting data.gov datasets to RDF, and the recent New York Times initiative to open up their topic pages as RDF.
The main problem with the TW data is that their approach is a simple mapping from rows and columns of a table to RDF statements, with limited further information about the meaning of predicates. They don’t dig into the semantics of the contents of table cells, which is sometimes not obvious. I had a look at these issues a couple of months back in these two articles, though Mike and Fred examine it in more depth.
With the New York Times data, as Mike and Fred explain, the problem is essentially one of confusing information about a person with information on articles about that person, the old information vs non-information resource dilemma. Richard Cyganiak also wrote about this a few weeks ago.
Hopefully the efforts of the Pedantic Web initiative can help spread good practices in this area. (Let’s hope it’s not long before Galway dries up enough that they can turn their servers back on!).
What are the benefits of the semantic web to publishers?
This interesting question was asked yesterday by Stuart Myles on Semantic Overflow. I spent a fair bit of time trying to answer it, so I thought I would repost my thoughts here.
Stuart’s call to arms was “Convince me (and others) to join the Linked Data cloud. Sell me on the benefits!”.
Here’s my response:
OK I’ll have a go at convincing you. What do you want as a publisher? Presumably you want as many people as possible to read what you are publishing and to do that, you need to make it interesting or useful, preferably both. Suppose you are the publisher of “Gardener’s Monthly” and as well as all your great articles, you’ve accumulated over time a heap of useful data about which plants grow well in different climates or soil types, how big they get, what colour the flowers are, what diseases they are susceptible to etc (I don’t know much about gardening, so my example will probably not be very realistic!)
This enables you to provide all kinds of useful stuff that your readers want to know. Maybe Fred thinks he fancies planting a shrub in the corner of his garden, but he doesn’t know what kind to get. He’s got a small garden so he wants one that won’t get bigger than 3 feet tall and he likes purple flowers. You have all the information he needs, so how do you provide it to him, without him having to read through all the back issues one by one?
Of course Fred isn’t going to crank up curl and start dereferencing your URIs. Someone has to take this data and present it in a human friendly way. Maybe that’s you as the owner of this great shrubbery database, maybe it’s a plant retailer who decides to aggregate data from multiple sites. But someone can build some kind of browsing or search interface that helps Fred find his answer. Once he’s narrowed his choice down to a rhododendron or an azalea, he can follow the links (rdfs:seeAlso or whatever) back to regular articles on your site to find out more detailed information (even if ShrubSearch is operated by someone else).
So you get to be known as the best source of data in your domain, you get lots of readers, you make millions in advertising from the seed and fertiliser companies. Job done.
There are examples of companies making this work in practice with ‘regular’ data – for example IMDB. If I want to know something about a movie or an actor, I usually go straight to imdb.com, rather than to google or wikipedia. The advantage of doing this with linked data is that you don’t necessarily have to build the user interface to the data yourself to get the benefits (though it might be a good idea to do that) – your data might be used in all sorts of unexpected ways, and by appropriate links and ‘owning’ the URIs for key things in your field, you still draw in readers. And you can merge your own data with other people’s – you can pull in data from the National Plant Disease Research Centre (made up) or whatever to provide a better service to your readers.
There are lots of things that publishers can do on the web that they couldn’t do on paper – the successful publishers of the future will be the ones that recognise and exploit the new possibilities.
Let me know if that has helped persuade you!
There was another interesting answer to the question from Ian Davis, which is also definitely worth a read.
What makes good linked data?
Readers of this blog probably take it for granted that publishing more linked data is a Good Thing. Linked data should follow the established design principles, but that still leaves many ways to represent your data as RDF. Which is the ‘best’ way? How do we decide if our approach is good or bad? How do we improve?
So here’s my first attempt at quality criteria for linked data.
1. Valid
Step 1 is obvious: the syntax has got to be correct. There are some handy services to check this for you, for example http://www.w3.org/RDF/Validator/ for RDF/XML and http://validator.w3.org/ for HTML+RDFa.
2. Accurate
The RDF representation of the information should say what the author meant to say. This is clearly important, but difficult to assess automatically. Common mistakes include mixing up a URI for something (France, say) and the URI for a document about that thing (eg the Wikipedia page on France); or using an inverse functional property (like foaf:homepage) inappropriately, causing reasoners to conclude falsely that two distinct things are the same.
3. Highly linked
The clue is in the name! Linked data should involve lots of links. Linking your information to related external information allows it to be used by search engines and data browsers. It produces a more highly connected overall graph, which can then be queried in more different ways.
4. Use existing resource identifiers where possible
Related to the recommendation to link to external data, you should use existing identifiers for resources, where suitable ones exist. If you use an existing identifier for France, http://dbpedia.org/resource/France for example, then your information on France can easily be linked to other related data. If you make up your own identifier then machines don’t know that it refers to the same thing. You can assert that using owl:sameAs, but that still adds another layer of complication. Stefano Mazzocchi explains the importance of relational density in RDF graphs.
5. Use existing vocabularies where possible
Using existing ontologies or vocabularies for the classes and properties in your data makes it easier for applications to process and easier to design queries that use it. Of course, we’re still in the early stages of the semantic web and for many purposes, no suitable ontology exists. If you have to create your own ontology you should publish it as OWL and also as HTML, documenting how it should be used. FOAF is a good example of this.
6. Document your data
OK so your data is fully readable by machines, but the people who direct those machines need to know what to point them at. The usefulness of many data sets could be greatly enhanced with a small amount of documentation for us humans, explaining what the data is all about, its provenance and context, the main classes involved, and providing some example SPARQL queries to get people started with exploring it.
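The inverse functional property pitfall mentioned under “Accurate” is easier to see with a toy example. The sketch below is not a real reasoner, just a few lines that apply the one rule that matters here: if two subjects share a value for an inverse functional property, they are taken to be the same thing. The people and URIs are invented for illustration.

```python
from itertools import combinations

FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"

# Invented data: two colleagues have both (wrongly) been given their
# employer's site as foaf:homepage, which is an inverse functional property.
triples = [
    ("http://example.org/people/alice", FOAF_HOMEPAGE, "http://example.com/"),
    ("http://example.org/people/bob", FOAF_HOMEPAGE, "http://example.com/"),
    ("http://example.org/people/carol", FOAF_HOMEPAGE, "http://carol.example.net/"),
]


def inferred_same_as(triples, inverse_functional_properties):
    """Return pairs of subjects that the IFP rule would treat as identical."""
    subjects_by_value = {}
    for subject, predicate, obj in triples:
        if predicate in inverse_functional_properties:
            subjects_by_value.setdefault((predicate, obj), set()).add(subject)
    pairs = set()
    for subjects in subjects_by_value.values():
        # Every pair of subjects sharing the value is merged.
        for a, b in combinations(sorted(subjects), 2):
            pairs.add((a, b))
    return pairs


merged = inferred_same_as(triples, {FOAF_HOMEPAGE})
```

Here `merged` pairs up Alice and Bob, which is exactly the false conclusion described above. Carol is left alone because her homepage really is hers.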
Can anyone suggest improvements or additions to this list?
Semantic web going mainstream?
Yesterday Jim Hendler’s post “It’s just a matter of semantics” was published by the CNN Brainstorm Tech blog. As Hendler said on Twitter, it was “edited in odd ways but not totally butchered”.
A few weeks back ReadWriteWeb identified structured data (with semantic web at the core of it) as one of the 5 big technology trends of 2009. Maybe you wouldn’t call RWW mainstream, though it’s much more widely read than the CNN blog. The significance of the CNN post is that the CNN guys obviously think this is something that is relevant for their more general interest readership.
It’s still a challenge for us in the semweb community to explain clearly to lay people what it’s all about and why it’s important. I think Hendler’s piece makes a good stab at it, but we need to keep working on it. I liked Benjamin Nowack’s suggestions on the LOD mailing list recently, especially how to explain linked data to Oprah: “Think Tom Cruise, but double the excitement.” :-)
Got some good examples of explaining the semantic web to ‘normal people’? Post a link in the comments.
OpenPSI helps UK government get more data online
OpenPSI is a new project set up recently by the University of Southampton and the UK government Office of Public Sector Information (OPSI). Thanks to John Darlington of Southampton Uni School of Electronics and Computer Science for bringing it to my attention.
OpenPSI aims both to assist government data owners in getting their data online in semantic web formats and to connect with potential users of that data. Their approach makes a lot of sense: by helping with both the supply of data and the exploitation of it, their efforts should assist in generating some real benefits from public sector data. Demonstrating what can be done should be a big help in encouraging more data to be made available and more applications to be created.
OpenPSI provides SPARQL access to what will hopefully become a large corpus of UK public data – and importantly gives the ability to query across multiple data sets.
John Sheridan, Head of e-Services at the OPSI, is a keen advocate of open data and the semantic web. In this July 2009 Talis podcast with Paul Miller, he explains his vision of the benefits that open government data can bring and the hands-on approach of his department in helping to make it happen.
And there still seems to be top-level support for this process: Gordon Brown recently met with Tim Berners-Lee and Nigel Shadbolt, Professor of AI at Southampton Uni (also CTO of online identity and semweb innovators Garlik), continuing to push the message of using open data to bring greater transparency and efficiency to government. I’m glad to say the message seems to be getting through.
Publishing table-based data as RDFa
Tables of data are a common feature of reports and blogs, so they represent an important use case for getting RDF online. The simplest and most commonly used approach is to assign a URI to each row in the table. That URI forms the subject of a set of triples, with the column name as predicate and the cell contents as the value. This works fine in some simple cases, but many tables are not so simple.
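As a rough sketch of that simple row-per-subject mapping (the function name, the `http://example.org/co2/` base URI and the URI patterns are all assumptions made up for this illustration):

```python
from urllib.parse import quote


def table_to_triples(header, rows, key_column, base_uri):
    """Map each row to one subject URI and each column to one predicate."""
    triples = []
    for row in rows:
        record = dict(zip(header, row))
        # The subject URI is derived from the row's key cell.
        subject = base_uri + "row/" + quote(str(record[key_column]))
        for column, value in record.items():
            # The predicate URI is derived from the column name.
            predicate = base_uri + "column/" + quote(column)
            triples.append((subject, predicate, value))
    return triples


header = ["Country", "1980", "1981", "1982"]
rows = [
    ["Bermuda", 0.53, 0.46, 0.49],
    ["Canada", 458.35, 443.01, 424.26],
]
triples = table_to_triples(header, rows, "Country", "http://example.org/co2/")
```

Notice what happens to the year columns: each year turns into its own predicate, and nothing in the output says that the numbers are emissions or what unit they are in. That is the sort of table where the simple mapping starts to creak.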
My main concern was to investigate the options for the RDF design of the data. I could have published a standalone RDF file, but I like the self-contained nature of the RDFa approach, where the RDF and HTML representations are both in the same web page. So my experiment was to publish a table of data as HTML and RDFa.
I’m a fan of the always interesting Guardian datablog so that was a natural place to look for some sample data. When I started looking at this a couple of weeks ago, the article that day was by Simon Rogers on Carbon Dioxide emissions and the accompanying data had some interesting features.
The dataset consists of carbon dioxide emissions for each country in the world, year by year from 1980 to 2006. Each data point is a physical quantity, with an associated unit. This pattern of time-varying data is extremely common, but not entirely straightforward to handle in RDF – so I think it makes an interesting case.
I selected a small part of the data (the first six countries and the first three years), to keep things simple and to respect the licence for the data. (Hopefully this little snippet counts as “fair use”!)
Time varying data
So how can we best represent this data in RDF?
Taking the simple approach of one subject per row, one predicate per column doesn’t really work with this kind of data structure. DBpedia typically represents this type of information with a pair of properties. For example the GDP of France is specified by the properties “gdpNominal” to tell you it is a measure of GDP and “gdpNominalYear” to tell you the year. However, if you have data for more than one year, then this approach no longer works.
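To see why the paired-property pattern breaks down, remember that an RDF graph is an unordered set of triples. In the sketch below the property names are shortened to bare strings and the GDP figures are invented purely for illustration.

```python
from itertools import permutations

FRANCE = "http://dbpedia.org/resource/France"

# Two years of data expressed with the paired-property pattern.
# The amounts are made-up numbers, not real GDP figures.
graph = {
    (FRANCE, "gdpNominal", 100.0),
    (FRANCE, "gdpNominalYear", 2007),
    (FRANCE, "gdpNominal", 110.0),
    (FRANCE, "gdpNominalYear", 2008),
}


def values(graph, subject, predicate):
    """Collect all object values for a subject/predicate pair."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


amounts = values(graph, FRANCE, "gdpNominal")
years = values(graph, FRANCE, "gdpNominalYear")

# Nothing in the graph ties an amount to a year, so every way of
# pairing them up is equally consistent with the data.
candidate_pairings = [list(zip(years, perm)) for perm in permutations(amounts)]
```

With two years there are already two equally valid pairings, and the graph gives a consumer no way to choose between them.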
Another approach could be to include the year information in the property name, such as “CO2emissions1980”, “CO2emissions1981” etc, but that leads to a large number of properties that are not very reusable. And in this case we want to specify a unit too.
So some kind of N-ary relation is required. Ian Davis recently published a series of articles reviewing and comparing the different options for representing time in RDF. One of those options is using N-ary relations and that is the approach I decided to take.
I defined a class (in my own namespace) called SpaceAndTimeDependentObservation. (At some point I’ll do the extra work required to create a small ontology around this class – I haven’t done that yet). Each cell in the table becomes an instance of that class, with a location (the country) and time (the year) associated with it, as well as a property, CO2emissions, whose value is a quantity with an amount and a unit. So each table cell becomes a graph fragment that looks like this:

(I’ve used an ellipse to represent a resource and a rectangle to represent a literal value. I’ve left out the namespace prefixes to keep it simple. The ellipses with no text in them are blank nodes.)
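For anyone who prefers code to diagrams, here is the same graph fragment as a small Python sketch that lists the triples for one table cell. The `http://example.org/ns#` namespace is a placeholder standing in for my own namespace, and the function and the blank node labels are made up for illustration. Typed literals are written as (value, datatype) tuples.

```python
from itertools import count

W = "http://example.org/ns#"  # placeholder namespace, an assumption
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"

_bnode_ids = count(1)


def new_bnode():
    """Return a fresh blank node label such as '_:b1'."""
    return f"_:b{next(_bnode_ids)}"


def observation_triples(location_uri, year, amount, unit):
    """Build the triples describing one table cell as an n-ary relation."""
    observation = new_bnode()  # the cell itself
    quantity = new_bnode()     # the amount-plus-unit node
    return [
        (observation, RDF + "type", W + "SpaceAndTimeDependentObservation"),
        (observation, W + "location", location_uri),
        (observation, W + "time", (str(year), XSD + "gYear")),
        (observation, W + "CO2emissions", quantity),
        (quantity, RDF + "value", (str(amount), XSD + "float")),
        (quantity, W + "unit", unit),
    ]


# Bermuda, 1980, using the Geonames URI for the country.
cell = observation_triples(
    "http://sws.geonames.org/3573345/", 1980, 0.53, "million metric tonnes"
)
```

Each cell costs two blank nodes, but the year, place and unit now hang off the observation itself rather than being baked into a property name.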
Countries
To follow good Linked Data practice, it makes sense to use existing URIs for countries, allowing the data here to be connected to other information about those countries. The two obvious choices here are DBpedia and Geonames. The DBpedia URIs are more ‘readable’ but the Geonames URIs are more closely linked to all kinds of other useful geographical data through the Geonames database, so I decided to use Geonames.
Quantities and units
For the CO2 emissions data themselves, it was important to specify a unit alongside the numbers. I used a ‘quantity’ blank node with an rdf:value and a unit. For now I just named the unit in my own namespace, but a better approach would be to use an established units ontology such as QUDT or SWEET.
Times
There are a few ontologies to choose from for describing time and time intervals. In this case I decided to mint my own property (imaginatively called ‘time’) because I wanted its meaning to be tied to my SpaceAndTimeDependentObservation class. I annotated the year values (1980 etc) with the xsd:gYear datatype.
Putting it all together in RDFa
The first thing to do was to change the DOCTYPE for this blog to “-//W3C//DTD XHTML+RDFa 1.0//EN”, so that the RDFa markup will be interpreted correctly. The data itself is a straightforward HTML table. Within each <td> element is a selection of divs to hold all the RDFa markup, with the only “visible” content being the data from the original Guardian spreadsheet.
A sample cell of the table looks like this:
<td>
  <div typeof="w:SpaceAndTimeDependentObservation">
    <div rel="w:CO2emissions">
      <div property="rdf:value" datatype="xsd:float">0.53</div>
      <div property="w:unit" content="million metric tonnes"></div>
    </div>
    <div rel="w:location" resource="http://sws.geonames.org/3573345/"></div>
    <div property="w:time" content="1980" datatype="xsd:gYear"></div>
  </div>
</td>
Do a “view source” on this page to see the full story. One thing to note is that this approach is quite verbose, adding about 300 characters of markup to each cell of actual data. I ended up with 6 triples per cell in order to represent reasonably precisely the data from the original table.
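Writing that much markup by hand for every cell gets tedious fast, so in practice you would probably generate it. Here is a minimal sketch of how that could look. The template mirrors the sample cell above, but the function names are made up, and this is not the script I used for this page.

```python
from html import escape

# Mirrors the sample cell above, written without whitespace between tags.
CELL_TEMPLATE = (
    "<td>"
    '<div typeof="w:SpaceAndTimeDependentObservation">'
    '<div rel="w:CO2emissions">'
    '<div property="rdf:value" datatype="xsd:float">{amount}</div>'
    '<div property="w:unit" content="{unit}"></div>'
    "</div>"
    '<div rel="w:location" resource="{location}"></div>'
    '<div property="w:time" content="{year}" datatype="xsd:gYear"></div>'
    "</div>"
    "</td>"
)


def rdfa_cell(amount, year, location, unit="million metric tonnes"):
    """Render one table cell with its RDFa markup."""
    return CELL_TEMPLATE.format(
        amount=escape(str(amount)),
        unit=escape(unit),
        location=escape(location),
        year=escape(str(year)),
    )


def markup_overhead(amount, year, location):
    """Characters of markup added around the visible cell value."""
    return len(rdfa_cell(amount, year, location)) - len(str(amount))


bermuda_1980 = rdfa_cell(0.53, 1980, "http://sws.geonames.org/3573345/")
```

`markup_overhead` gives a quick way to check the per-cell cost if you change the design, for example by moving the unit up to the table level.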
One tool I found very handy while writing all this stuff was Mark Birbeck’s Ubiquity RDFa parser bookmarklet. I recommend you give it a try.
The final result
So here it is: my RDFa marked-up HTML table representing the original data. Useful further work would be to add some additional metadata on the authorship and provenance of the dataset, but I’ll save that for another day.
One thing that stands out through this whole process is that there are many design choices to be made when deciding how to represent data as RDF. I’d be keen to hear what you think of the approach I’ve described here and whether you would do it differently.
| 2006 World Ranking | Country | 1980 | 1981 | 1982 |
|---|---|---|---|---|
| 176 | Bermuda | 0.53 | 0.46 | 0.49 |
| 7 | Canada | 458.35 | 443.01 | 424.26 |
| 179 | Greenland | 0.01 | 0.01 | 0.00 |
| 13 | Mexico | 240.43 | 266.61 | 282.25 |
| 207 | Saint Pierre and Miquelon | 0.15 | 0.15 | 0.13 |
| 2 | United States | 4788.65 | 4666.19 | 4421.14 |
|  | North America | 5488.11 | 5376.43 | 5128.28 |
More from the Tetherless World
In a happy coincidence, in Paul Miller’s latest podcast (released today), his guests are Jim Hendler and Li Ding from the Rensselaer Polytechnic Institute. Jim and Li explain the background and their thinking on the process of creating RDF versions of the data.gov datasets, which I wrote about last week.
Paul was kind enough to ask them a question I suggested, based on my previous post: why they chose to “keep the translation minimal”, meaning that the structure of the RDF data closely matches that of the original table, and the resources and properties they generate are not linked to external URIs or vocabularies.
The answer (starting about 28:45 in the podcast) is partly that this simply takes a lot of effort, and since they have just started this process of RDFizing the datasets they wanted to keep it simple. The other aspect of the response is more complex: the data.gov datasets tend to come with minimal documentation and the RPI team are not directly in touch with the data creators. Without a data dictionary to go with each dataset, there is a danger of making wrong assumptions about what the data is intended to represent, and so choosing an inappropriate property URI from another vocabulary.
Having said that, they are investigating the use of more in-depth markup for some datasets, including links to external ontologies like the W3C WGS84 lat/long vocabulary, as well as linking to existing URIs, for example the DBpedia entries for various government agencies.