Delivering Linked Data quickly
I’ve been involved in a couple of discussions in the last few days about speed of access to Linked Data and whether that will limit what can be done with it.
This was kicked off by some questions on Twitter by Greg Boutin. Greg was particularly interested in queries on distributed data, especially data that changes so frequently that caching it locally may be difficult. He’s worried that such queries will be too slow to be practical.
Michael Hausenblas posted his investigations into the use of HTTP caching by Linked Data publishers and found that only a small proportion use HTTP headers like ‘Last-Modified’ and ‘ETag’. (For an explanation of how that stuff works, read this: “Things caches do”).
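To make the role of those two headers concrete, here is a minimal sketch of how a client could reuse them to revalidate a cached copy instead of downloading it again. The function names and the plain-dict representation of headers are assumptions made for illustration, not part of any particular library.

```python
from typing import Dict, Optional


def revalidation_headers(cached_headers: Dict[str, str]) -> Dict[str, str]:
    """Build conditional request headers from a cached response's validators.

    Assumes (for this sketch) that header names are stored in canonical case.
    """
    headers: Dict[str, str] = {}
    if "ETag" in cached_headers:
        # Ask the server to send a body only if the entity tag has changed.
        headers["If-None-Match"] = cached_headers["ETag"]
    if "Last-Modified" in cached_headers:
        # Ask the server to send a body only if the resource changed since then.
        headers["If-Modified-Since"] = cached_headers["Last-Modified"]
    return headers


def body_to_use(status: int, cached_body: bytes, new_body: Optional[bytes]) -> bytes:
    """Pick the body after revalidation: 304 means the cached copy is still good."""
    if status == 304 or new_body is None:
        return cached_body
    return new_body


if __name__ == "__main__":
    cached = {"ETag": '"abc123"', "Last-Modified": "Fri, 27 Nov 2009 15:34:38 GMT"}
    print(revalidation_headers(cached))
```

If the publisher sends neither header, the dictionary comes back empty and the client has no cheap way to ask "has this changed?", which is the gap Michael's survey points at.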
Another angle on caching is to keep your own copy of external datasets, but in that case you also need to update your local copy if the external data changes. There’s been an interesting thread on this on the LOD mailing list, started by George Kobilarov.
My feeling is that the approach of Linked Data to piggy-back on the proven-to-be-scalable architecture of the web is definitely the right way to go, though there is clearly work to do to make the best use of the tools available to us. Some kinds of applications are more suited to LD than others. But the web already offers a sound caching mechanism – as a first step we should be using that to better advantage.
I have been following this debate as well, and share Greg’s worries.
The rationalizations and assurances skirt the problem completely.
Even for the simplest queries processed in a distributed environment, performance is going to be an issue. One cannot wait 50ms-300ms for each step of processing to be brought back from a remote node. So, the other way is to crawl the space and absorb the stuff centrally. This is unrealistic, as everyone would have to do it and web traffic would grind to a halt. So, no real-time semantic web is possible with this approach. It all sounds very democratic at the outset, but it is unrealistic. Therefore, the only way is to have centralized clusters that would do the semantic web processing. Google crawls the web. It sometimes takes days before new stuff gets into its database. The queries/searches are all run against centralized infrastructure.
Caching is a red herring, for how will you know that a particular data item has expired? You would need to check every time! Then the check itself takes time. Traditional caching is done not to solve latency problems but rather to avoid retransmission of large fixed items.
Therefore, this is not going to work – it is doomed to fail.
Pawel Lubczonok ThoughtExpress
Pawel
Thanks for your thoughts. If I understand correctly, you are talking about the idea of the semantic web as a kind of live distributed database. I think you are right that it is not feasible (not at the moment anyway) to use the web of data in that way: it doesn’t work like a single database that you can query and get back a rapid answer.
But not every application of the semantic web requires that kind of behaviour. It’s a different approach that enables some new and different kinds of applications – and we need to work with and exploit its strengths, rather than lament the fact that it doesn’t work like a global version of MySQL.
Good comments from both of you.
Bill, I agree we should focus on Linked Data’s strengths, but there are 2 problems with that:
- What are those strengths, in terms of the benefits it delivers? I think it hasn’t made the case that it’s useful. The benefits of Linked Data have been very slow to materialize and the real-world applications are limited. As Andraz Tori and I highlighted in comments to my blog post (plug: here), adoption has been slow and this seems to reflect some inherent design problems.
- The expectations created by many technologists in the Linked Data community have been to turn the web into a giant database. This is repeated over and over again, to this day. This is what Linked Data was created for and is supposed to achieve. If there isn’t a clear line of sight to the realization of this vision, and worse, if there is a line of sight to its non-realization through the current LD design, then we ought to change the direction of our efforts. There is a lot of money going into Linked Data R&D today, and to speak in business terms I think we see increasingly that this money may yield a better ROI elsewhere.
Bill,
Nice post as usual!
The solution is just as you articulate: HTTP underpins the Web, which by inheritance and implication means it underpins Linked Data too.
User Agents and Servers should have a conversation using the terms available from the HTTP protocol. Beyond that, reasoning using properties like “owl:sameAs” and “rdfs:seeAlso”, combined with VoID data sets and the like, will lead to a federated DBMS that exceeds anything MySQL (and the like) could fantasize about.
As you know, we just need to get the following sorted out within the LOD community:
1. Auto-discovery patterns (via in addition to content negotiation + new response headers)
2. Use of HTTP’s cache-related headers
3. VoID data sets bound SPARQL endpoints
4. SPARQL extensions Vocabulary
5. SPARQL-FED extensions: how we virtualize SPARQL endpoints
6. Delta-Engines (e.g. what you see with DBpedia Live but sprayed out in PubSubHubbub style via Atom Pub or something similar).
Latency isn’t going to be the problem; it’s the uniform appreciation and utilization of the items above that will take time to propagate, as best practices etc.
This is a good debate – opposing views can really add value.
HTTP etc. is the delivery protocol, i.e. how resultant content is delivered to users. Linked data is not a delivery protocol – it is the content itself, on which actions are taken – hopefully things like real-time transactions – booking flights etc. If there is a latency of 500ms in delivering content to the user, we do not have a serious problem. However, if computations require access to a large number of distributed data sets, computational time can explode, as the duration of each step can potentially be equal to the latency between nodes. This will make computing anything except the trivial too slow and costly.
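As a rough back-of-the-envelope illustration of that point, the sketch below multiplies a number of dependent remote steps by the 50ms-300ms per-step latency mentioned earlier in this thread. The step counts are hypothetical, and the model assumes every step has to wait for the previous one, which is the worst case being described.

```python
def sequential_query_time_ms(steps: int, latency_ms: float) -> float:
    """Total wait when each step must finish before the next remote call starts."""
    return steps * latency_ms


if __name__ == "__main__":
    # Hypothetical numbers of dependent remote lookups in one computation.
    for steps in (2, 20, 200):
        low = sequential_query_time_ms(steps, 50)    # optimistic per-step latency
        high = sequential_query_time_ms(steps, 300)  # pessimistic per-step latency
        print(f"{steps} steps: {low / 1000:.1f}s to {high / 1000:.1f}s")
```

A handful of steps stays within what a user would tolerate; a couple of hundred dependent steps does not, unless some of them can be answered from a local copy.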
A real-time delta engine is a good solution if the deltas are sent to a limited number of other locations. If this is done by all distributed data sources to all consumers of their data, the web will be overwhelmed.
The argument I am bringing up is about the degree of centralization needed to achieve what is being proposed.
On a broader point, regarding the Semantic Web initiatives: the Web of data (LD) is the foundation for the Semantic Web. However, I see a significant lack of clarity in its development i.r.o. the difference between semantics and information (data is even lower level) – these are significantly distinct ideas. It is my opinion that most of the current semantic web proposals are actually proposals for the information web and not the semantic web – that is another debate though :-)
Pawel Lubczonok ThoughtExpress
Bill,
As per tweet, the public DBpedia instance now adds Cache directives to its HTTP response headers. Bearing in mind that we are dealing with the Static edition of DBpedia, we have gone for a 7-day expiration.
Sample:
curl -I -H "Accept: text/html" http://dbpedia.org/page/Paris
HTTP/1.1 200 OK
Server: Virtuoso/06.00.3124 (Solaris) x86_64-sun-solaris2.10-64 VDB
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Date: Fri, 27 Nov 2009 15:34:38 GMT
Accept-Ranges: bytes
Expires: Fri, 04 Dec 2009 15:34:37 GMT
Link: <http://dbpedia.org/data/Paris.rdf>; rel="alternate"; title="Metadata in RDF/XML format", <http://dbpedia.org/data/Paris.n3>; rel="alternate"; title="Metadata in N3/Turtle format", <http://dbpedia.org/data/Paris.json>; rel="alternate"; title="Metadata in JSON+RDF format"
Content-Length: 1109746
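For readers who want to see what a client can do with those values, here is a small sketch that reads the Date and Expires headers from the sample above and works out how long the response may be reused without contacting the server. The helper names are made up for this example; only the two header values come from the sample.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def freshness_lifetime(date_header: str, expires_header: str) -> timedelta:
    """How long the response stays fresh, measured from when it was generated."""
    return parsedate_to_datetime(expires_header) - parsedate_to_datetime(date_header)


def is_fresh(expires_header: str, now: datetime) -> bool:
    """True while the cached copy can be served without asking the server again."""
    return now < parsedate_to_datetime(expires_header)


if __name__ == "__main__":
    # Header values copied from the sample response above.
    date_value = "Fri, 27 Nov 2009 15:34:38 GMT"
    expires_value = "Fri, 04 Dec 2009 15:34:37 GMT"
    print(freshness_lifetime(date_value, expires_value))
    print(is_fresh(expires_value, datetime.now(timezone.utc)))
```

Within that window a cache does not need to check with the server at all, which is the case the Expires header is meant to cover.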
Re. DBpedia-Live [1], that’s going to be a different beast, so we have to deal with the factors in my initial response plus some consensus on which protocol we use for delta notifications (as per ongoing LOD conversation).
Links:
1. http://dbpedia-live.openlinksw.com/stats/
I don’t think Linked Data is supposed to be “the Google killer”, where you can do a distributed SPARQL query across the entire cloud and get real-time results back. If that is your desire, I still think the search engine space would have to build indexes and caches of the content. If anything, Linked Data is a disruptive technology for traditional search engines because it makes the crawling and matching easier (in theory) if the sources are rich in semantics and accessible down to fine-grained data levels.
But consider the average linked data application, and by average I mean: let’s move away from the “core” of the cloud in the academic and scientific spaces and move to commercial applications. In these cases, you may want to have an LD site dedicated to providing some feature about music – artists, albums, tracks. Rather than rebuilding a new database and modeling all of that data yourself, you can let the linked data cloud provide it for you.
A combination of caching and best practices on structuring your linked data for fast queries should provide a strong platform to build such an application. Consider the music site that is going to add some more data about an artist that is dereferenced out of the LD cloud. Should that dereferencing happen in real-time, or should there be a cache? If you’re querying a manageable set of records, a cache + 1 or 2 50ms queries would seem acceptable. People already de-reference things on the web from URIs – loading JavaScript and images, where round trip time is still in the tens of milliseconds. Of course you have to watch out for the number of external queries you make in real-time, but it can be managed.
This is essentially the resource oriented architecture of the web – defining a strategy on the resources you want to have immediately available, versus ones you will reference and load over time to provide a user experience.
To go further than this, I think we would have to think outside of the box – for example, what if the linked data cloud collaborated with clients to a better degree, allowing the browser to remember de-referenced URI content and letting JavaScript re-reference it on demand as needed on different pages?
Or, what if specific SPARQL query results across the cloud were promoted to linked resources as well, eliminating some of the distributed processing overhead? You query for this hotel + this map constantly? Why not make it its own resource, rather than running a SPARQL query each time?
Al, great comment. I agree that we should be looking at how to use LD in a practical way to create and enable new types of applications.