2013-10-01

Earlier this morning Martin Malmsten of the National Library of Sweden asked an interesting question on Twitter:

Do you need help hosting your LOD somewhere else? Could be a valuable excercise in LOD stability http://t.co/FlOT5KqZXn @3windmills @edsu

— Martin Malmsten (@geckomarma) September 30, 2013

Martin was asking about the Linked Open Data that the Library of Congress publishes, and how the potential shutdown of the US Federal Government could result in this data being unavailable. If you are interested, click through to the tweet and take a minute to scan the entire discussion.

Truth be told, I’m sure that many could live without the Library of Congress Subject Headings or Name Authority File for a day or two … or honestly even a month or two. It’s not as if the currency of this data is essential to the functioning of society the way financial, weather or space data is. But Martin’s question points at an interesting, more general issue: the longevity of Linked Open Data, and how it could be made more persistent.

In case you are new to it, a key feature of Linked Data is that it uses the URL to allow a distributed database to grow organically on the Web. So, in practice, if you are building a database about books, and you need to describe the novel Moby Dick, your description doesn’t need to include everything about Herman Melville. Instead it can assert that the book was authored by an entity identified by the URL

http://id.loc.gov/authorities/names/n79006936
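In RDF that is just a single triple pointing at the LC URL. Here is a sketch using rdflib, with a made-up URI standing in for the book record:

```python
# A sketch of that assertion in RDF using rdflib; the URI for the book
# record itself is made up for the example.
from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.add((
    URIRef("http://example.org/books/moby-dick"),            # your book record
    DCTERMS.creator,                                          # "was authored by"
    URIRef("http://id.loc.gov/authorities/names/n79006936"),  # Herman Melville
))
print(g.serialize(format="turtle"))
```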

When you resolve that URL you can get back data about Herman Melville. For pragmatic reasons you may want to store some of that data locally in your database, but you don’t need to store all of it. If you suspect it has been updated, or you need to pull down more data, you simply fetch the URL again. But what if the website that minted that URL is no longer available? Or what if the website is still up, but the original DNS registration expired and someone is cybersquatting on it?
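For now, at least, resolving the identifier is just an HTTP GET with a bit of content negotiation. A minimal sketch, assuming id.loc.gov keeps serving RDF this way:

```python
# A minimal sketch of resolving the identifier, assuming id.loc.gov
# keeps honoring content negotiation for RDF.
import requests

url = "http://id.loc.gov/authorities/names/n79006936"
resp = requests.get(url, headers={"Accept": "application/rdf+xml"})
resp.raise_for_status()

# The body is an RDF description of Herman Melville that you could
# cache locally and re-fetch whenever you suspect it has changed.
print(resp.headers.get("Content-Type"))
print(len(resp.content), "bytes of RDF")
```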

Admittedly some work has happened at the Dublin Core Metadata Initiative around the preservation of Linked Data vocabularies. The DCMI is taking a largely social approach to this problem, where vocabulary owners and memory institutions interact within the context of a particular trust framework centered on DNS. But the preservation of vocabularies (which are also identified with URLs) is really just a subset of the much larger problem of Web preservation more generally. Does web preservation have anything to offer for the preservation of Linked Data?

When reading the conversation Martin started I was reminded of a demo my colleague Chris Adams gave that showed how World Digital Library item metadata can be retrieved from the Internet Archive. WDL embeds item metadata as microdata in its HTML, and since the Internet Archive archives that HTML, you can get the metadata back out of the archive.

So take one of WDL’s item pages. It turns out that the particular page I looked at has been archived by the Internet Archive 27 times.

With a few lines of Python, then, you can use the Internet Archive as a metadata service, pulling the item metadata back out of the microdata in the archived HTML.
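A rough sketch of the idea, using the Wayback Machine’s availability API and the microdata library (the item URL and the library choices here are just one way to do it):

```python
# A rough sketch: find the Wayback Machine's closest snapshot of a WDL
# item page and pull the item metadata back out of the microdata in its
# HTML. The item URL is an arbitrary example, and the library choices
# are mine.
import microdata   # pip install microdata
import requests

wdl_url = "http://www.wdl.org/en/item/1/"

# Ask the Wayback Machine availability API for its closest snapshot
# (this assumes at least one snapshot exists).
avail = requests.get("http://archive.org/wayback/available",
                     params={"url": wdl_url}).json()
snapshot = avail["archived_snapshots"]["closest"]["url"]

# Fetch the archived HTML and parse the microdata out of it.
html = requests.get(snapshot).text
for item in microdata.get_items(html):
    print(item.json())
```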

Similarly, you can get the LC Name Authority record for Herman Melville back out of the Internet Archive using the RDFa that is embedded in the archived page.
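Same pattern, just with an RDFa parser instead of a microdata one; extruct here is simply one library that will do the job:

```python
# Same idea, but for the id.loc.gov page about Melville, pulling the
# RDFa out of the archived HTML. The extruct library is just one way
# to parse RDFa.
import json
import extruct   # pip install extruct
import requests

lccn_url = "http://id.loc.gov/authorities/names/n79006936"

avail = requests.get("http://archive.org/wayback/available",
                     params={"url": lccn_url}).json()
snapshot = avail["archived_snapshots"]["closest"]["url"]

html = requests.get(snapshot).text
data = extruct.extract(html, syntaxes=["rdfa"])
print(json.dumps(data["rdfa"], indent=2))
```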

Since it is linked directly from the HTML page, the Internet Archive has also archived the RDF/XML itself, and it actually even has the MARC XML, if that’s the sorta thing you are into.

But, as my previous post about perma.cc touched on, a solution to archiving something as important as the Web can’t realistically rely on a single point of failure (the Internet Archive). We can’t simply decide not to worry about archiving the Web because Brewster Kahle is taking care of it. We need lots of copies of linked data to keep stuff safe.

Fortunately, web archiving is going on at a variety of institutions. But if you have a URL for a webpage, how do you know what web archives have a copy? Do you have to go and ask each one? How do you even know which ones to ask? How do you ask?

The Memento project worked on aggregating the holdings of web archives in order to provide a single place to look up a URL for their Firefox plugin. They also ended up proxying some sources, like Wikipedia, that they couldn’t convince to support the Memento protocol. From what little I’ve heard about this process it was done in an ad hoc fashion, leaning on personal relationships in the IIPC, and it was fairly resource intensive, to the point where it was more efficient to use the sneakernet to get the data. If I’m misremembering that, I trust someone will let me know.
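For what it’s worth, the protocol side of a Memento lookup is pretty small: you ask a TimeGate for the copy of a URL closest to a given datetime and it redirects you to a memento. A sketch, with the caveat that the aggregator endpoint used here is an assumption and has moved around over the years:

```python
# A sketch of a Memento (RFC 7089) lookup: ask a TimeGate for the copy
# of a URL closest to a given datetime. The aggregator endpoint below
# is an assumption.
import requests

timegate = "http://timetravel.mementoweb.org/timegate/"
target = "http://id.loc.gov/authorities/names/n79006936"

resp = requests.get(
    timegate + target,
    headers={"Accept-Datetime": "Tue, 01 Oct 2013 00:00:00 GMT"},
)

# The TimeGate redirects to the memento closest to the requested time,
# and tells you when that copy was made.
print(resp.url)
print(resp.headers.get("Memento-Datetime"))
```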

Earlier this year, David Rosenthal posted some interesting ideas on how to publish the holdings of web archives so that it is not so expensive to aggregate them. His idea is basically for web archives to publish the hostnames of the websites they archive, instead of complete lists of URLs and all the times they have been fetched. An aggregator could collect these lists and then hint to clients which web archives a given URL might be found in, pushing the work of actually polling the archives for that URL onto the client. It would also mean there would be space for more than one aggregator, since running one wouldn’t be so resource intensive.
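To make that concrete, here is a toy sketch of how a client might work under Rosenthal’s scheme; the hint endpoint, its JSON shape, and the per-archive TimeMap URLs are all hypothetical:

```python
# A toy sketch of Rosenthal's proposal as I read it: archives publish
# the hostnames they hold, an aggregator merges those lists into hints,
# and the client does the per-URL polling itself. The hint endpoint,
# its JSON shape, and the per-archive TimeMap URLs are hypothetical.
from urllib.parse import urlparse
import requests

def find_copies(url):
    host = urlparse(url).netloc

    # Hypothetical aggregator: which archives claim to hold anything
    # from this hostname?
    hints = requests.get("https://aggregator.example.org/hints",
                         params={"host": host}).json()

    # The client, not the aggregator, asks each hinted archive whether
    # it actually has this particular URL (here via a Memento TimeMap).
    copies = []
    for archive in hints.get("archives", []):
        resp = requests.get(archive["timemap"] + url)
        if resp.status_code == 200:
            copies.append(resp.url)
    return copies

print(find_copies("http://id.loc.gov/authorities/names/n79006936"))
```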

I really like Rosenthal’s idea, and hope that we will see a simple pattern for publishing the holdings of web archives that doesn’t turn running an aggregator service into an expensive problem. At the same time the solution needs to stay simple enough that publishing these inventories doesn’t become an onerous process that web archives never get around to doing. It would be nice to see the bar lowered so that smaller institutions and even individuals could get in the game, not just national libraries. I also hope we can find a simple place to maintain a list of where these host inventories live, similar to the Name Assigning Authority Number (NAAN) registry that is used (and mirrored) as part of the ARK identifier scheme.
