2015-05-23

An interesting piece, as ever, from Tim Davies (Slow down with the standards talk: it’s interoperability & information quality we should focus on) reflecting on the question of whether we need more standards, or better interoperability, in the world of (open) data publishing. Tim also links out to Friedrich Lindenberg’s warnings about 8 things you probably believe about your data standard, which usefully mock some of the claims often casually made about standards adoption.

My own take on many standards in the area is that conventions are the best we can hope for, and that even then they will be interpreted in a variety of ways, which means you have to be forgiving when trying to read them. All manner of monstrosities have been published in the guise of being HTML or RSS, so parsers have had to do the best they can to get the mess into a consistent internal representation on the consumer side of the transaction. Publishers can help by testing that whatever they publish does appear to parse correctly with the current “industry standard” importers, ideally open code libraries. It’s then up to the application developers to decide which parser to use, or whether to write their own.

It’s all very well standardising your data interchange format, but the application developer will then want to work on that data using some other representation in a particular programming language. Even if you have a formal standard interchange format, and publishers stick to it religiously and unambiguously, you will still get different parsers generating potentially very different internal representations for the application code to work on, which may even have different semantics. [I probably need to find some examples of that to back up that claim, don’t I?!;-)]
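By way of a quick illustration of the sort of mismatch I mean (assuming nothing more than the standard library and pandas), the same two-line CSV comes out quite differently depending on the parser: Python’s built-in csv module gives you strings all the way down, whereas pandas infers a numeric column and turns the missing value into a NaN.

```python
import csv, io
import pandas as pd

raw = "amount,supplier\n100,ACME LTD\n,J. Smith\n"

# Stdlib csv: every field is a string, and the missing amount is just ''
rows = list(csv.DictReader(io.StringIO(raw)))
print(type(rows[0]["amount"]), repr(rows[1]["amount"]))   # <class 'str'> ''

# pandas: the amount column is inferred as float, and the gap becomes NaN
df = pd.read_csv(io.StringIO(raw))
print(df["amount"].dtype, df.loc[1, "amount"])            # float64 nan
```

Two in-memory representations of the “same” data, with different types and different notions of what a missing value is.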

I also look at standards from the point of view of trying to get things done with tools that are out there. I don’t really care if a geojson feed is strictly conformant with any geojson standard that’s out there, I just need to know that something claimed to be published as geojson works with whatever geojson parser the Leaflet Javascript library uses. I may get frustrated by the various horrors that are published using a CSV suffix, but if I can open them using pandas (a Python programming library), RStudio (an R programming environment) or OpenRefine (a data cleaning application), I can work with them.
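In practice that often means nothing more sophisticated than throwing the file at the tools and seeing what sticks. A rough sketch of the sort of forgiving loader I mean (the filename is made up, and latin-1 will swallow more or less anything, rightly or wrongly):

```python
import pandas as pd

def forgiving_read_csv(path):
    """Try a few likely encodings in turn - a crude sketch, not a robust solution."""
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            return pd.read_csv(path, encoding=enc)
        except (UnicodeDecodeError, ValueError):
            continue
    raise ValueError("Couldn't read {} with any of the usual encodings".format(path))

# df = forgiving_read_csv("council_spending_2015_04.csv")  # hypothetical file
```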

At the data level, if councils published their spending data using the same columns, and the same number, character and data formats for those columns, it would make life much easier when aggregating those datasets. But even then, different councils use the same columns differently. Spending area codes or directorate names are not necessarily standardised across councils, so just having a spending area code or directorate name column (similarly identified) in each release doesn’t necessarily help.

What is important is that data publishers are consistent in what they publish so that you can start to take into account their own local customs and work around those. Of course, internal consistency is also hard to achieve. Look down any local council spending data transaction log and you’ll find the same company described in several ways (J. Smith, J. Smith Ltd, JOHN SMITH LIMITED, and so on), some of which may match the way the same company is recorded by another council, some of which won’t…
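To give a flavour of the sort of local clean-up that entails, a crude normaliser can knock the obvious variation out of supplier names, though it still won’t make everything line up (this is a throwaway sketch rather than anything principled):

```python
import re

def normalise_supplier(name):
    """Crude supplier name normalisation - a throwaway sketch, not a general solution."""
    name = name.lower().strip()
    name = re.sub(r"[.,]", "", name)                    # drop punctuation
    name = re.sub(r"\b(ltd|limited|plc)\b", "", name)   # drop common company suffixes
    return re.sub(r"\s+", " ", name).strip()

for raw in ["J. Smith", "J. Smith Ltd", "JOHN SMITH LIMITED"]:
    print(normalise_supplier(raw))
# j smith / j smith / john smith - closer, but still not the same, which is rather the point
```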

Stories are told from the Enigma codebreaking days of how the wireless listeners could identify Morse code operators by the cadence and rhythm of their transmissions, as unique to them as any other personal signature (you know that the way you walk identifies you, right?). In open data land, I think I can spot a couple of different people entering transactions into local council spending transaction logs, where the systems aren’t using controlled vocabularies and selection box or dropdown list entry methods, but instead support free text entry… Which is to say: even within a standard data format (a spending transaction schema) published using a conventional (though variously interpreted) document format (CSV) that may be variously encoded (UTF-8, ASCII, Latin-1), the stuff in the data file may be all over the place…

An approach I have been working towards for my own use over the last year or so is to adopt a working environment for data wrangling and analysis based around the Python pandas programming library. It’s then up to me how to represent things internally within that environment, and how to get the data into that representation. The first challenge is getting the data in, the second getting it into a state where I can start to work with it, the third getting it into a state where I can normalise it and then aggregate it and/or combine it with other data sets.
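In pandas terms, that pipeline usually ends up looking something like the following sketch (the filename and column names are invented, and the details change with every dataset):

```python
import pandas as pd

# 1. Get the data in (hypothetical filename and encoding)
df = pd.read_csv("spending_april.csv", encoding="latin-1")

# 2. Get it into a workable state: consistent column names and sensible types
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["payment_date"] = pd.to_datetime(df["payment_date"], dayfirst=True, errors="coerce")

# 3. Normalise, then aggregate and/or combine with other datasets
df["supplier"] = df["supplier"].str.strip().str.lower()
totals_by_supplier = df.groupby("supplier")["amount"].sum()
```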

So for example, I started doodling a wrapper for nomis and looking at importers for various development data sets. I have things that call on the Food Standards Agency datasets (and when I get round to it, their API) and scrape reports from the CQC website, I download and dump Companies House data into a database, and have various scripts for calling out to various Linked Data endpoints.
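The download-and-dump pattern is nothing fancy: pull the bulk file, load it with pandas, push it into SQLite, query it back later. A rough sketch (the filenames and table name here are placeholders, not the real ones):

```python
import sqlite3
import pandas as pd

# Placeholder filename standing in for a Companies House bulk data file
df = pd.read_csv("BasicCompanyData-part1.csv", encoding="latin-1", dtype=str)

conn = sqlite3.connect("companieshouse.db")
df.to_sql("companies", conn, if_exists="replace", index=False)

# ...and then pull bits back out as and when needed (column name assumed)
hits = pd.read_sql_query(
    "SELECT * FROM companies WHERE CompanyName LIKE 'JOHN SMITH%'", conn)
conn.close()
```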

Where different publishers use the same identifier schemes, I can trivially merge, join or link the various data elements. For approxi-matching, I run ad hoc reconciliation services.
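In toy form, the two cases look something like this (the data and column names are made up, and difflib is standing in here for a proper reconciliation service):

```python
import difflib
import pandas as pd

spend = pd.DataFrame({"supplier": ["john smith", "acme trading"],
                      "amount": [100.0, 250.0]})
companies = pd.DataFrame({"company_name": ["JOHN SMITH LIMITED", "ACME TRADING LTD"],
                          "company_number": ["01234567", "07654321"]})

# Where both sides carry a shared identifier, a straight merge/join does the work:
# spend.merge(companies, on="company_number")

# Otherwise, fall back to crude approximate matching on names
def best_match(name, candidates):
    hits = difflib.get_close_matches(name.lower(), [c.lower() for c in candidates],
                                     n=1, cutoff=0.6)
    return hits[0] if hits else None

spend["matched_name"] = spend["supplier"].apply(
    lambda s: best_match(s, companies["company_name"]))
print(spend)
```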

All this is to say that, at the end of the day, the world is messy and standardised things often aren’t. Integration occurs in your application, which is why it can be handy to be able to code a little, so you can whittle and fettle the data you’re presented with into a representation and form that you can work with. Wherever possible, I use libraries that claim to be able to parse particular standards and put the data into representations I can cope with, and then where data is published in various formats or standards, go for the option that I know has library support.

PS I realise this post stumbles all over the stack, from document formats (eg CSV) to data formats (or schema). But it’s also worth bearing in mind that just because two publishers use the same schema, you won’t necessarily be able to sensibly aggregate the datasets across all the columns (eg in spending data again, some council transaction codes may be parseable and include dates, accession based order numbers or department codes, while others may just be jumbles of numbers). And just because two things have the same name and the same semantics, doesn’t mean the format will be the same (2015-01-15, 15/1/15, 15 Jan 2015, etc etc).
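(For what it’s worth, pandas (via dateutil) will happily munge those particular date variants into the same thing, though you can’t rely on it guessing right in general:)

```python
import pandas as pd

for raw in ["2015-01-15", "15/1/15", "15 Jan 2015"]:
    print(pd.to_datetime(raw, dayfirst=True).date())
# all three parse to 2015-01-15
```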
