2014-05-24

The Story So Far ...

I'm working on a project to analyze OSM contribution history in the United States. One way to do this is to use the changeset dumps, which contain fields for the username of the contributor, the time of the contribution and the bounding box of all the edits made in a particular changeset. Using this data, and approximating the location of each edit as the center of its bounding box, it is possible to do a lot of analysis about how contribution activity has changed and evolved in different regions around the country. My previous diary entry is one example of such an analysis.
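To make the bounding-box approximation concrete, here is a minimal Python sketch. It assumes the changeset dump is XML with one changeset element per changeset carrying user, created_at, min_lat, min_lon, max_lat and max_lon attributes; the two-changeset SAMPLE string and the function names are made up for illustration. Changesets without a bounding box are simply skipped.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the changeset dump's XML layout (not real data).
SAMPLE = """<osm>
  <changeset id="1" user="alice" created_at="2014-05-01T12:00:00Z"
             min_lat="40.0" min_lon="-75.0" max_lat="41.0" max_lon="-73.0"/>
  <changeset id="2" user="bob" created_at="2014-05-02T08:30:00Z"/>
</osm>"""


def bbox_center(changeset):
    """Return (lat, lon) of the bounding-box center, or None if there is no bbox."""
    try:
        min_lat = float(changeset.attrib["min_lat"])
        max_lat = float(changeset.attrib["max_lat"])
        min_lon = float(changeset.attrib["min_lon"])
        max_lon = float(changeset.attrib["max_lon"])
    except KeyError:
        return None
    return ((min_lat + max_lat) / 2.0, (min_lon + max_lon) / 2.0)


def changeset_centers(xml_text):
    """Collect (user, timestamp, center) for every changeset that has a bbox."""
    rows = []
    for changeset in ET.fromstring(xml_text).iter("changeset"):
        center = bbox_center(changeset)
        if center is not None:
            rows.append((changeset.get("user"), changeset.get("created_at"), center))
    return rows


for row in changeset_centers(SAMPLE):
    print(row)
```

For a real dump you would want a streaming parser such as ET.iterparse instead of loading the whole file into memory.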

History Files Here We Come!

However, the approach of analyzing changesets quickly comes to a dead end if you want to understand the type of contributions. What are people actually doing when they edit in a certain area? Are they adding new subdivisions (likely to be lots of ways and "highway" tags here), are they adding useful metadata to existing streets (lots of maxspeed and oneway tags here), or are they adding POIs (amenity tags) and natural features like water bodies, hiking trails, etc.?

The changeset files are absolutely useless for understanding this kind of activity. If you want to do such an analysis you have to look to the history files. History files are exactly like planet XML files, but with every past version of a particular node, way or relation also recorded. As in the planet XML, each feature comes with a changeset id, so you can reconstruct changesets from this file, and then look into what is actually going on in the data.
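As a rough sketch of what "reconstructing changesets" can look like, the Python below groups object versions by their changeset attribute and counts the tag keys seen in each changeset. The SAMPLE_HISTORY string is invented for illustration and assumes the usual OSM XML layout (node, way and relation elements with tag children). Note that it counts the tags present on each version, not only the tags that changed relative to the previous version.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical history snippet: two versions each of one node and one way.
SAMPLE_HISTORY = """<osm>
  <node id="1" version="1" changeset="10" lat="40.0" lon="-75.0"/>
  <node id="1" version="2" changeset="11" lat="40.0" lon="-75.0">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="5" version="1" changeset="10">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
  <way id="5" version="2" changeset="11">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="maxspeed" v="25 mph"/>
  </way>
</osm>"""


def tag_keys_by_changeset(xml_text):
    """Map changeset id -> Counter of tag keys on the versions it created."""
    counts = defaultdict(Counter)
    for element in ET.fromstring(xml_text):
        if element.tag not in ("node", "way", "relation"):
            continue
        changeset_id = int(element.get("changeset"))
        for tag in element.findall("tag"):
            counts[changeset_id][tag.get("k")] += 1
    return counts


for changeset_id, keys in sorted(tag_keys_by_changeset(SAMPLE_HISTORY).items()):
    print(changeset_id, dict(keys))
```

A changeset dominated by highway keys hints at new streets, while one with maxspeed or oneway suggests metadata being added to existing ones.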

Great! So how do we do this exactly?

The one problem with using history files is that a number of traditional tools do not work with this file format. Osmfilter, which is great for filtering certain feature types, does not work with .osh files, and neither does Osmosis -- both tools I was using extensively in my previous work.

One way to go is to use Osmconvert to convert the entire .osh file to csv, and then manipulate this file using standard command line tools like sed. Unfortunately, this approach scales poorly -- I was using the history file for all of the US, and the csv version of this file can get pretty large, pretty quickly.

So, how do we get there? Enter Osmium! Osmium is a wonderful tool written by Jochen Topf that provides C++ libraries and header files to work with OSM data. Yes, you read that right: C++! Finally it was time for me to leave my comfortable world of Python and dynamic typing and try to remember the C++ I learned back in engineering school a few years ago!

Fortunately, Osmium ships with some nice examples and has very helpful community members, which makes it not terribly hard to pick up! What's more, Osmium has recently been redesigned and now comes with a shiny new website and some nice documentation.

Exact Steps to Osmium Glory!

So using all those as helpers, this is how I proceeded:

First, thanks to wonderful work by MaZderMind, there are a number of "history extracts" available so that you don't have to begin with the planet history file. I downloaded the North America extract to start with.

Then, using MaZderMind's wonderful OSM history splitter (which is based on Osmium) and a bounding box for the continental USA, given as (longitude, latitude) corners (-124.848974, 24.396308) to (-66.885444, 49.384358), I created an extract for my analysis. Installing OSM history splitter is fairly straightforward if you follow the instructions on the GitHub page.
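If you want to sanity-check a bounding box like this one before cutting an extract, a tiny point-in-box test is enough. This is only a sketch: the constant and function names are my own, and it says nothing about how OSM history splitter itself expects the box to be configured.

```python
# Bounding box from the text as (min_lon, min_lat, max_lon, max_lat).
CONTINENTAL_US = (-124.848974, 24.396308, -66.885444, 49.384358)


def in_bbox(lon, lat, bbox=CONTINENTAL_US):
    """Return True if the point lies inside (or on the edge of) the box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


print(in_bbox(-77.0365, 38.8977))  # Washington, DC
print(in_bbox(-149.9, 61.2))       # Anchorage, Alaska
print(in_bbox(-79.38, 43.65))      # Toronto, Canada
```

Keep in mind that a rectangle is a blunt instrument: this one also covers parts of Canada and Mexico, so Toronto counts as "inside" while Alaska and Hawaii do not.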

Then, in order to extract relevant elements from the history file, I turned to Osmium. The original plan was to use OSM History importer, which is another tool based on Osmium, to import all of the data into a PostGIS database and then run queries on this data -- but given the nature of my requirement, I thought this was overkill. Installing Osmium was fairly straightforward for me (although I've read that others have had trouble) -- I installed the Debian packaged versions of all requirements listed here, then git cloned the repo, and I was set!

And voila! Here is the script that extracts amenities and highways from the history file and records the lat/lon for each (for highways, it records the lat/lon of the first node).

It took me a while to understand how Osmium works, so I thought I would make a few notes to help others out.

First, you must remember that Osmium is a header-only library project. There are no executables that come with libosmium, but you should definitely play around with osmium-contrib -- for my particular requirement I found this program to be very helpful.

Second, Osmium comes with a program called osmium-tool that can do a limited number of things. Your requirement might actually be satisfied by one of these pre-coded tools, in which case you are all set! Look at the usage here.

In order to understand my script, you need to understand a few things. Osmium can read a large number of OSM-related file formats, so I create a reader object that reads the file I'm interested in. Then osmium::apply iterates through all the objects in the file (i.e. nodes, ways and relations) and calls the "handlers" for each object. In my case, I have a "location handler" that reads in the locations of all the nodes and associates them with the ways (in OSM, the nodes have coordinates, while ways only have node references). Once the location handler has been called, the "names handler" is called. The names handler calls the "node" function for all nodes and the "way" function for all ways. In these functions I include the logic for what I want to do with the data, in this case extracting features that have the relevant tags and writing them to stdout.
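Since this flow took me a while to get my head around, here is a pure-Python imitation of it. To be clear, this is not the Osmium API and not my actual script: the classes, the dict-based objects and the apply function are all invented for illustration, and they only mimic the idea of passing each object through a chain of handlers. It also relies on nodes appearing before ways in the input and, for simplicity, keeps only the last location seen for each node id, which a real history analysis would need to handle more carefully.

```python
class LocationHandler:
    """Remembers node coordinates so that later ways can look them up."""

    def __init__(self):
        self.locations = {}

    def node(self, obj):
        self.locations[obj["id"]] = (obj["lat"], obj["lon"])

    def way(self, obj):
        # Ways only carry node references, so attach the first node's location.
        first = obj["nodes"][0] if obj["nodes"] else None
        obj["first_location"] = self.locations.get(first)


class NamesHandler:
    """Collects amenity nodes and highway ways together with a lat/lon."""

    def __init__(self):
        self.rows = []

    def node(self, obj):
        if "amenity" in obj["tags"]:
            self.rows.append(("node", obj["id"], "amenity",
                              obj["tags"]["amenity"], obj["lat"], obj["lon"]))

    def way(self, obj):
        location = obj.get("first_location")
        if "highway" in obj["tags"] and location is not None:
            self.rows.append(("way", obj["id"], "highway",
                              obj["tags"]["highway"], location[0], location[1]))


def apply(objects, *handlers):
    """Call each handler's node/way callback for every object, in order."""
    for obj in objects:
        for handler in handlers:
            callback = getattr(handler, obj["type"], None)
            if callback is not None:
                callback(obj)


# Hypothetical input: nodes first, then ways, as in an OSM file.
objects = [
    {"type": "node", "id": 1, "lat": 40.0, "lon": -75.0, "tags": {}},
    {"type": "node", "id": 2, "lat": 40.1, "lon": -75.1, "tags": {"amenity": "cafe"}},
    {"type": "way", "id": 5, "nodes": [1, 2], "tags": {"highway": "residential"}},
]

names = NamesHandler()
apply(objects, LocationHandler(), names)
for row in names.rows:
    print(*row, sep=",")
```

The order of the handlers matters: the location handler has to see each way before the names handler does, otherwise the way has no coordinates to report.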

Here are links to some more documentation to help with your Osmium Project.

Florian Ledermann's article

Video overview on Osmium

API documentation

Osmium is an extremely powerful (and fast!) tool for doing lots of amazing things with OSM data. In fact, my favorite, Taginfo, uses Osmium on the backend. I'd highly recommend it as a tool for any heavy-duty history file processing!
