2014-09-15

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

During July I was working with a commercial data source that provides
extra data around IP addresses, and it dawned on me: rather than pinging
billions of IP addresses and creating a map, I could create a map from
all the geolocation data I had at my fingertips. At a high level I could
answer “Where are all the IPv4 addresses worldwide?” But in reality what
I created was a map communicating “Where do the geolocation services
think all the IPv4 addresses are worldwide?” At the end of July I put
together a plot in about an hour and tossed it onto Twitter. It is still
getting retweets over a month later, in spite of the redundancy in the
title.



Bob and I have talked quite a bit before about the (questionable) value
of maps and how they can be eye-catching, but they often lack the
substance to communicate a clear message. The problem may be compounded
when IP geolocation is the data source for maps. Hopefully I can point
out some of the issues in this post as we walk through how to gather and
map every IPv4 address in the world.

Step 2: Get the data

I already did step 1 by defining our goal, which as a question is
“Where does the geolocation service think all the IPv4 addresses are
worldwide?” Step 2, then, is getting data to support our research. When I
created the original map I used data from a commercial geolocation
service. Since most readers won’t have a subscription, we can use
MaxMind and their free geolocation data. Start by downloading the
“GeoLite City” database in CSV/zip format (28 MB download) and unzip it
to get the “GeoLiteCity-Location.csv” file. Since the first line of the
CSV is a copyright statement, you have to skip one line when reading it
in. Because this is quite a big file, you should leverage the fread()
function from the data.table package.
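As a minimal sketch of that read step, the snippet below writes a tiny stand-in file and reads it back with fread(), skipping the copyright line. The column names and rows are assumptions made up for illustration; with the real download you would point fread() at “GeoLiteCity-Location.csv” instead.

```r
library(data.table)

# Tiny stand-in for GeoLiteCity-Location.csv: one copyright line, then a
# header and a few rows. Column names and values are assumed for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Copyright line (placeholder text)",
  "locId,country,latitude,longitude",
  "1,AU,-27,133",
  "2,CN,35,105",
  "3,US,38,-97"
), tmp)

# skip = 1 jumps over the copyright line so the header is read correctly
geo <- fread(tmp, skip = 1, header = TRUE)
print(geo)

# With the real file this would be something like:
# geo <- fread("GeoLiteCity-Location.csv", skip = 1, header = TRUE)
```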

Right away you can see some challenges with IP geolocation. There
are around 4.3 billion total IPv4 addresses (2^32), of which roughly 3.7
billion are routable (over half a billion are reserved), and yet the data
only has a total of 557,986 unique rows. It’s a safe bet that many
addresses are aggregated together.

You can jump right to a map here and plot the latitude/longitude in that
file, but to save processing time, you can remove duplicate points with
the unique function. Then load up a world map, and plot the points on it.
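A rough sketch of that first map might look like the following. It assumes ggplot2 (with the maps package installed for map_data()) and uses a handful of invented coordinates in place of the real location file; the red color and alpha of 1/10 are plotting choices, not requirements.

```r
library(ggplot2)  # map_data() also requires the 'maps' package to be installed

# Invented stand-in for the latitude/longitude columns of the location file
geo <- data.frame(
  latitude  = c(38.0, 38.0, 51.5, -33.9, 35.7),
  longitude = c(-97.0, -97.0, -0.1, 151.2, 139.7)
)

# Remove duplicate coordinate pairs to save plotting time
pts <- unique(geo[, c("latitude", "longitude")])

# World outline as a data frame of polygon vertices
world <- map_data("world")

gg <- ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "gray90", colour = "gray60") +
  geom_point(data = pts, aes(x = longitude, y = latitude),
             colour = "red", alpha = 1/10, size = 1) +
  coord_quickmap() +
  theme_void()
```

Calling print(gg) draws the map on the active graphics device.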



That’s interesting. Notice that the alpha on the points is set to
1/10th, meaning it will take ten points on top of one another to make the
color solid (red in this case). One thing we didn’t do, though, is account
for the density of the IP addresses. Some of those points may have
thousands of addresses while others may have just a handful, and the map
doesn’t show that. To account for that, you have to load the other file
in the zip file, the GeoLiteCity-Blocks file, and merge it with the first
file loaded.
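The merge can be sketched with base R as below. Both data frames are invented stand-ins, and the column names (locId, begin, end) are assumptions rather than the exact headers of the GeoLite files.

```r
# Invented stand-in for the location file
location <- data.frame(
  locId     = c(1, 2, 3),
  latitude  = c(-27, 35, 38),
  longitude = c(133, 105, -97)
)

# Invented stand-in for the blocks file; note that locId 3 has no block
blocks <- data.frame(
  begin = c(16777216, 16777472),
  end   = c(16777471, 16778239),
  locId = c(1, 2)
)

# all = TRUE keeps rows from both sides, filling gaps with NA
merged <- merge(blocks, location, by = "locId", all = TRUE)

# Keep just the four columns of interest
merged <- merged[, c("begin", "end", "latitude", "longitude")]
print(merged)
```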

What you are looking at here is four columns: the beginning and ending
address of an IP block, with the latitude and longitude of that block.
The IP addresses are stored in long (integer) format, which is both
easier to work with and smaller for memory/storage of the data. We’ll
make use of the long format in a bit; for now we still have more cleanup
to do. Notice the first line, where begin and end are both NA? That
either means there were empty values in the CSV or the merge command
didn’t have a matching record for that location ID, and because you set
all to TRUE in the merge command above, it filled in the row with NAs.
The default behavior is to drop any rows that aren’t in both tables, but
we overrode that by setting all=TRUE. We could take care of these NAs by
removing the all argument from the merge command and accepting the
default of FALSE. But this is interesting, because in our first plot we
just took all the latitudes and longitudes and plotted them. How many
don’t have corresponding IP address blocks?
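One way to answer that question is to count the rows where the block columns came back as NA. The sketch below uses invented toy tables (the names locId, begin and end are assumptions) and also shows that the default merge drops those orphaned locations.

```r
# Invented toy tables; locId 3 and 4 have no matching IP block
location <- data.frame(
  locId     = c(1, 2, 3, 4),
  latitude  = c(-27, 35, 38, 10),
  longitude = c(133, 105, -97, 20)
)
blocks <- data.frame(
  begin = c(16777216, 16777472),
  end   = c(16777471, 16778239),
  locId = c(1, 2)
)

# Outer merge: orphaned locations survive with NA in begin/end
outer <- merge(blocks, location, by = "locId", all = TRUE)
orphans <- sum(is.na(outer$begin))

# Default merge (all = FALSE): only locations with a block remain
inner <- merge(blocks, location, by = "locId")

cat("orphaned locations:", orphans, "\n")
cat("rows kept by default merge:", nrow(inner), "\n")
```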

430 thousand orphaned locations? That seems like a lot of unassociated
lat/long pairs, doesn’t it?

But keep going. You’ll want to do two more things with this data: 1)
count the number of IPs in each block and 2) total up the number of
IPs for each location. To do that efficiently from both a code
and time perspective, we’ll leverage dplyr. Let’s clean up the NAs
while we are at it.
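A dplyr pipeline along these lines would do all three jobs. This is a sketch on invented data, not the original code: because the addresses are stored as integers, the size of a block is simply end - begin + 1.

```r
library(dplyr)

# Invented merged data: three blocks over two locations, plus one orphan row
merged <- data.frame(
  begin     = c(16777216, 16777472, 16778240, NA),
  end       = c(16777471, 16778239, 16779263, NA),
  latitude  = c(-27, 35, 35, 10),
  longitude = c(133, 105, 105, 20)
)

by_loc <- merged %>%
  filter(!is.na(begin)) %>%               # drop locations with no IP block
  mutate(count = end - begin + 1) %>%     # addresses in each block (inclusive)
  group_by(latitude, longitude) %>%
  summarise(count = sum(count), .groups = "drop")  # total per lat/long pair

print(by_loc)
```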

Notice how we have 105,304 rows? That’s a far cry from the 557,986 rows
we had in the original latitude/longitude pairings you mapped.

Explore the data

What does the distribution of the counts look like? Chances are good
there is a heavy skew to the data. To create a plot where you can see
the distribution, you’ll have to change the axis showing the number of
addresses per lat/long pair to a logarithmic scale.
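A histogram with a log-scaled x axis is one way to draw this. The sketch below assumes ggplot2 and uses invented counts; the bin count of 30 is an arbitrary choice.

```r
library(ggplot2)

# Invented per-location address counts with a heavy right skew
by_loc <- data.frame(
  count = c(1, 8, 256, 256, 256, 512, 1024, 1024, 65536)
)

gg <- ggplot(by_loc, aes(x = count)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +   # log scale so the long tail stays readable
  labs(x = "Addresses per lat/long pair (log scale)",
       y = "Number of locations")
```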



I have a guess about where the spikes fall, and we can check by
converting the count field to a factor and running summary() against it.
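In base R the check could look like this. The counts are invented for illustration, so the values that dominate here say nothing about the real data.

```r
# Invented per-location counts
counts <- c(256, 256, 512, 256, 1024, 8, 256)

# summary() on a factor returns how often each distinct value occurs
tab <- summary(factor(counts))

# Most frequent values first
top <- sort(tab, decreasing = TRUE)
print(head(top))
```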

While that’s interesting, it’s not surprising or all that informative.

Back to the map. Right now you have three variables: the latitude,
longitude and count of addresses at the location. Lat and long are easy
enough, those are points on a map. How do you represent density at that
point? I think there are three viable options: color (hue), size or
opacity (color brightness). In my original plot, I leveraged the alpha
setting on the points. Trying to use hue would just get jumbled together,
since at the world view many of the points overlap and individual colors
would be impossible to see. With a hundred thousand+ points, sizes will
also overlap and be indistinguishable.

Since all we want to see is the relative density over the entire map,
the reader won’t care whether there are a lot of IP addresses at a point
or a whole lot of IP addresses at a point. Showing density is what’s
important, so let’s use the alpha (opacity) setting of the point to show
density. The alpha setting is a value between 0 and 1, and our counts are
large numbers with a heavy skew. To wrangle the range into an alpha
setting we should first take the log of the count and then scale it
between 0 and 1. I chose to apply log twice to shift the skew of the
distribution so most of the values are less than 0.5. Since the points
overlap, this should make a nice range of opacity for the points.
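The transform can be sketched as a double log followed by min-max scaling. One assumption here that is not from the original: log1p(), which computes log(1 + x), stands in for log() so that a count of 1 does not turn into log(0) = -Inf.

```r
# Invented per-location address counts
counts <- c(1, 256, 1024, 65536, 16777216)

# Apply a log twice to compress the heavy skew.
# Assumption: log1p() is used instead of log() to avoid -Inf for small counts.
temp <- log1p(log1p(counts))

# Min-max scale into [0, 1] so the result can be used as an alpha value
alpha <- (temp - min(temp)) / (max(temp) - min(temp))

print(round(alpha, 3))
```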

And now let’s map those!
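A sketch of that final map follows, assuming ggplot2 (plus the maps package for map_data()) and a toy data frame that already carries a precomputed alpha column. scale_alpha_identity() tells ggplot2 to use the alpha values as they are instead of rescaling them.

```r
library(ggplot2)  # map_data() also requires the 'maps' package to be installed

# Invented locations with a precomputed alpha between 0 and 1
by_loc <- data.frame(
  latitude  = c(38.0, 51.5, -33.9, 35.7),
  longitude = c(-97.0, -0.1, 151.2, 139.7),
  alpha     = c(1.0, 0.6, 0.3, 0.8)
)

world <- map_data("world")

gg <- ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "gray90", colour = "gray60") +
  geom_point(data = by_loc,
             aes(x = longitude, y = latitude, alpha = alpha),
             colour = "red", size = 0.3) +
  scale_alpha_identity() +   # use the alpha column values directly
  coord_quickmap() +
  theme_void()
```

To write the map to disk at a particular size, ggsave("ipv4-map.png", gg, width = 15, height = 10) is one option; the file name and dimensions are placeholders.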

And there you have it! There are several tweaks that could be made to
this. Notice that in this final map I set the point size to 0.3.
If you raise that, you can create a map that is very dense with color.
Keep in mind that the size of a point is relative to the size of the
output plot: if you export at 4x6, a point size of 0.3 may be huge, but
the points may barely show up if you export at 15x20. There is no set
formula and you can play around with the values, but be sure the final
product stays as close to the data as possible.

A Final Thought

We talked about the “Potwin effect” in our book, and Bob mentioned it in
his Statebins blog post as well. If you look closely, some of the
lat/long pairs are rounded off to whole integers. That may be a good
indication that the only thing known about the geolocation of the IP
address is the country. Further work could remove or otherwise account
for these uncertain points by matching on them: chances are good they are
the only lat/long pairs in the data where both values are whole numbers.
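One way to flag those pairs, sketched in base R on invented data, is to test whether both coordinates have no fractional part. What to do with the flagged rows (drop them, plot them separately) is left open.

```r
# Invented per-location data: two pairs are whole numbers, two are not
by_loc <- data.frame(
  latitude  = c(38, 51.5074, -27, 35.6895),
  longitude = c(-97, -0.1278, 133, 139.6917),
  count     = c(5000, 120, 3000, 450)
)

# TRUE where both latitude and longitude are whole numbers
is_whole <- by_loc$latitude %% 1 == 0 & by_loc$longitude %% 1 == 0

uncertain <- by_loc[is_whole, ]   # likely country-level guesses
precise   <- by_loc[!is_whole, ]  # everything else

print(uncertain)
```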
