2015-05-03

Introduction

In the previous post we successfully queried the limited IPv4 range table in DynamoDb and found the geoname ID that belongs to a single IP. We used the 3 available integer properties in the table to narrow down the number of records that had to be scanned, which reduced both the query execution time and the risk of exceptions.

In this post we’ll start the same process for the lng/lat coordinate range. More specifically we’ll prepare the raw data file that can be uploaded into DynamoDb through S3. The process will be very similar to what we saw in this post where we created the IPv4 range source file. It’s a good idea to quickly re-scan that post to remind yourself of the process.

Preparation

We’ll go with the same reduced Blocks-IPv4 CSV file we created in the post referred to above. Here’s a reminder:

I’ll go with the following range in the sample:

From the first record…

1.0.0.0/24,2077456,2077456,,0,0,,-27.0000,133.0000

…until the end of the 1.0.x range i.e….

1.0.255.0/24,1151254,1605651,,0,0,83110,7.9833,98.3667

The source file gives 275 records at the time of writing this post. I saved the file as IPv4-range-sample.csv.

Downloading the necessary libraries

Amazon have a demo library that demonstrates the lng/lat based geo-services through DynamoDb and we’ll reuse a lot of ideas from there. I’ve cloned the project from GitHub. There’s at least one very specific reason to do so: it contains two libraries that for some reason are not available in the Maven repository – at least I’ve been unable to locate them. They are located in the /lib folder of the cloned project, each in its own subfolder.

Each folder includes a JAR file. If you’re working with a Maven project you’ll need to save these in your Maven repository manually – the command below shows how.
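As a quick reminder, you install a standalone JAR file into the local Maven repository with the Maven install plugin from the command line. The group ID, artifact ID and version are entirely your own choice – the values below are just examples – but make sure the dependency declarations in your pom.xml use the same coordinates. Repeat the command for each JAR file:

mvn install:install-file -Dfile=path/to/the/library.jar -DgroupId=com.example.geo -DartifactId=first-lib-jar -Dversion=1.0.0 -Dpackaging=jar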

Furthermore, there’s an additional library called dynamo-geo-1.0.0.jar available in the root folder of the project.

Install that library in the local Maven repository as well, using the same install-file command.

Preparing the Maven project

Before we can transform the MaxMind source file into a DynamoDb-friendly import file we need to get the JAR dependencies into the Maven project. You can install the AWS Java SDK at this point as well although we’ll only use it later – you might even have it already from the previous posts.

Here’s the list of dependencies we’ll need for the lng/lat transformation and query process:
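Here’s a minimal sketch of the dependencies section of the pom.xml. The coordinates of the manually installed JARs are whatever you chose during the install-file step – the ones below are just the example values from above – and the AWS SDK version should simply be the latest one available when you read this:

<dependencies>
	<!-- available from the central Maven repository -->
	<dependency>
		<groupId>com.amazonaws</groupId>
		<artifactId>aws-java-sdk</artifactId>
		<version>1.9.0</version>
	</dependency>
	<!-- the manually installed libraries, using the coordinates you picked above -->
	<dependency>
		<groupId>com.example.geo</groupId>
		<artifactId>dynamo-geo</artifactId>
		<version>1.0.0</version>
	</dependency>
	<dependency>
		<groupId>com.example.geo</groupId>
		<artifactId>first-lib-jar</artifactId>
		<version>1.0.0</version>
	</dependency>
	<dependency>
		<groupId>com.example.geo</groupId>
		<artifactId>second-lib-jar</artifactId>
		<version>1.0.0</version>
	</dependency>
</dependencies>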

Transforming the IPv4 Block file

Actually we won’t transform the source file as such but rather read the necessary elements from it and create an input file for DynamoDb, ready to be imported from S3. Here’s a short description of the steps we’re going to take:

Create the source file for DynamoDb based on the reduced IPv4 sample

Upload it to S3

Import it from S3 into DynamoDb using the built-in bulk insertion tool

Before I present any code let’s go through, step by step, what it will need to carry out:

Read the reduced CSV source file line by line

Extract the geoname, longitude and latitude values

Save the geoname, lng and lat combination in a list to avoid duplicates. We don’t want to store the same lng/lat combination over and over again: a single city can cover a long range of IPs and we want to get rid of the duplicates in the database. Also, the DynamoDb import process will complain if it finds two identical records and the whole import will fail. The code will also omit the proxy and satellite locations where the source file has no longitude and latitude data

If the combination is unique then we build a single row in the DynamoDb import file. We use the libraries referenced above to build a geo-hash and a hash key for our table. The range key will simply be the record counter – as long as it’s unique and can be converted into a string you’ll be fine; I went for the easiest option. The geo-hash number is the result of a long and complex mathematical calculation that Google implemented in the S2 geometry library. It is a large number that uniquely represents a coordinate pair

Use all those elements to build a string that can be appended to a DynamoDb-formatted JSON file. DynamoDb cannot be fed just any textual source file: it needs to clearly show the boundaries of each data record and the type of each field. In this example we have numeric and string fields denoted by “n” and “s”. The individual elements must be delimited by start-of-text and end-of-text characters, i.e. 0x02 and 0x03

Insert the following code block into your project:
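What follows is a minimal sketch rather than production-ready code. I’m assuming that the dynamo-geo library exposes the S2Manager and GeoPoint classes with static generateGeohash and generateHashKey methods; the attribute names, the hash key length of 6 and the file names are my own choices, so adjust them to your table design:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

import com.amazonaws.geo.model.GeoPoint;
import com.amazonaws.geo.s2.internal.S2Manager;

public class LngLatImportFileBuilder
{
	// the control characters the DynamoDb import format expects as delimiters
	private static final char STX = 0x02;
	private static final char ETX = 0x03;

	public static void main(String[] args) throws Exception
	{
		Set<String> uniqueCombinations = new HashSet<String>();
		int recordCounter = 0;

		BufferedReader reader = new BufferedReader(new FileReader("IPv4-range-sample.csv"));
		PrintWriter writer = new PrintWriter("lnglat-coordinates.json");
		String line;
		while ((line = reader.readLine()) != null)
		{
			// columns in the MaxMind blocks file: network, geoname_id, registered_country_geoname_id,
			// represented_country_geoname_id, is_anonymous_proxy, is_satellite_provider,
			// postal_code, latitude, longitude
			String[] columns = line.split(",", -1);
			if (columns.length < 9) continue;
			if (columns[0].equals("network")) continue; // skip the header row if present
			String geonameId = columns[1];
			String latitude = columns[7];
			String longitude = columns[8];

			// proxy and satellite locations have no coordinates in the source file - omit them
			if (latitude.isEmpty() || longitude.isEmpty()) continue;

			// only keep the first occurrence of each geoname/lng/lat combination
			String combination = geonameId + "|" + longitude + "|" + latitude;
			if (!uniqueCombinations.add(combination)) continue;

			// the geo-hash comes from Google's S2 geometry library through the dynamo-geo wrapper,
			// the hash key is derived from the leading digits of the geo-hash
			GeoPoint geoPoint = new GeoPoint(Double.parseDouble(latitude), Double.parseDouble(longitude));
			long geohash = S2Manager.generateGeohash(geoPoint);
			long hashKey = S2Manager.generateHashKey(geohash, 6);

			recordCounter++;

			// one record per line: attribute name and typed value separated by ETX (0x03),
			// the attribute pairs separated by STX (0x02)
			StringBuilder record = new StringBuilder();
			record.append("hashKey").append(ETX).append("{\"n\":\"").append(hashKey).append("\"}").append(STX);
			record.append("rangeKey").append(ETX).append("{\"s\":\"").append(recordCounter).append("\"}").append(STX);
			record.append("geohash").append(ETX).append("{\"n\":\"").append(geohash).append("\"}").append(STX);
			record.append("geonameId").append(ETX).append("{\"n\":\"").append(geonameId).append("\"}").append(STX);
			record.append("lng").append(ETX).append("{\"n\":\"").append(longitude).append("\"}").append(STX);
			record.append("lat").append(ETX).append("{\"n\":\"").append(latitude).append("\"}");
			writer.println(record.toString());
		}

		writer.close();
		reader.close();
		System.out.println("Wrote " + recordCounter + " unique records.");
	}
}

A longer hash key spreads the records across more partitions at the cost of more query round-trips later; I went with 6 here.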

It’s probably best if you run that code in Debug mode and see what happens exactly. If everything goes well then you should have a .json file with one record per line where, if you open the file in e.g. Notepad++, the STX and ETX control characters show up as visible delimiters between the fields.

Note that the unique data record counter stopped at 42 for the reduced IPv4 range data source I selected above, so out of the first 275 rows in the current MaxMind CSV file 233 were filtered out as duplicates. You can expect the lng/lat range table to be much smaller than the IPv4 range table after all duplicates have been removed. In our real-life case we have just over 134k records in our lng/lat range table compared to slightly above 10 million rows in the full IPv4 range table.

Upload the file to S3 into a dedicated folder. Make sure that this input file is the only object in that folder. You can also already create another, empty folder called “logs” where the DynamoDb import process will send its log messages.
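If you’d rather upload the file in code than through the AWS console then the AWS Java SDK we installed before can do it in a couple of lines. The bucket and folder names below are of course just examples:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3Client;

public class ImportFileUploader
{
	public static void main(String[] args)
	{
		// reads the AWS credentials from the default provider chain
		AmazonS3Client s3Client = new AmazonS3Client();
		// example bucket and folder names - the input file must be the only object in the folder
		s3Client.putObject("my-geo-bucket", "lnglat-source/lnglat-coordinates.json", new File("lnglat-coordinates.json"));
	}
}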

We’ll see how to upload these records into DynamoDb in the next post.

View all posts related to Amazon Web Services and Big Data here.
