2016-03-15

Python for geospatial data processing

Python is undoubtedly one of the most popular general-purpose programming languages today. There are many strong reasons for this, but in my opinion the most important ones are its open source license, the simplicity of its syntax, the batteries-included philosophy and an awesome global community.

One interesting example of an area where Python is being massively adopted is the scientific world. That explains the existence of communities like PyData and the SciPy ecosystem.

Another, more specific area where I see increasing interest in and use of Python is geospatial data processing. Proof of this is the number of well-known tools that use it, such as GDAL, ArcGIS, GRASS, QGIS and more.

The objective of this post is to show the advantages and power of the Python ecosystem in this particular field. I’m going to do this through an example of a complex task that is typical in this field: satellite image classification.

Satellite Image Classification in Python

Satellite images are classified for countless reasons. Classification has uses in agriculture, geology, emergency monitoring, surveillance, weather forecasting, economic studies, social sciences and more.

For this reason, many GIS and geospatial data management systems include tools to perform classifications. But this approach has some limitations: the process is manual and you usually have a very small set of options regarding the classification technique and its hyperparameters.

Another possibility is to classify using implementations in domain-specific languages (“DSL”) such as R, IDL, MATLAB, Octave, etc. But in this case you are usually limited to an experimental context.

Python, thanks to its scripting features and rich ecosystem, provides most of the benefits of those DSLs, so it’s great for research and quick prototyping. Moreover, being a widely adopted general-purpose language, it is also useful for developing production-ready, efficient, maintainable and scalable classification systems.

Therefore, let’s see how easy it is to perform an image classification, making use of the tools existing in the Python ecosystem for geospatial data processing. In a hundred lines of code we are going to develop a script to:

process a Landsat 8 GeoTiff image (raster data)

extract training and testing data from Shapefiles (vector data)

train and classify using a modern Machine Learning technique

assess the results

To keep it simple (KISS) and stay focused on the classification itself, I’m not going to delve into the depths of data pre-processing (calibration, geographic transformations, etc.). I know that is a basic part of the job, but it’s not what I want to expand on now.

As usual, most of the magic will be done by the tools we are going to be using as main dependencies:

GDAL

GDAL is a translator library for raster and vector geospatial data formats. It presents a single raster abstract data model and vector abstract data model for all supported formats.

It is implemented in C/C++, so it is highly performant, and it provides Python bindings.

Installing this library may not be a trivial task, especially for those who are not very familiar with the process of installing Python dependencies. In any case, the GDAL site has detailed instructions, which are summarized in a README file in the code repository related to this post.

scikit-learn

scikit-learn is an open source machine learning library for Python. It features various classification, regression and clustering algorithms. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

It is largely written in Python, with some core algorithms written in Cython (using C/C++) to achieve performance.

Example data

Thanks to the GDAL API, the program that we are going to develop throughout this post works with many different kinds of image formats and geographically corresponding vectors. But in case you don’t have a dataset at hand, you can download some sample data.

It includes part of a Landsat 8 image of an agricultural area and some synthetic vector data with samples of different crops. It is the kind of data used for precision farming products (for example, crop identification).



The file is a compressed data directory with three sub-directories:

image includes a GeoTiff file with a crop of an L1T Landsat scene (Landsat 8, OLI sensor, path 229, row 81, 2016-01-19 19:14:02 UTC). Only bands 1 to 7 are present (coastal aerosol, visible, NIR and SWIR, at 30 m resolution).

train has shapefiles with vector data to be used for training. Each file defines a class, that is: all the points, polygons, etc. existing within a shapefile are used to define samples of one class. In our sample data we have 5 classes, named A, B, C, D and E (not too fancy, I know).

test is similar to the train directory, but these samples are going to be used to verify the classification results.

Since this post will not focus on the results of the classification, I’m not going to go into any details about data quality, requirements or preparation. In case you want to know or discuss anything on this matter, please use the comments or send me an email.

Example program

Next, we will develop a script to classify geospatial data. A more Pythonic and complete version of the program can be found in the repo. That version includes logging, docstrings, comments, PEP 8 compliance, some error control and other good programming practices that we are not going to take into consideration in this post.

In the code repository you will also find the simpler version of the script described in this post. Before you download or copy-paste these lines into a file or a Python interpreter, make sure that you have installed the dependencies described above: GDAL and scikit-learn (which in turn requires NumPy and SciPy).

Preliminaries

Now we are ready to code. So, first things first: we import our main dependencies and define a list of colors:

From the last import line you can see that we are going to classify using the Random Forest technique. Thanks to scikit-learn, we could easily experiment with or compare many different classification techniques such as stochastic gradient descent, support vector machines, nearest neighbors, AdaBoost, etc.

The list of colors will be embedded in the GeoTiff output. This will allow you to easily visualize it in any standard image viewer program.

Next, let’s define some useful functions that we are going to use later. They make heavy use of the GDAL API to manipulate raster and vector data (the code is pretty self-explanatory).

Now we have all that we need to perform the actual classification. Let’s create some variables to define our input and output:

In the lines above, I assume we are using the sample data described before.

Training

Now, we will use the GDAL API to read the input GeoTiff: extract the geographic information and transform the band data into a NumPy array:
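As a rough sketch of this step (not the exact listing from the repo), reading a multi-band GeoTiff could look like the code below. The names read_geotiff and stack_bands are hypothetical, chosen only for illustration. The GDAL import is placed inside the function so that the NumPy part can be tried even without GDAL installed.

```python
import numpy as np


def stack_bands(band_arrays):
    """Stack 2-D band arrays into a single (rows, cols, n_bands) array."""
    return np.dstack(band_arrays)


def read_geotiff(path):
    """Return (bands_data, geo_transform, projection) for a raster file.

    Hypothetical helper: it assumes the file can be opened by GDAL and
    that all bands have the same size.
    """
    from osgeo import gdal  # imported lazily; requires the GDAL bindings

    dataset = gdal.Open(path, gdal.GA_ReadOnly)
    if dataset is None:
        raise IOError("GDAL could not open {}".format(path))

    # Geographic information that we want to keep for the output file.
    geo_transform = dataset.GetGeoTransform()
    projection = dataset.GetProjectionRef()

    # GDAL band indexes start at 1, not 0.
    bands = [dataset.GetRasterBand(index).ReadAsArray()
             for index in range(1, dataset.RasterCount + 1)]
    return stack_bands(bands), geo_transform, projection
```

With the sample image, which has 7 bands, the resulting array would have shape (rows, cols, 7), so each pixel is a vector with one value per band.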

Next, we’ll process the training data: project all the vector data in the training dataset into a NumPy array. Each class is assigned a label (a number between 1 and the total number of classes). If the value v in the position (i, j) of this new array is not zero, that means that the pixel (i, j) must be used as a training sample of class v.

training_samples is the list of pixels to be used for training. In our case, a pixel is a point in the 7-dimensional space of the bands.

training_labels is a list of class labels such that the i-th position indicates the class of the i-th pixel in training_samples.
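The relation between the labeled array, training_samples and training_labels can be sketched as follows. This is an illustration under assumptions, not the original listing: rasterize_class and extract_samples are hypothetical names, and the tiny arrays at the bottom are made-up data standing in for the real image.

```python
import numpy as np


def rasterize_class(shapefile_path, label, rows, cols, geo_transform, projection):
    """Burn every geometry of one shapefile into a (rows, cols) array.

    Pixels covered by the vector data get the value `label`, the rest get 0.
    Hypothetical helper; requires the GDAL bindings.
    """
    from osgeo import gdal  # imported lazily so the NumPy part runs without GDAL

    data_source = gdal.OpenEx(shapefile_path, gdal.OF_VECTOR)
    layer = data_source.GetLayer(0)
    # In-memory raster with the same grid as the input image.
    target = gdal.GetDriverByName("MEM").Create("", cols, rows, 1, gdal.GDT_UInt16)
    target.SetGeoTransform(geo_transform)
    target.SetProjection(projection)
    gdal.RasterizeLayer(target, [1], layer, burn_values=[label])
    return target.GetRasterBand(1).ReadAsArray()


def extract_samples(bands_data, labeled_pixels):
    """Select the labeled pixels and their class labels."""
    is_train = np.nonzero(labeled_pixels)      # positions where the label is not 0
    training_labels = labeled_pixels[is_train]
    training_samples = bands_data[is_train]    # one row of band values per pixel
    return training_samples, training_labels


# Made-up example: a 2x3 image with 7 bands and two labeled pixels.
bands_data = np.arange(2 * 3 * 7).reshape((2, 3, 7))
labeled_pixels = np.array([[0, 1, 0],
                           [2, 0, 0]])
training_samples, training_labels = extract_samples(bands_data, labeled_pixels)
print(training_samples.shape, training_labels)
```

In this toy case, pixel (0, 1) becomes a sample of class 1 and pixel (1, 0) a sample of class 2, so training_samples has two rows of 7 values each.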

So now we know which pixels of the input image must be used for training. Next, we instantiate a RandomForestClassifier from scikit-learn.

There are many parameters that we can play around with here. I encourage you to read the related documentation and try different possibilities.

Normally, the fine-tuning of these parameters depends on the data, the specific domain of study, memory and processing resources, expected accuracy, etc.

To stay focused and for the sake of simplicity, I’m not going to expand on this issue. As you can see, I’m only passing an option to use all the cores in my computer.
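A minimal sketch of the training step, assuming made-up training data instead of the pixels extracted from the sample image. The two synthetic clusters and the random_state argument are additions for reproducibility of the illustration; n_jobs=-1 is the scikit-learn option that asks for all available cores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: 20 pixels per class in the 7-dimensional band space.
rng = np.random.RandomState(0)
class_1 = 0.1 + 0.05 * rng.rand(20, 7)   # "dark" pixels, labeled 1
class_2 = 0.9 + 0.05 * rng.rand(20, 7)   # "bright" pixels, labeled 2
training_samples = np.vstack([class_1, class_2])
training_labels = np.array([1] * 20 + [2] * 20)

# n_jobs=-1 uses all the cores; random_state is only here to make the
# sketch reproducible and is not required.
classifier = RandomForestClassifier(n_jobs=-1, random_state=0)
classifier.fit(training_samples, training_labels)

print(classifier.classes_)
```

Other parameters, such as n_estimators or max_depth, are left at their defaults here; they are the kind of options worth experimenting with on real data.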

Classifying

And voilà! Believe it or not, that was the hard part. Now we have a trained model that is able to classify (predict) the class of whatever pixel data we have. So let’s do that.

We used the trained object to classify the whole input image. Our classifier is trained on pixels, and its predict function expects a list of pixels, not an NxM matrix. Because of that, we reshaped the band data before and after the classification (so that the output looks like an image and not just a list of multi-dimensional pixels).
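The reshaping described above can be sketched like this. To keep the example focused on the array shapes, it uses BrightnessClassifier, a hypothetical stand-in that only mimics the predict interface of a trained scikit-learn classifier; classify_image is also a name invented for this illustration.

```python
import numpy as np


class BrightnessClassifier:
    """Hypothetical stand-in for a trained classifier (only predict())."""

    def predict(self, samples):
        samples = np.asarray(samples)
        if samples.ndim != 2:
            # Like scikit-learn, accept only (n_samples, n_features) input.
            raise ValueError("expected a 2-D array of pixels")
        # Class 2 for bright pixels, class 1 for the rest.
        return np.where(samples.mean(axis=1) > 0.5, 2, 1)


def classify_image(classifier, bands_data):
    """Classify every pixel of a (rows, cols, n_bands) array."""
    rows, cols, n_bands = bands_data.shape
    # From an image to a flat list of pixels...
    flat_pixels = bands_data.reshape((rows * cols, n_bands))
    result = classifier.predict(flat_pixels)
    # ...and back to an image with one class label per pixel.
    return result.reshape((rows, cols))


# Made-up 2x3 image with 7 bands: bright top row, dark bottom row.
bands_data = np.zeros((2, 3, 7))
bands_data[0, :, :] = 0.9
classification = classify_image(BrightnessClassifier(), bands_data)
print(classification)
```

Passing a trained RandomForestClassifier instead of the stand-in should work the same way, since only its predict method is used.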

At this point, if you have matplotlib installed you can visualize the results:



But more important than merely watching our astounding, brand new classification is to save it to disk and assess its accuracy. So let’s do that.

For the first task, we already created an auxiliary function: write_geotiff (that you had already forgotten about, lost in admiration of the colourful image). We could just save the pixel data using matplotlib.pyplot.imsave, but then we would lose all the valuable geographic information (and other metadata) included in the GeoTiff format. And such information is essential for GIS and other satellite data processing systems. So we’ll use our GDAL-powered function:
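A minimal sketch of what such a write_geotiff helper could look like, assuming a single-band output with an embedded color table. The palette values and the hex_to_rgb helper are hypothetical and only illustrate the idea; the version in the repo may differ.

```python
import numpy as np

# Hypothetical palette: index 0 is "not classified", 1 to 5 are classes A to E.
COLORS = ["#000000", "#FFFF00", "#1CE6FF", "#FF34FF", "#FF4A46", "#008941"]


def hex_to_rgb(color):
    """Convert '#RRGGBB' into an (R, G, B) tuple of integers."""
    return tuple(int(color[i:i + 2], 16) for i in (1, 3, 5))


def write_geotiff(fname, data, geo_transform, projection):
    """Write a 2-D array of class labels as a paletted GeoTiff."""
    from osgeo import gdal  # imported lazily; requires the GDAL bindings

    rows, cols = data.shape
    driver = gdal.GetDriverByName("GTiff")
    dataset = driver.Create(fname, cols, rows, 1, gdal.GDT_Byte)
    # Keep the geographic information of the input image.
    dataset.SetGeoTransform(geo_transform)
    dataset.SetProjection(projection)

    band = dataset.GetRasterBand(1)
    # Embed the colors so that standard image viewers can display the classes.
    color_table = gdal.ColorTable()
    for index, color in enumerate(COLORS):
        color_table.SetColorEntry(index, hex_to_rgb(color) + (255,))
    band.SetColorTable(color_table)

    band.WriteArray(data.astype(np.uint8))
    dataset.FlushCache()  # make sure everything is written to disk
```

A call would then look like write_geotiff("classification.tiff", classification, geo_transform, projection), where the file name is a placeholder and the last two arguments are the values read from the input image.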

That was simple. Now you should be able to open the new file that we just created with any image viewer, GIS or remote sensing data processing system.

Assess the results

We are getting closer to the end. Before we can verify the accuracy of our classification, we need to pre-process our testing dataset in a fashion similar to what we did with the training data:

There we have the expected labels for the verification pixels and the computed labels, so we can analyze the results. For that, our beloved scikit-learn provides many tools. Let’s use two of them:

That should print something like this:

Next, for precision and accuracy:

That should print something like this:
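As a rough illustration of these assessment steps, here is a sketch on made-up labels for three classes. The numbers are not the results obtained with the sample data; the label arrays and class names are assumptions for the example.

```python
import numpy as np
from sklearn import metrics

# Made-up data: expected labels of 8 verification pixels and the labels
# a classifier could have predicted for them.
verification_labels = np.array([1, 1, 1, 2, 2, 3, 3, 3])
predicted_labels = np.array([1, 1, 2, 2, 2, 3, 3, 1])

# Rows are the expected classes, columns are the predicted classes.
confusion = metrics.confusion_matrix(verification_labels, predicted_labels)
print("Confusion matrix:")
print(confusion)

# Precision, recall and F1 score for each class.
report = metrics.classification_report(
    verification_labels, predicted_labels, target_names=["A", "B", "C"])
print(report)

# Fraction of verification pixels that were classified correctly.
accuracy = metrics.accuracy_score(verification_labels, predicted_labels)
print("Classification accuracy: {:.2f}".format(accuracy))
```

In this toy case 6 of the 8 pixels are classified correctly, so the accuracy is 0.75, and the off-diagonal entries of the confusion matrix show which classes were mistaken for which.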

Conclusions

In this post we developed a script that processes raster and vector data, performs a supervised classification using a sophisticated machine learning technique, visualizes the output and assesses the results.

All of this in 100 lines of a general purpose, widely adopted language and using highly efficient tools.

At least for me, that proves the benefits and power of the Python ecosystem for geospatial data processing.

Unfortunately, because of NDAs, I cannot share more specific (and very interesting) real-life examples. Hopefully in the future I’ll be allowed to do so, in order to expand on the advantages and some cool Python tricks useful in this field.
