Last summer I wrote a post

comparing the performance of Numba and Cython

for optimizing array-based computation. Since posting, the page has received thousands of hits,

and resulted in a number of interesting discussions.

But in the meantime, the Numba package has come a long way both in its interface and its

performance.

Here I want to revisit those timing comparisons with a more recent Numba release, using the newer

and more convenient autojit syntax, and also add in a few additional benchmarks for

completeness. I've also written this post entirely within an IPython notebook, so it can be

easily downloaded and modified.

As before, I'll use a pairwise distance function. This will take an array representing

M points in N dimensions, and return the M x M matrix of pairwise distances.

This is a nice test function for a few reasons. First of all, it's a very clean and

well-defined test. Second of all, it illustrates the kind of array-based operation that

is common in statistics, datamining, and machine learning. Third, it is a function that

results in large memory consumption if the standard numpy broadcasting approach is used

(it requires a temporary array containing M * M * N elements), making it a good

candidate for an alternate approach.

We'll start by defining the array which we'll use for the benchmarks: one thousand points in

three dimensions.

In [1]:

Numpy Function With Broadcasting

We'll start with a typical numpy broadcasting approach to this problem. Numpy

broadcasting is an abstraction that allows loops over array indices to be

executed in compiled C. For many applications, this is extremely fast and efficient.

Unfortunately, there is a problem with broadcasting approaches that comes up here:

it ends up allocating hidden temporary arrays which can eat up memory and cause

computational overhead. Nevertheless, it's a good comparison to have. The function

looks like this:

In [2]:

Pure Python Function

A loop-based solution avoids the overhead associated with temporary arrays,

and can be written like this:

In [3]:

As we see, it is over 100 times slower than the numpy broadcasting approach!

This is due to Python's dynamic type checking, which can drastically slow down

nested loops. With these two solutions, we're left with a tradeoff between

efficiency of computation and efficiency of memory usage. This is where tools

like Numba and Cython become vital

I should note that there exist alternative Python interpreters which improve

on the computational inefficiency of the Python run-time, one of which is

the popular PyPy project.

PyPy is extremely interesting. However,

it's currently all but useless for scientific applications,

because it does not support NumPy, and by extension cannot run

code based on SciPy, scikit-learn, matplotlib, or virtually any other

package that makes Python a useful tool for scientific computing.

For that reason, I won't consider PyPy here.

Numba Wrapper

Numba is an LLVM compiler for python code, which

allows code written in Python to be converted to highly efficient compiled code

in real-time. Due to its dependencies, compiling it can be a challenge. To experiment

with Numba, I recommend using a local installation of Anaconda,

the free cross-platform Python distribution which includes Numba and all its prerequisites

within a single easy-to-install package.

Numba is extremely simple to use. We just wrap our python function with autojit (JIT stands

for "just in time" compilation) to automatically create an efficient, compiled version of the function:

In [4]:

Adding this simple expression speeds up our execution by over a factor of over 1400!

For those keeping track, this is about 50% faster than the version of Numba that

I tested last August on the same machine.

Optimized Cython Function

Cython is another package which is built to convert Python-like statemets

into compiled code. The language is actually a superset of Python which acts as a sort of

hybrid between Python and C. By adding type annotations to Python code and running

it through the Cython interpreter, we obtain fast compiled code. Here is a

highly-optimized Cython version of the pairwise distance function, which we compile

using IPython's Cython magic:

In [5]:

In [6]:

In [7]:

The Cython version, despite all the optimization, is a few percent slower than

the result of the simple Numba decorator! I should emphasize here that I have

years of experience with Cython, and in this function I've used every Cython

optimization there is

(if any Cython super-experts are out there and would like to correct me

on that, please let me know in the blog comment thread!) By comparison, the Numba

version is a simple, unadorned wrapper around plainly-written Python code.

Fortran/F2Py

Another option for fast computation is to write a Fortran function directly, and use

the f2py package to interface with the function. We can write the function

as follows:

In [8]:

We can then use the shell interface to compile the Fortran function. In order

to hide the output of this operation, we direct it into /dev/null (note: I

tested this on Linux, and it may have to be modified for Mac or Windows).

In [9]:

We can import the resulting code into Python to time the execution

of the function. To make sure we're being fair, we'll first convert

the test array to Fortran-ordering so that no conversion needs to

happen in the background:

In [10]:

The result is nearly a factor of two slower than the Cython and Numba versions.

Now, I should note here that I am most definitely not an expert on Fortran, so

there may very well be optimizations missing from the above code. If you see any

obvious problems here, please let me know in the blog comments.

Scipy Pairwise Distances

Because pairwise distances are such a commonly used application in scientific

computing, both Scipy and scikit-learn have optimized routines to compute them.

The Scipy version is a Python wrapper of C code, and can be called as follows:

In [11]:

cdist is about 50% slower than Numba.

Scikit-learn Pairwise Distances

Scikit-learn contains the euclidean_distances function, works on sparse

matrices as well as numpy arrays, and is implemented in Cython:

In [12]:

euclidean_distances is several times slower than the Numba pairwise function

on dense arrays.

Comparing the Results

Out of all the above pairwise distance methods, unadorned Numba is the clear winner,

with highly-optimized Cython coming in a close second. Both beat out the other

options by a large amount.

As a summary of the results, we'll create a bar-chart to visualize the timings:

Edit: I changed the "fortran" label to "fortran/f2py" to make clear that this

is not raw Fortran.

In [13]:

In [14]:

Out[14]:

Note that this is log-scaled, so the vertical space between two

grid lines indicates a factor of 10 difference in computation time!

When I compared Cython and Numba last August, I found that Cython was about

30% faster than Numba. Since then, Numba has had a few more releases, and both

the interface and the performance has improved. On

top of being much easier to use (i.e. automatic type inference by autojit)

it's now about 50% faster, and is even a few percent faster than the Cython option.

And though I've seen similar things for months, I'm still incredibly impressed

by the results enabled by Numba: a single function decorator results in a

1300x speedup of simple Python code.

I'm becoming more and more convinced that Numba is

the future of fast scientific computing in Python.

This post was written entirely as an IPython notebook.

The full notebook can be downloaded

here,

or viewed statically on

nbviewer