2015-01-11

Preface
Hi! My name is Alexander Borzunov. This article is a translation of my original publication. I'm sharing my work with the English-speaking community and will be happy to answer any questions.

Suppose we want to calculate the ten-millionth Fibonacci number with a program in Python. The function using the trivial algorithm takes more than 25 minutes to compute it on my PC. But if we apply a special optimizing decorator, the function gives the answer in just 18 seconds (85 times faster):
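The code in question looks roughly like this (a sketch along the lines of the library's README; the trivial loop below is exactly what the decorator optimizes):

```python
from cpmoptimize import cpmoptimize

@cpmoptimize()
def fib(n):
    a, b = 0, 1
    for i in xrange(n):
        a, b = b, a + b
    return a

result = fib(10 ** 7)  # the ten-millionth Fibonacci number
```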



The thing is that before executing a program, the Python interpreter compiles all of it into special bytecode. Using a method I came across recently, the decorator analyzes the function's bytecode and tries to optimize the algorithm used there. You will see that this optimization speeds up the program not by a constant factor, but asymptotically: the bigger the number of iterations in a loop, the more the optimized function (in contrast to the original one) gains.

This article explains when and how the decorator manages to make such an optimization. You will also be able to download and test the cpmoptimize library containing the decorator.

Theory

Some time ago I came across an interesting article describing an interpreter of a limited language that supported the following operations (my reconstruction of the list, based on the rules discussed later in this article):

assigning variables and constants to variables;

adding variables and constants to variables;

multiplying variables by constants;

loops whose number of iterations is known before the loop starts.

Variable values in this language must be numbers, and their size is not limited (arbitrary-precision arithmetic is supported). In practice, the variables are stored as Python integers: Python switches to arbitrary-precision arithmetic automatically when a number goes beyond the hardware-supported 4- or 8-byte types.

Let's first review the case in which arbitrary-precision arithmetic is not involved. The execution time of an operation then does not depend on variable values, so all iterations of the loop take the same time. If we execute such code directly, like regular interpreters do, the loop takes O(n) time, where n is the number of iterations. In other words, if the code for n iterations takes time t, then the code for n² iterations will take time nt (see time complexity).

The interpreter from the original article represents each operation on the variables as a matrix by which the vector of current variable values is multiplied. The multiplication yields a vector of the new variable values:
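For instance (an illustrative reconstruction, since the original article's image is not reproduced here), the Fibonacci step `a, b = b, a + b` corresponds to:

$$
\begin{pmatrix} a' \\ b' \end{pmatrix}
=
\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} b \\ a + b \end{pmatrix}
$$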



Thus, to perform a sequence of operations, we can multiply the vector by the matrix of each subsequent operation in turn. Alternatively, we can first multiply the matrices together and then multiply the initial vector by the resulting product:
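In symbols (my reconstruction of the stripped formula): if the operations correspond to matrices $A_1, A_2, \dots, A_k$, then

$$ v' = A_k (\cdots (A_2 (A_1 v))) = (A_k \cdots A_2 A_1)\, v $$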

To execute the loop, we need n, the number of iterations, and the matrix B, the product of the matrices corresponding to the operations in the loop body. We could multiply the initial vector by B n times in a row. Alternatively, we can raise B to the power n and then multiply the vector of variables by the result:
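That is (again my reconstruction):

$$ v' = \underbrace{B \cdots B}_{n\ \text{times}}\, v = B^n v $$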

Using the binary exponentiation algorithm, we can reduce the runtime of the loop to O(log n). If the code for n iterations runs in time t, then the code for n² iterations runs in time 2t (only two times slower, not n times slower as in the previous case).
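For reference, here is a minimal sketch of binary exponentiation for matrices (my own illustration, not code from the library):

```python
def mat_mul(x, y):
    # Multiply two square matrices given as lists of lists.
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size))
             for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    # Raise matrix m to the power n using O(log n) multiplications.
    size = len(m)
    result = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    while n:
        if n & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        n >>= 1
    return result

print mat_pow([[0, 1], [1, 1]], 10)[0][1]  # 55, the 10th Fibonacci number
```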

As a result, the bigger the number of iterations n is, the greater the speedup of the optimized function. The effect of the optimization is therefore most noticeable for big n.

If arbitrary-precision arithmetic is involved, these time estimates may no longer hold. For example, when computing the nth Fibonacci number, variable values grow from iteration to iteration, and operations on them slow down. Estimating the asymptotic execution time of such a program is more complicated: we would have to account for the length of the numbers in the variables on each iteration, as well as for the number multiplication algorithms used, so we will not do it here. Nevertheless, this optimization improves the asymptotics even in this case, as you can see on the chart below:

The Idea

If, decades ago, a programmer needed to multiply a variable by 4, he would use not the multiplication operator but its faster equivalent, a left bit shift by two bits. Nowadays compilers make such optimizations themselves. In the effort to shorten development time, languages with a higher level of abstraction have appeared, along with new handy technologies and libraries. When writing programs, developers spend most of their time explaining to the computer what the program should do (multiply a number by 4), not how to do it efficiently (use a bit shift). Thus, the task of producing efficient code is partially delegated to compilers and interpreters.

So far, compilers can replace various operations with more efficient ones, precompute expression values, and delete or reorder parts of the code. However, they do not replace a computational algorithm written by a human with an asymptotically more efficient one.

In practice, it's quite problematic to apply the method described in the article mentioned above, as we would have to rewrite part of the code in a special language and, to interact with this code, run an external script with its interpreter. However, the author suggests putting his findings into practice in optimizing compilers. I tried to implement this method for optimizing Python programs in a form suitable for use in real projects.

With my library, it's enough to write a single line of code before the function you want to optimize: applying a special decorator. The decorator optimizes loops when possible and will not "break" the program under any circumstances.

Use Cases

Let's take a look at some examples where the decorator may come in handy.

Case #1. Long Loops

Suppose there is a sequence that follows the rule you can see on the chart, and we need to compute its nth member:

Intuitively, the way the next member of the sequence is produced is quite clear. However, deriving and proving an appropriate mathematical expression takes time. Instead, we can write a trivial algorithm describing our rule and then tell the computer to figure out how to compute it quickly:
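The article's original listing is not reproduced here; a hypothetical stand-in with the same shape (a trivial loop for some made-up linear step rule, wrapped in the decorator) might look like this:

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def compute_sequence(n):
    # Hypothetical linear step rule, standing in for the article's missing one
    prev, cur = 1, 1
    for i in xrange(n):
        prev, cur = cur, 3 * cur - prev + 2
    return cur
```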

It is interesting that if we run this function for n = 10^1000, a loop with 10^1000 iterations will start. Without the optimization, the function could not finish in any conceivable amount of time, but with the decorator it executes in less than a second and returns the correct result.

Case #2. Fibonacci Numbers

To compute the nth Fibonacci number, there is a fast but non-trivial algorithm based on raising a matrix to a power. An experienced engineer can implement it fairly quickly. But if the code doesn't have to be fast (for example, when a Fibonacci number below 10,000 must be computed just once), the developer is likely to save the effort and write the classical solution in a few seconds.

But if the program has to run fast, either we put in the effort and write the complex algorithm, or we use automatic optimization and apply the decorator to the function. If n is big enough, the programs' performance is almost the same in both cases, but in the second case the developer saves his time.

Below are charts and tables that let you compare the running time of the described methods for small and big n.

It should be said that there is one more fast algorithm for computing Fibonacci numbers, known as fast doubling. It beats the previous algorithms by a constant factor, since it performs no unnecessary additions and multiplications of numbers. The charts below include timings for this algorithm as well.
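For reference, a typical fast doubling implementation looks like this (a standard sketch, not code from the article); it relies on the identities F(2k) = F(k)·(2F(k+1) − F(k)) and F(2k+1) = F(k)² + F(k+1)²:

```python
def fib_fast_doubling(n):
    # Returns the nth Fibonacci number using the fast doubling identities.
    def helper(n):
        # Returns the pair (F(n), F(n + 1)).
        if n == 0:
            return 0, 1
        a, b = helper(n >> 1)
        c = a * ((b << 1) - a)   # F(2k) = F(k) * (2 * F(k+1) - F(k))
        d = a * a + b * b        # F(2k+1) = F(k)^2 + F(k+1)^2
        if n & 1:
            return d, c + d
        return c, d
    return helper(n)[0]
```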

Time measurement results:

In practice, we may face a more complicated situation if, instead of the well-known recurrence

F(n) = F(n-1) + F(n-2),

a sequence of numbers is defined by a more complex relation, like this:

In that case we won't find a ready implementation of the algorithms mentioned above on the Internet, and deriving a matrix exponentiation algorithm by hand would take plenty of time. However, we can write a simple brute-force solution and apply the decorator, and the computer will derive the fast algorithm for us:
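Since the article's formula is not reproduced here, take some hypothetical recurrence, say F(1) = F(2) = 1 and F(n) = 3·F(n-1) + 2·F(n-2) + 7, as a placeholder; the brute-force solution follows the same pattern either way:

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def f(n):
    # Hypothetical recurrence: F(1) = F(2) = 1, F(n) = 3*F(n-1) + 2*F(n-2) + 7
    prev, cur = 1, 1
    for i in xrange(n - 2):
        prev, cur = cur, 3 * cur + 2 * prev + 7
    return cur
```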

Case #3. The Geometric Progression Sum

Suppose we need to compute the sum of the first count members of a geometric progression whose first member is start and whose common ratio is coeff. The decorator can optimize our trivial solution:
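A trivial solution of roughly this shape (a sketch; the article's exact listing is not reproduced here):

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def geom_sum(start, coeff, count):
    res = 0
    elem = start
    for i in xrange(count):
        res += elem      # accumulate the current member
        elem *= coeff    # multiply by the predictable common ratio
    return res
```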

Case #4. Raising to Power

Suppose we need to raise a number to a non-negative integer power n. In this case, the decorator effectively replaces the trivial solution with the binary exponentiation algorithm.
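The trivial solution could look like this (again a sketch, not the article's original listing); the factor a does not change between iterations, so the multiplication is optimizable:

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def power(a, n):
    res = 1
    for i in xrange(n):
        res *= a   # 'a' is predictable, so the loop can be optimized
    return res
```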

How to Install

You can use pip to install the library:
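The package name on PyPI is the same as the library name:

```bash
pip install cpmoptimize
```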

The source code is available on GitHub under the free MIT license.

The library is also available on the Python Package Index.

Examples of applying the optimization can be found in the "tests/" folder. Each example consists of a function with a brute-force solution, its optimized variant, and functions with complex but fast solutions of the same task. These functions are wrapped in scripts that run them on various groups of tests, measure their execution time, and build the corresponding charts. If matplotlib is installed, the scripts will also plot execution time against input size (on regular or logarithmic scales). All the plots are saved to the "tests/plots/" folder.

The Library Description

When creating the library, I relied on the fact that the bytecode produced by the Python interpreter can be analyzed and modified at runtime without touching the interpreter itself, which provides huge opportunities for extending the language. The advantages of the described optimization also show themselves mostly with arbitrary-precision arithmetic, which is likewise built into Python.

That's why implementing the described optimization in Python was the more interesting task for me. Applying it in C++ compilers would certainly allow creating even faster programs. Besides, Python code optimized by the decorator usually beats brute-force C++ code thanks to its better asymptotic complexity.

To modify the bytecode I used the byteplay module, which provides a convenient interface for disassembling and assembling bytecode. Unfortunately, the module has not been ported to Python 3, which is why I implemented the whole project in Python 2.7.

The name of the cpmoptimize library stands for "compute the power of a matrix and optimize". I will not review the bytecode analysis in detail, but I will describe the capabilities the library provides and the principles its operation is based on.

Own xrange

cpmoptimize.xrange(stop)

cpmoptimize.xrange(start, stop[, step])

For performance reasons, the built-in xrange type in Python 2 does not accept integers of type long as arguments:
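For example (typical CPython 2.7 behavior; the exact message may vary by platform):

```python
>>> xrange(10 ** 1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
```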

Since loops with around 10^1000 iterations can now execute in less than a second and become a regular thing in a program, the library provides its own implementation of the xrange type in pure Python (internally called CPMRange). It supports all the features of the regular xrange and, in addition, arguments of type long. Such code will not lead to errors and will compute the correct result very fast:
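A sketch of such code (the loop body is a placeholder of mine):

```python
from cpmoptimize import cpmoptimize, xrange  # shadows the built-in xrange

@cpmoptimize()
def f(n):
    res = 0
    for i in xrange(n):
        res += 3
    return res

print f(10 ** 1000)  # finishes in well under a second
```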

However, if the number of iterations is not big, you can keep using the built-in xrange together with the decorator.

cpmoptimize Function

cpmoptimize.cpmoptimize(strict=False, iters_limit=5000, types=(int, long), opt_min_rows=True, opt_clear_stack=True)

Applying the Decorator

The cpmoptimize function accepts the optimization parameters and returns a decorator that takes them into account; this decorator is then applied to the function being optimized. It can be applied with the special "at" syntax, or without it:
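A sketch matching the description below (the loop bodies are placeholders of mine):

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def f(n):
    res = 0
    for i in xrange(n):
        res += 2
    return res

def naive_g(n):
    res = 0
    for i in xrange(n):
        res += 3
    return res

smart_g = cpmoptimize()(naive_g)  # applying the decorator without the "at" syntax
```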

Before the bytecode is modified, the original function is copied. Thus, in the code above, the f and smart_g functions are optimized while naive_g is not.

Searching for for Loops

The decorator looks for standard for loops in the bytecode; while loops and for loops inside generator expressions and list comprehensions are ignored. Nested loops are not supported for now (only the innermost loop is optimized). for-else constructs are handled correctly.

Valid Types

The decorator cannot optimize a loop without regard to the types of the variables used in it. For example, Python allows you to define a type that writes a string to a file on every multiplication. As a result of the optimization, the number of multiplications performed by the function could change, the number of lines in the file would then differ, and the optimization would "break" the program.

Besides, during matrix operations variables are implicitly added and multiplied in combinations the original code never performs, which is why variable types can "get mixed up". If one of the constants or variables were of type float, variables that should have type int after executing the code could become float as well (adding an int and a float returns a float). Such behavior could cause errors in the program and is unacceptable.

To avoid unwanted side effects, the decorator optimizes a loop only if all constants and variables used in it have an acceptable type. Initially, the types int and long are valid, as well as types inherited from them. If one of the constants or variables has type long, variables that would have type int after executing the original code may end up as long instead. But since int and long are largely interchangeable in Python, this should not lead to errors.

If you want to change the set of acceptable types (for example, to support float), pass a tuple of the required types in the types parameter. The types from this tuple, and the types inherited from them, will be treated as acceptable. Internally, the check whether a value has an acceptable type is simply the expression isinstance(value, types).

Loop Optimization Conditions

Each loop found must meet a number of conditions. Some of them are checked right away, when the decorator is applied:

The loop body must contain only assignments and unary and binary operations, possibly combined into complex expressions. It cannot contain conditions, calls of other functions, return or yield statements, and so on.

Some operands must satisfy a predictability condition: their value must be the same on every iteration and must not depend on the result of any calculations from the previous iteration. More specifically (an illustration follows this list):

All operands of addition and subtraction, as well as the operand of unary minus, may be unpredictable.

At least one operand of a multiplication must be predictable (similar to how the original interpreter supported only multiplication by a constant).

All operands of exponentiation, division, modulo, and bitwise operations must be predictable.

All the constants used in the loop must be of an acceptable type.
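A hypothetical illustration of the predictability rules (my example, not the article's):

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def good(n, k):
    # 'k' keeps the same value on every iteration, so it is predictable:
    # multiplying by it is allowed.
    res = 0
    for i in xrange(n):
        res = res * k + 1
    return res

@cpmoptimize()
def bad(n):
    # Both 'a' and 'b' change between iterations, so 'a * b' multiplies
    # two unpredictable operands and the loop stays unoptimized.
    a, b = 1, 2
    res = 0
    for i in xrange(n):
        res += a * b
        a += 1
        b += 1
    return res
```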

If all these conditions are met, a "hook" is installed before the loop to check the remaining conditions; the original bytecode of the loop is not deleted. Since Python uses dynamic typing, the following conditions can only be checked right before the loop starts:

The object the loop iterates over must be of the standard xrange type or of its alternative from the library, cpmoptimize.xrange. The slow range function, which returns a list, is not supported.

The values of all the variables used in the loop must have a valid type.

If the "hook" concludes that the optimization is possible, the necessary matrices are computed and the values of the variables involved are replaced with the new ones. If the optimization is not possible, the original bytecode of the loop is executed instead.

Expressions. Moving the Code Outside of the Loop

Although the described method supports neither exponentiation nor "bitwise AND", the following code will still be optimized:
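A sketch reconstructed from the description below (the expression (k ** m) & 676 is from the text; the rest of the function is an assumption of mine):

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def f(n, k, m):
    res = 0
    for i in xrange(n):
        res += (k ** m) & 676
    return res
```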

When analyzing the bytecode, the decorator concludes that the values of the variables k and m in the expression (k ** m) & 676 do not depend on the loop iteration in which they are used, which means the value of the whole expression does not change either. There is no need to recompute it on every iteration; it is enough to compute it once.

Thus, we can move the corresponding bytecode instructions out of the loop and execute them right before it starts, substituting the ready value into the loop. The optimized version of the code could look like this:
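A sketch of the result (the name _const1 comes from the text below; the rest matches my example above):

```python
def f(n, k, m):
    _const1 = (k ** m) & 676   # the invariant expression, computed once
    res = 0
    for i in xrange(n):
        res += _const1
    return res
```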

This technique is known as loop-invariant code motion. Note that the moved-out values are recomputed on each call of the function, as they may depend on global variables or function parameters (as _const1 depends on the parameter k in the example above). The resulting code is now easy to optimize with matrices.

It is at this stage that the predictability check described above is carried out.

For instance, if one of the "bitwise AND" operands had turned out to be unpredictable, we could not have moved this operation out of the loop, and therefore could not have applied the optimization.

The decorator also partially supports multiple assignments. If they contain only a few elements, Python generates bytecode for them without using tuples, and such bytecode is supported by the decorator:
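An illustration of mine (not the article's listing):

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize()
def f(n):
    a, b = 0, 1
    for i in xrange(n):
        # With two targets, CPython 2 compiles this to a ROT_TWO
        # instruction instead of building a tuple.
        a, b = b, a + b
    return a
```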

iters_limit

The chart above shows that with a small number of iterations an optimized loop may run slower than the ordinary one, since constructing the matrices and checking the types still takes some time. When the function must work as fast as possible even with few iterations, a limit can be set via the iters_limit parameter. Before starting the optimization, the "hook" then checks how many iterations the loop is about to make and cancels the optimization if that number does not exceed the given threshold.

The default limit is 5,000 iterations; it cannot be set below 2.

Clearly, the most advantageous limit value is the intersection point of the lines on the chart corresponding to the running times of the optimized and the naive variants of the function. There, if we set the limit to 10,000, the function will be able to choose the more advantageous algorithm (optimized or naive) in each case:
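Setting the limit is just a matter of passing the parameter (a usage sketch; the loop body is a placeholder):

```python
from cpmoptimize import cpmoptimize, xrange

@cpmoptimize(iters_limit=10000)
def f(n):
    res = 0
    for i in xrange(n):
        res += 5
    return res
```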

strict Flag

In some cases the optimization is strictly required: if a loop makes 10^1000 iterations, a program without the optimization would simply hang. For such cases there is the strict parameter. Its default value is False; if set to True, an exception is raised whenever any standard for loop has not been optimized.

If the inapplicability of the optimization is detected when the decorator is applied, a cpmoptimize.recompiler.RecompilationError exception is thrown. In the following example, two unpredictable variables are multiplied:
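A hypothetical example of this kind (not the article's original listing):

```python
from cpmoptimize import cpmoptimize, xrange

# Raises cpmoptimize.recompiler.RecompilationError when the decorator is
# applied: 'res * i' multiplies two unpredictable values.
@cpmoptimize(strict=True)
def f(n):
    res = 1
    for i in xrange(n):
        res = res * i
    return res
```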

If the inapplicability of the optimization is only detected right before the loop starts (that is, the types of the iterator or of the variables are not supported), a TypeError exception is raised:
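Again a hypothetical illustration of mine:

```python
from cpmoptimize import cpmoptimize

@cpmoptimize(strict=True)
def f(iterable):
    res = 0
    for elem in iterable:
        res += elem
    return res

f(range(1000))  # range returns a list, so the "hook" raises TypeError
```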

Additional Optimization Options

The opt_min_rows and opt_clear_stack flags control additional optimizations applied when constructing the matrices. They are enabled by default; the option to disable them exists only for demonstration purposes, so that the efficiency of these methods can be compared.

When the optimized code runs, most of the time is spent multiplying the long numbers in some cells of the generated matrices. Compared to this, the time spent multiplying all the other cells is negligible.

When recompiling expressions, the decorator creates intermediate, virtual variables corresponding to positions on the interpreter stack. After complex calculations they may hold long numbers that are already stored in other, real variables. There is no need to keep these values twice, yet they remain in cells of the matrix and significantly slow the program down, because the long numbers must be multiplied on every matrix multiplication. The opt_clear_stack option clears such variables: if zero is assigned to them after use, no long values linger in them.

This option is especially important when the program works with really big numbers: eliminating the redundant multiplications speeds it up by more than a factor of two.

The opt_min_rows option minimizes the size of the square matrices being multiplied. It excludes the rows corresponding to unused and unpredictable variables. If the loop variable itself is not used, the rows responsible for maintaining its correct value are excluded too. So as not to affect program behavior, all these variables are assigned their correct final values after the loop. The original article also appended an element equal to one to the vector of variables; if the row corresponding to this element turns out to be unused, it is excluded as well.

This option gives some speedup for moderately large n, when the numbers have not yet grown really long and the matrix size still contributes substantially to the running time.

If both options are applied together, the virtual variables acquire a predictable (zero) value and opt_min_rows works even more effectively. In other words, the two options combined are more effective than each of them taken separately.

The chart below shows the running time of computing Fibonacci numbers with various options disabled:

-m means that the opt_min_rows option is disabled;

-c means that the opt_clear_stack option is disabled;

-mc means that both opt_min_rows and opt_clear_stack options are disabled.

What Else Can We Implement?

Changing the Set of Operations

Let's say that our method supports a pair of operations (⊕, ⊗) if:

we can perform the ⊕ operation either on two variables, or on a variable and a constant;

we can perform the ⊗ operation on a variable and a constant.

We already know that the method supports the pair of operations (+, ×). Indeed:

we can add either two variables, or a variable and a constant;

we can multiply a variable by a constant.

For the time being, exactly these operations are implemented in the optimizing decorator. But it turns out that the described method also supports other pairs of operations, including (×, **), the pair of multiplication and exponentiation.

Suppose our task is to find the nth number in a sequence defined by a recurrence relation:

If our decorator supported the pair of operations (×, **), we could multiply a variable by another variable (though we could not add variables). In that case the decorator could optimize the following trivial solution of the task:
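The article's recurrence is not reproduced here, so take a hypothetical one, F(1) = 1, F(2) = 2, F(n) = F(n-1) * F(n-2), for illustration:

```python
# Not decorated: the current decorator does not support the (×, **) pair yet.
def f(n):
    a, b = 1, 2
    for i in xrange(n - 2):
        a, b = b, a * b   # only variable-by-variable multiplication is used
    return b
```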

We can prove that our method also supports the pairs of operations (min, +), (max, +), (and, and), and (or, or) (here and is the bitwise AND, or is the bitwise OR). For positive numbers, the pairs (min, ×) and (max, ×) work as well.

To support the pair (×, **), we can operate not on the values of the variables but on their logarithms. Multiplication of variables is then replaced by addition of logarithms, and raising a variable to a constant power is replaced by multiplying a logarithm by a constant. This reduces the task to the already implemented case of (+, ×) operations.

Implementing support for the other pairs of operations mentioned involves a fair amount of math and deserves a separate article. For now, let's just note that pairs of operations are interchangeable to some extent.

Nested Loops

By refining the loop analysis in the bytecode, we could implement support for nested loops and optimize code like this:
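An illustrative shape of such code (my example):

```python
def f(n, m):
    res = 0
    for i in xrange(n):
        for j in xrange(m):
            res += 3
    return res
```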

Predictable Conditions

In the following code, the condition in the loop body is predictable. We could check whether it is true before the loop begins, remove the unreachable branch, and then optimize the code:
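An illustrative example of mine: 'flag' never changes inside the loop, so the branch could be resolved before the loop starts and the dead branch removed.

```python
def f(n, flag):
    res = 0
    for i in xrange(n):
        if flag:   # predictable: does not change inside the loop
            res += 7
        else:
            res += 9
    return res
```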

Summary

Out of the box, the Python interpreter produces straightforward bytecode that closely mirrors the source code you wrote. Normally it performs neither function inlining nor loop unrolling, nor any other optimization that requires analyzing the program's behavior. Only since version 2.6 has CPython been able to fold constant arithmetic expressions, and even that does not always work well.

There are a few reasons for that. First, code analysis in the general case is complicated in Python: data types can only be traced at runtime (which is exactly what we do in our case). Second, optimizations would still not let Python reach the speed of compiled languages, so when a program needs to run really fast, the usual advice is to write it in a lower-level language.

However, Python is a flexible language that allows many optimization techniques to be implemented without touching the interpreter. This library, along with the other projects listed below, demonstrates that ability.

Useful Links

Here are some interesting projects related to various optimizations in Python:

PyPy is an alternative Python interpreter with JIT compilation. As a rule, it executes Python programs faster. PyPy is written in the RPython language.

Pyston is a new alternative Python interpreter. It translates code into the LLVM intermediate representation, from which it can be optimized by LLVM and executed with JIT compilation.

Nuitka is a Python compiler. In contrast to the py2exe converter, which creates an *.exe file containing the script, the interpreter, and the necessary libraries, Nuitka actually compiles Python programs into native executables.

WPython is a modified Python 2.6.4 interpreter with a more efficient bytecode model, a smarter constant folding system, and an improved virtual machine.

astoptimizer is a library that applies known optimizations before compilation to bytecode by analyzing and modifying the abstract syntax tree.

promise is a library that optimizes bytecode relying on so-called "promises". If the programmer promises not to perform certain kinds of operations in his code, optimizations become applicable that could not be applied in the general case.

foldy.py is a library that analyzes bytecode and performs constant folding, i.e. replaces constant expressions with their values, and also deletes unused functions.

The Decorator analyzes bytecode and performs constant binding, i.e. replaces lookups of constant global variables by name with direct loads of the corresponding constants.

NumPy is a library (with Python modules written in C) for fast computations with large arrays and matrices; it also offers a vast set of high-level mathematical functions.

Numba is a library that speeds up programs containing mathematical calculations and array operations. The optimization is done via JIT compilation (using LLVM), which converts the code to native CPU and GPU instructions.

Numexpr is a library that speeds up the evaluation of mathematical expressions by analyzing and modifying the corresponding bytecode. Expressions can be split into several parts (some of which may be recomputed less often than others), executed in parallel, and so on.

Cython is an optimizing compiler for a superset of the Python language that allows static typing and close interaction with C/C++ code.

An article about libraries that can accelerate various calculations in Python via JIT compilation.

shedskin is an experimental compiler that can translate pure but implicitly statically typed Python (2.4-2.6) programs into optimized C++.

hope is a Python JIT compiler for astrophysical computations.

pythran is a Python-to-C++ compiler meant to efficiently compile scientific programs; it takes advantage of multiple cores and SIMD instruction units.
