2013-09-24

In my last post, I mentioned the tools I used for profiling NumPy. The call-graph made my life not easier, but a bit simpler, when it came to spotting time-consuming methods. Time taken, however, only indicates the possibility of a bottleneck, since many of the time-consuming methods are already highly optimized. Hence, critically reading the code alongside the call-graph is the trick.

Identifying bottlenecks

The following types of bottlenecks were observed.

Overhead for smaller arrays

NumPy used to release the Global Interpreter Lock (GIL) for all one- and two-operand loops. But for short arrays, this produced relative overhead instead of a benefit: releasing the GIL for small operations was not paying off at all.

Redundant code doing extra work

PyArray_CanCastArrayTo used to run the cast check even if newtype was NULL. In PyArray_FromArray, it turns out that if the newtype argument is NULL, it is given the value of oldtype. So the function ends up checking whether an array can be cast to its own type, which is useless work.

NumPy used to convert a Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extract the C value from that NumPy scalar.

The error flags were cleared at every function call before being checked. This happened unconditionally, even when there was no need for it.

Improper structure

For every single operation call, NumPy has to extract the values of buffersize, errormask, and the error-handler name in order to pack and build the error object. The two functions involved, _extract_pyvals and PyUFunc_GetPyValues, together take significant time, all of it spent looking up entries in a Python dict, extracting them, and converting them into C-level data. And not just once: this happens again and again on every operation, with the result going unused if no error occurs.

The loop selection method for scalar operations is inefficient and time-consuming: it checks all the dtypes associated with a function, one by one, through its types array.

Not so effective memory allocation

When allocating the return value, NumPy allocates memory twice: once for the array object itself, and a second time for the array data.

Also, the shapes and strides together have 2*ndim elements, but to hold them NumPy allocates a memory region sized for 3*ndim elements.
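The over-allocation is easy to quantify. A toy calculation (assuming 8-byte pointer-sized entries; the function name is mine, not NumPy's):

```python
def dims_strides_region(ndim, entry_size=8):
    """Toy calculation: the shape and strides arrays need ndim entries
    each, i.e. 2*ndim in total, but the old allocation reserved room
    for 3*ndim entries (assuming 8-byte entries here)."""
    needed = 2 * ndim * entry_size
    allocated = 3 * ndim * entry_size
    return needed, allocated

# For a 3-d array: 48 bytes needed, 72 allocated, so a third is wasted.
needed, allocated = dims_strides_region(3)
```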

Removing bottlenecks

I really appreciate the effort and mentoring of Charles Harris, which provided the leads to remove the bottlenecks mentioned above, making NumPy a bit quicker for small arrays too.

Removing known overhead for smaller arrays

Avoiding GIL release for smaller arrays

NPY_BEGIN_THREADS_THRESHOLDED(loop_size), a variant of the NPY_BEGIN_THREADS macro, was created in ndarraytypes.h with a threshold. The threshold was conservatively set to 500, and the loss from guessing too low is negligible.
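The idea behind the macro can be sketched like this (a toy Python model of the C logic; the helper name is invented, the 500 cutoff is the one from the patch):

```python
GIL_RELEASE_THRESHOLD = 500  # conservative cutoff used in the patch

def should_release_gil(loop_size):
    """Release the GIL only when the loop is long enough that the cost
    of releasing and re-acquiring the lock is amortized; for smaller
    loops the lock round-trip itself would dominate."""
    return loop_size > GIL_RELEASE_THRESHOLD
```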

Avoiding extra work, removing redundancy

In PyArray_CanCastArrayTo, it is better to bypass the check when newtype is NULL and the flags are 0, so that improvement was made.
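A hedged sketch of the guard, written as Python pseudocode rather than the actual C (argument names and the fall-through check are mine):

```python
def can_cast_array_to(arr_dtype, newtype, flags=0):
    """Sketch of the bypass: when newtype is None (NULL in C),
    PyArray_FromArray falls back to the array's own dtype, so the
    cast is trivially allowed and the full check can be skipped."""
    if newtype is None and flags == 0:
        return True
    # Otherwise fall through to the real casting rules; a same-type
    # comparison stands in for the full check here.
    return newtype == arr_dtype
```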

Avoiding conversion of known dtypes to NumPy scalars

When trying to extract the underlying C value from a Python integer, NumPy converts the Python scalar into its matching NumPy scalar (e.g. PyLong -> int32) and then extracts the C value from the NumPy scalar. Raul Cota optimized this for floats by avoiding the conversion. I extended it to integers, which was a bit messy because different OSes handle int differently. I avoided the conversion for known integer types and extracted the value directly, for byte, short, int, and long.
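In Python terms, the fast path amounts to something like this (a toy model only; the real change is in C and is done per integer width, and the conversion callable here stands in for the NumPy-scalar round trip):

```python
def as_c_long(obj, numpy_convert):
    """Toy model of the fast path: a plain Python int yields its value
    directly, while anything else still goes through the NumPy-scalar
    conversion (passed in as a callable to keep the sketch
    dependency-free)."""
    if isinstance(obj, int) and not isinstance(obj, bool):
        return obj                  # fast path: direct extraction
    return int(numpy_convert(obj))  # slow path: convert, then extract
```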

Avoiding the heavy fp clear-flags call

Nathaniel pointed out something mysterious about the floating-point clear-error-flags behaviour.

UFUNC_CHECK_STATUS is a single macro that does both the checking and the clearing of the error flags: it clears them every time after checking. The clear operation is relatively expensive and takes a significant amount of time, so it should be skipped when it is not needed. I changed the code to do exactly that: before each ufunc loop, PyUFunc_clearfperr() now checks the error flags first and clears them only if necessary. The result of the check in the macro no longer gets ignored, unlike before.
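A toy model of the check-before-clear control flow (not NumPy's actual API; the flag set stands in for the hardware floating-point status word):

```python
fp_flags = set()  # stands in for the floating-point status word

def check_fp_error():
    """Check the flags first; clear them only when something is set,
    since the clear is the expensive operation."""
    if not fp_flags:
        return set()           # common case: nothing set, nothing cleared
    errors = set(fp_flags)
    fp_flags.clear()           # pay the clearing cost only on this path
    return errors
```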

Refactoring

Deferred the creation of the errobject to the moment an error actually occurs. Previously, a huge amount of time was consumed by PyUFunc_GetPyValues building the error object on every call.
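The deferral can be modeled like this (names invented; the real code is C, but the shape of the optimization is the same: loops that never raise never pay for the dict lookups):

```python
class DeferredErrObj:
    """Build the (buffersize, errmask, errobj) data only on first use."""

    def __init__(self, extract):
        self._extract = extract  # the expensive _extract_pyvals-style work
        self._value = None

    def get(self):
        # The costly extraction runs at most once, and only if an
        # error path actually asks for the error object.
        if self._value is None:
            self._value = self._extract()
        return self._value
```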

Replacing the loop with specialized conditions. Most of the functions share identical signatures; for example, the sets (add, subtract) and (arccos, arcsin, arctan, arcsinh, arccosh) each share the same signature array. It turns out there are only 32 distinct signature arrays, so I wrote a code generator to identify them and emit a specialized condition for each distinct signature array.
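The shape of the change can be sketched as follows (toy data; the generator emitted specialized C conditions, which I model here as a direct lookup instead of a scan):

```python
# Old shape: scan each registered signature in turn.
def select_loop_scan(types, dtype):
    for index, signature in enumerate(types):
        if signature[0] == dtype:
            return index
    return -1

# New shape: functions sharing one of the ~32 distinct signature arrays
# get a direct test (modeled as a precomputed lookup table).
def select_loop_specialized(table, dtype):
    return table.get(dtype, -1)
```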

I also tried to merge the two memory allocations for the return value into one, but that would take a huge amount of refactoring, so it moved out of scope for GSoC.

Something Interesting...

I came to know about tinyarray, which provides an array similar to NumPy's but optimized for small sizes. Christoph Groth, who developed it, used a more compact object: for example, the array object is a PyVarObject rather than a PyObject, and the array data is held in the same structure as the Python object itself, so only a single memory allocation is done per array. This could be a good starting point for anyone who wants to optimize memory allocation.

And more...

It is not possible to write about all the coding and pull requests in this post, but all those modifications are documented in a Google doc: http://goo.gl/M3vrnz
