tl;dr – summary benchmark chart (note shorter bar is better)

So I started looking at the potential to use OpenMP to speed up various Cythonized operations in pandas. I picked the easiest starting point, which was using `cython.parallel.prange`.

The test case I used was a 10000-by-100 matrix.

Let’s use `numpy.corrcoef` to establish a baseline:

In [2]: arr = np.array(randn(100, 10000))

In [3]: %timeit np.corrcoef(arr)

10 loops, best of 3: 24.6 ms per loop

Order doesn’t matter much for `np.corrcoef`:

In [4]: arr = np.array(randn(100, 10000), order='F')

In [5]: %timeit np.corrcoef(arr)

10 loops, best of 3: 28 ms per loop

This represents the “best-case” scenario. With DataFrame, the correlation matrix computation needs to be done pair-wise by column so missing data can be handled. As a result, it is significantly slower than `np.corrcoef`.
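To make the NA-handling cost concrete, here is a simplified pure-NumPy sketch of what a pair-wise, NA-aware correlation has to do (the real pandas implementation is in Cython; the function name `nancorr_pairwise` is mine, not the pandas API):

```python
import numpy as np

def nancorr_pairwise(arr):
    """Pairwise Pearson correlation that skips NaNs, column by column.

    A simplified sketch of the pair-wise approach; not the actual
    pandas Cython code.
    """
    K = arr.shape[1]
    out = np.empty((K, K))
    for i in range(K):
        for j in range(i, K):
            # Only use rows where both columns are observed.
            mask = ~(np.isnan(arr[:, i]) | np.isnan(arr[:, j]))
            x, y = arr[mask, i], arr[mask, j]
            if len(x) < 2:
                c = np.nan
            else:
                c = np.corrcoef(x, y)[0, 1]
            out[i, j] = out[j, i] = c
    return out
```

Because the valid mask differs for every column pair, there is no way to hand the whole matrix to a single vectorized BLAS call the way `np.corrcoef` does, which is where the slowdown comes from.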

Let’s see just how much NA-handling costs:

In [6]: arr = randn(10000, 100)

In [7]: df = DataFrame(arr)

In [8]: %timeit df.corr()

1 loops, best of 3: 644 ms per loop

Because this computation is by column, the order of the underlying data matters. Here’s the runtime of DataFrame.corr using a Fortran-order ndarray:

In [9]: df = DataFrame(np.array(arr, order='F'))

In [10]: %timeit df.corr()

1 loops, best of 3: 231 ms per loop

The speed-up here is *very* significant.
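The reason is memory layout: since the computation walks down columns, each column of an F-ordered array is one contiguous block, while in a C-ordered array consecutive column elements are a full row apart. A quick way to see this with NumPy strides:

```python
import numpy as np

arr_c = np.zeros((10000, 100), order='C')
arr_f = np.zeros((10000, 100), order='F')

# C order: moving down a column jumps 100 float64s (800 bytes) per step.
print(arr_c.strides)                       # (800, 8)
# F order: moving down a column steps 8 bytes -- fully contiguous.
print(arr_f.strides)                       # (8, 80000)

print(arr_c[:, 0].flags['C_CONTIGUOUS'])   # False
print(arr_f[:, 0].flags['C_CONTIGUOUS'])   # True
```

The strided column access in the C-ordered case means far more cache misses per column pass, which is consistent with the ~2.8x gap between the 644 ms and 231 ms timings above.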

Up until this point, no OpenMP optimizations have been performed. Next, I modified the code in the `nancorr` function to use `prange` instead of `range`. I configured it to use 8 threads (I’m on a Core i7 Ivy Bridge) with dynamic scheduling.

Using prange in the nancorr function in pandas/src/moments.pyx:

`for xi in parallel.prange(K, nogil=True, num_threads=8, schedule='dynamic'):`

Now let’s see a performance comparison:

In [2]: arr = randn(10000, 100)

In [3]: df = DataFrame(arr)

In [4]: %timeit df.corr()

1 loops, best of 3: 373 ms per loop

This was roughly a 42% speed-up from the C-order single-threaded case. I was basing my expectations on Mike Müller’s talk on Cython/OpenMP at EuroSciPy 2012, so this didn’t surprise me. **Notice that the F-order single-threaded case is still significantly faster.**

We can achieve a proportionate speed-up on the F-order array using the multithreaded version of *nancorr*:

In [5]: df = DataFrame(np.array(arr, order='F'))

In [6]: %timeit df.corr()

10 loops, best of 3: 122 ms per loop

Eight threads with dynamic scheduling was the best configuration out of the set that I tried. I didn’t get a chance to flatten the nested for-loop to create equal run-times for each thread, but with that change, static scheduling might be optimal.

With F-order and optimally configured OpenMP, we’ve closed the performance gap with NA-unaware np.corrcoef from *~26x* to *~5x*. The main contributor to this speed-up is having correctly ordered arrays, but in both C and Fortran order, using *prange* represented a ~42% speed-up over the single-threaded case.

Hardware – 2012 MacBook Pro, 2.3 GHz Core i7, OS X 10.8.2, 16 GB RAM, SSD