Initial Experiments with OpenMP

tl;dr  - (note shorter bar is better)

So I started looking at potential to use OpenMP to speed up various Cythonized operations in pandas. I picked the easiest starting point which was using cython.parallel.prange. The test case I used was a 10000-by-100 matrix. So I started looking at potential to use OpenMP to speed up various Cythonized operations in pandas. I picked the easiest starting point which was using cython.parallel.prange.

The test case I used was a 10000-by-100 matrix.

Let’s use numpy.corrcoef to establish a baseline:

In [2]: arr = np.array(randn(100, 10000))

In [3]: %timeit np.corrcoef(arr)
10 loops, best of 3: 24.6 ms per loop

Order doesn’t matter much for np.corrcoef:

In [4]: arr = np.array(randn(100, 10000), order=’F’)

In [5]: %timeit np.corrcoef(arr)
10 loops, best of 3: 28 ms per loop

This represents the “best-case” scenario. With DataFrame, the correlation matrix computation needs to be done pair-wise by column so missing data can be handled. As a result, it is significantly slower than np.corrcoef.

Let’s see just how much NA-handling costs:

In [6]: arr = randn(10000, 100)

In [7]: df = DataFrame(arr)

In [8]: %timeit df.corr()
1 loops, best of 3: 644 ms per loop

Because this computation is by column, the order of the underlying data matters. Here’s the runtime of DataFrame.corr using a Fortran-order ndarray:

In [9]: df = DataFrame(np.array(arr, order=’Fortran’))

In [10]: %timeit df.corr()
1 loops, best of 3: 231 ms per loop

The speed-up here is very significant.

Up until this point, no OpenMP optimizations have been performed. Next, I modified the code in the nancorr function to use prange instead of range. I configured it to use 8 threads (I’m on a core-i7 ivy-bridge) with dynamic scheduling.

Using prange in the nancorr function in pandas/src/moments.pyx:

for xi in parallel.prange(K, nogil=True, num_threads=8, schedule='dynamic'):

Now let’s see a performance comparison:

In [2]: arr = randn(10000, 100)

In [3]: df = DataFrame(arr)

In [4]: %timeit df.corr()
1 loops, best of 3: 373 ms per loop

This was roughly a 42% speed-up from the C-order single-threaded case. I was basing my expectations on Mike Müller‘s talk on Cython/OpenMP at EuroScipy 2012 so this didn’t surprised me. Notice that the F-order single-threaded case is still significantly faster.

We can achieve a proportionate speed-up on the F-order array using the multithreaded version of nancorr:

In [5]: df = DataFrame(np.array(arr, order=’F’))

In [6]: %timeit df.corr()
10 loops, best of 3: 122 ms per loop

This 8-threaded with dynamic scheduling was the best configuration out of the set that I tried. I didn’t get a chance to flatten the nested for-loop to create equal run-times for each thread, but maybe static scheduling with that config would be more optimal.

With F-order and optimally configured OpenMP, we’ve closed the performance gap with NA-unaware np.corrcoef from ~26x to ~5x. The main contributor to this speed-up is having correctly ordered arrays, but on both C and Fortran orders, using prange represented a ~42% speed-up from the single-threaded case.

Hardware – 2012 Macbook Pro 2.3 GHz core-i7, osx 10.8.2, 16GB RAM, SSD

About these ads

About Chang She

I'm a co-founder at DataPad (datapad.io) where we're building the next generation agile data analytics platform using cutting edge data engineering and web-based visualizations. I love open source software and was a core pandas (pandas.pydata.org) developer. In my former life, I was a quant researcher/trader in equities (AQR Capital Management) and FX (Barclays Capital).
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s