While looking at the Bottleneck library for ideas to speed up pandas Series/DataFrame.replace, I came across a set of posted benchmarks showing bottleneck's replace running roughly 4x faster than an implementation using numpy.putmask (with numpy.isnan to build an intermediate mask array). Just to verify, I ran my own benchmarks. A bit of a surprise to me: the speedup only occurs when the value to be replaced is NaN.

Here’s the setup: a 1,000,000-element floating-point array with a bunch of elements set to NaN and 0, which we will replace:

```python
import bottleneck as bn
import numpy as np

# randn and randint need integer arguments, so cast the float literals
arr = np.random.randn(int(1e6))
arr[np.random.randint(0, int(1e6) - 1, int(4e5))] = np.nan
arr[np.random.randint(0, int(1e6) - 1, int(4e5))] = 0
```

Look for NaN and replace:

```python
a = arr.copy()
%timeit np.putmask(a, np.isnan(a), -1)
# 100 loops, best of 3: 3.31 ms per loop

a = arr.copy()
%timeit bn.replace(a, np.nan, -1)
# 1000 loops, best of 3: 783 us per loop
```

Look for other values and replace:

```python
a = arr.copy()
%timeit np.putmask(a, a == 0, -2)
# 1000 loops, best of 3: 1.62 ms per loop

a = arr.copy()
%timeit bn.replace(a, 0, -2)
# 100 loops, best of 3: 2.47 ms per loop
```
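As a sanity check (not part of the original timings), the two approaches should produce identical results on the non-NaN case. A minimal sketch, guarded so it still runs when bottleneck is not installed:

```python
import numpy as np

arr = np.random.randn(1000)
arr[::7] = 0.0  # plant some exact zeros to replace

# putmask-based replacement
a = arr.copy()
np.putmask(a, a == 0, -2)

try:
    import bottleneck as bn

    b = arr.copy()
    bn.replace(b, 0, -2)  # modifies b in place
    assert np.array_equal(a, b)
except ImportError:
    pass  # bottleneck not installed; skip the comparison
```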

I thought maybe numpy.isnan accounted for the difference, but that’s not the case:

```python
a = arr.copy()
mask = np.isnan(a)
%timeit np.putmask(a, mask, -1)
# 100 loops, best of 3: 2.29 ms per loop
```

I’ll have to look deeper at the Bottleneck/NumPy code, but as it stands, bottleneck’s replace is useful for fillna, but probably not for replacing arbitrary values.
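To make the fillna point concrete, here is a hypothetical helper (the name `fillna` and the fallback are mine, not from bottleneck or pandas) that uses bn.replace for the fast NaN path and falls back to putmask when bottleneck is unavailable:

```python
import numpy as np

try:
    import bottleneck as bn

    def fillna(arr, value):
        # bn.replace modifies its input in place; NaN is the case
        # where it beats the putmask approach in the timings above.
        bn.replace(arr, np.nan, value)
        return arr
except ImportError:
    def fillna(arr, value):
        # Fallback: two passes over the data -- one to build the
        # mask, one to write the replacement values.
        np.putmask(arr, np.isnan(arr), value)
        return arr

a = np.array([1.0, np.nan, 0.0, np.nan])
fillna(a, -1.0)
# a is now [1.0, -1.0, 0.0, -1.0]
```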


## About Chang She

Engineer @ Cloudera. Ex-cofounder/CTO @ DataPad. Builder of data tools. Recovering financial quant.