Google Analytics integration in pandas

We just released 0.10.0 of pandas. One of the new features in this release is integration with Google Analytics (GA). For those of you who need to download GA data and do custom analysis in pandas, this should make your life a little easier. This module relies on oauth2 to access data, so if you don’t have it set up, jump to the end of this article for a brief setup guide.

The interface lives in the module pandas.io.ga and the main function is read_ga:

In [18]: import pandas.io.ga as ga

In [19]: df = ga.read_ga(['visits', 'avgTimeOnSite'], 
                         dimensions=['date', 'hour'], 
                         start_date='2012-10-25')
In [20]: df
Out[20]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1296 entries, (2012-10-25 00:00:00, 0) to (2012-12-17 00:00:00, 23)
Data columns:
visits           1296  non-null values
avgTimeOnSite    1296  non-null values
dtypes: float64(1), int64(1)

This query gets the number of visits and the average time on site by date and hour since 2012-10-25, and returns the results in a DataFrame with ‘visits’ and ‘avgTimeOnSite’ as columns. The dimension variables ‘date’ and ‘hour’ have been set as the index. For a complete list of valid variable names, see the Google Analytics developer guide.

But wait: the date and the hour of day come back as separate variables, and the index is a hierarchical index. What if you want to combine the two into actual timestamps so you can take advantage of the pandas timeseries API? Just like in read_csv, you can specify the parse_dates keyword:

In [25]: df = ga.read_ga(['visits', 'avgTimeOnSite'],
                         dimensions=['date', 'hour'],
                         start_date='2012-10-25',
                         parse_dates=[['date', 'hour']])
In [26]: df.head()
Out[26]:
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 00:00:00       1             37
2012-10-25 01:00:00       1              0
2012-10-25 02:00:00       0              0
2012-10-25 03:00:00       0              0
2012-10-25 04:00:00       4              0

And now I’ve got a DatetimeIndex, so I can start doing resample, between_time, and all the other timeseries API goodness.
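For instance (a quick illustration using the pandas 0.10-era resample API; the variable names are just for show):

daily_visits = df['visits'].resample('D', how='sum')   # hourly visits rolled up to daily totals
business_hours = df.between_time('09:00', '17:00')     # keep only rows between 9am and 5pm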

If you have a LOT of data, you may not want to make one big request all at once, so there is a chunksize keyword that makes read_ga return an iterator yielding DataFrames instead of a single DataFrame:

In [28]: it = ga.read_ga(['visits', 'avgTimeOnSite'], 
                         dimensions=['date', 'hour'], 
                         start_date='2012-10-25', 
                         parse_dates=[['date', 'hour']], 
                         chunksize=10)

In [29]: it.next()
Out[29]: 
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 00:00:00       1      37.000000
2012-10-25 01:00:00       1       0.000000
2012-10-25 02:00:00       0       0.000000
2012-10-25 03:00:00       0       0.000000
2012-10-25 04:00:00       4       0.000000
2012-10-25 05:00:00       0       0.000000
2012-10-25 06:00:00       0       0.000000
2012-10-25 07:00:00       3      30.666667
2012-10-25 08:00:00       0       0.000000
2012-10-25 09:00:00       4       0.000000

In [30]: it.next()
Out[30]: 
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 10:00:00       8      20.625000
2012-10-25 11:00:00       6      41.666667
2012-10-25 12:00:00       3     451.000000
2012-10-25 13:00:00       2      28.000000
2012-10-25 14:00:00       4     441.750000
2012-10-25 15:00:00       5      50.000000
2012-10-25 16:00:00       2     118.500000
2012-10-25 17:00:00       6      35.000000
2012-10-25 18:00:00       2       4.500000
2012-10-25 19:00:00       0       0.000000

Setup guide:

For those of you who don’t have Google Analytics set up with oauth2, you need to do a few things before you can start accessing your data.

First, go to the Google API console and create a new project. When you create the new project, make sure you turn on the “Analytics API” as a service you need. Next, click “API Access” and then click the big button that says “Create OAuth2.0 client ID”. Make sure you select “Installed application” in the last step. Once the client ID has been created, you should see your client ID, secret, and redirect URIs (which should be oob). Then click “Download JSON” on the right to download a “client_secrets.json” file. By default, read_ga looks for the client secrets file in the same directory as pandas.io.ga, but you can point the secrets keyword in read_ga at the location of your secrets file.
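For example, pointing read_ga at a secrets file stored somewhere else (the path here is just a placeholder):

df = ga.read_ga(['visits'],
                dimensions=['date'],
                start_date='2012-10-25',
                secrets='/path/to/client_secrets.json')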


Special characters as variable names

Don’t use them.

If you name a variable “s” or “n” or “c” in your nosetest test suite and then jump into the frame in pdb to debug, any expression starting with one of those names will be interpreted as the corresponding debugger command (step, next, continue), and you’ll have to start the whole run over again.

“d”, “u”, and “l” are not as bad but equally exasperating.
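For example, at the pdb prompt (with a variable that happens to be named n):

(Pdb) n        # meant to look at the variable n; pdb runs its "next" command instead
(Pdb) p n      # workaround: use the p command (or a ! prefix) to reference the variable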


Initial Experiments with OpenMP

tl;dr – (benchmark summary chart; shorter bar is better)

So I started looking at the potential of using OpenMP to speed up various Cythonized operations in pandas. I picked the easiest starting point, which was using cython.parallel.prange. The test case I used was a 10000-by-100 matrix.

Let’s use numpy.corrcoef to establish a baseline:

In [2]: arr = np.array(randn(100, 10000))

In [3]: %timeit np.corrcoef(arr)
10 loops, best of 3: 24.6 ms per loop

Order doesn’t matter much for np.corrcoef:

In [4]: arr = np.array(randn(100, 10000), order='F')

In [5]: %timeit np.corrcoef(arr)
10 loops, best of 3: 28 ms per loop

This represents the “best-case” scenario. With DataFrame, the correlation matrix computation needs to be done pair-wise by column so missing data can be handled. As a result, it is significantly slower than np.corrcoef.
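To see why, here is a rough sketch of the kind of pairwise, NA-aware loop involved (plain NumPy, for illustration only; the actual implementation is the Cython routine nancorr discussed below):

import numpy as np

def naive_nancorr(mat):
    # mat: N x K ndarray; correlate each pair of columns using only the rows
    # where both columns are non-null, which is why the computation cannot be
    # a single vectorized call like np.corrcoef
    N, K = mat.shape
    out = np.empty((K, K))
    for i in range(K):
        for j in range(i + 1):
            valid = ~(np.isnan(mat[:, i]) | np.isnan(mat[:, j]))
            if valid.sum() < 2:
                out[i, j] = out[j, i] = np.nan
            else:
                c = np.corrcoef(mat[valid, i], mat[valid, j])[0, 1]
                out[i, j] = out[j, i] = c
    return out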

Let’s see just how much NA-handling costs:

In [6]: arr = randn(10000, 100)

In [7]: df = DataFrame(arr)

In [8]: %timeit df.corr()
1 loops, best of 3: 644 ms per loop

Because this computation is by column, the order of the underlying data matters. Here’s the runtime of DataFrame.corr using a Fortran-order ndarray:

In [9]: df = DataFrame(np.array(arr, order='Fortran'))

In [10]: %timeit df.corr()
1 loops, best of 3: 231 ms per loop

The speed-up here is very significant.

Up until this point, no OpenMP optimizations have been performed. Next, I modified the code in the nancorr function to use prange instead of range. I configured it to use 8 threads (I’m on a Core i7, Ivy Bridge) with dynamic scheduling.

Using prange in the nancorr function in pandas/src/moments.pyx:

for xi in parallel.prange(K, nogil=True, num_threads=8, schedule='dynamic'):
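
For context, here is a minimal, self-contained sketch of what a prange-parallelized pairwise kernel can look like. This is illustrative only (not the actual nancorr code), and it assumes the extension is compiled with OpenMP enabled (e.g. gcc's -fopenmp flag):

# cython: boundscheck=False, wraparound=False
# illustrative sketch only; not the pandas nancorr implementation
from cython.parallel import prange
import numpy as np

def pairwise_dot(double[:, :] mat, int num_threads=8):
    cdef Py_ssize_t N = mat.shape[0], K = mat.shape[1]
    cdef Py_ssize_t xi, yi, i
    cdef double s
    out_arr = np.empty((K, K))
    cdef double[:, :] out = out_arr
    # each prange iteration owns row/column xi of the output,
    # so threads never write to the same location
    for xi in prange(K, nogil=True, num_threads=num_threads, schedule='dynamic'):
        for yi in range(xi + 1):
            s = 0.0
            for i in range(N):
                s = s + mat[i, xi] * mat[i, yi]
            out[xi, yi] = s
            out[yi, xi] = s
    return out_arr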

Now let’s see a performance comparison:

In [2]: arr = randn(10000, 100)

In [3]: df = DataFrame(arr)

In [4]: %timeit df.corr()
1 loops, best of 3: 373 ms per loop

This was roughly a 42% speed-up over the C-order, single-threaded case. I was basing my expectations on Mike Müller’s talk on Cython/OpenMP at EuroSciPy 2012, so this didn’t surprise me. Notice that the F-order, single-threaded case is still significantly faster.

We can achieve a proportionate speed-up on the F-order array using the multithreaded version of nancorr:

In [5]: df = DataFrame(np.array(arr, order='F'))

In [6]: %timeit df.corr()
10 loops, best of 3: 122 ms per loop

This 8-thread configuration with dynamic scheduling was the best of the set that I tried. I didn’t get a chance to flatten the nested for-loop to give each thread an equal amount of work, but static scheduling with that change might be even better.

With F-order data and an optimally configured OpenMP build, we’ve closed the performance gap with the NA-unaware np.corrcoef from ~26x to ~5x. The main contributor to this speed-up is having correctly ordered arrays, but in both C and Fortran order, using prange gave a ~42% speed-up over the single-threaded case.

Hardware: 2012 MacBook Pro, 2.3 GHz Core i7, OS X 10.8.2, 16 GB RAM, SSD


Write the test first

It amazes me how a small change in order affects things. When I’m working on pandas, a lot of the work is fixing bugs reported on GitHub or Stack Overflow. For a while I would just jump right in: fix the bug, add a test case, run the test suite, then push to GitHub.

This generally works well, but with one problem: are you sure your test case actually hits the bug? Usually I’m sure enough that I don’t think about it, but once in a while I make a mistake and don’t discover it until later. After a while I started getting paranoid, especially when the test case is a little more involved. I want to stash the changes to revert to the original code, but that would revert the test case as well. So oftentimes I’d have to copy-paste the test case somewhere else, stash, copy the test case back, test, stash pop, then retest.

The solution to this problem, as I discovered later on, was a small change in the order of things. You need to write the test case first to make sure it hits the bug correctly, then put in the fix and retest. It saves a ton of work and it gives you a lot more confidence in the process.
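In nose terms, the loop looks roughly like this (the test path and test name here are hypothetical):

# 1. write the test case and run it against the unfixed code; it should fail
nosetests pandas/tests/test_frame.py:TestDataFrame.test_reported_bug
# 2. apply the fix, rerun, and confirm that it now passes
nosetests pandas/tests/test_frame.py:TestDataFrame.test_reported_bug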


The Right Tools

I had the pleasure of assembling a new bed frame yesterday using a wrench plus a Phillips drill bit instead of a screwdriver. As you can probably imagine, it was loads of fun.

        wrench + drill bit  !=  screwdriver

About 2 hours into the 3-hour fun-fest, I began serious contemplation of suicide as I held the bit with one hand and turned it with the wrench in the other. I developed a trick for holding the bit in place with my thumb while turning, but it only worked if I was facing the right way in relation to the screw.

In situations like this, the obvious remedy is to go out and buy a freakin’ screwdriver. This is a solution very few people would argue against. Yet for some reason, in the world of software development and data analysis, people are often reluctant to buy, or even learn to use, the right tools. Instead, packages upon band-aid packages of workarounds are built to keep the pain just below the tolerance threshold. Productivity is lost, often without even being noticed in the first place.

On the other hand, when the tool is just right for the job, coding and data analysis can be fast, efficient, and fun. This is a big part of the reason why I really love using (and by extension working on) pandas. To me, Excel or Matlab are like the wrench plus drill bit: they’re not made for data science, and anything but the most basic tasks is very cumbersome. R is like the manual screwdriver; with data.frame, zoo, xts, and a host of incredibly useful libraries, R is still the mainstay of data analysis. Python/pandas, however, is like the cordless screwdriver that never runs out of battery and has a head that automatically morphs into the right shape when presented with a screw, a nail, or a hole that needs to be drilled.

Analyze All the Data!

I won’t go into too much detail about pandas (Wes does a much better job of that than I do anyway), but it really comes down to performance + useful features + intuitive API. Performance in pandas is achieved by 1) being careful not to copy data unless you have to and 2) pushing critical paths down to Cython (e.g., a lot of vectorized operations). At a high level, pandas stays useful and relevant by staying engaged with the community and looking at feature development with an eye on how data scientists think about and work with data. Things like the groupby engine, agile merging/reshaping/joining functions, and time-series functionality all grew out of experience actually solving problems that required those things. Finally, having an intuitive API is critical to getting a community of non-developers to adopt the library. I’ve heard things from non-computer-science majors like “yeah, I played around with pandas over the weekend and I found myself just guessing the right functions to call”.

Python/pandas is getting a lot of traction as a tool of choice when analyzing data. Along with essential tools like IPython, NumPy, SciPy, Cython, StatsModels and matplotlib, the scientific Python stack will eventually become the mainstay in a data scientist’s tool-belt.


Fast Replace with Bottleneck

Looking at the Bottleneck library for ideas to speed up pandas Series/DataFrame.replace, I saw a set of posted benchmarks showing Bottleneck’s replace running roughly 4 times as fast as an implementation using numpy.putmask (with numpy.isnan to create an intermediate mask array). Just to verify, I ran my own benchmarks. A bit of a surprise to me: the speed-up only occurs when the value being replaced is NaN.

Here’s the setup: a 1,000,000-element floating point array with a bunch of elements set to NaN and 0, which we will replace:

import bottleneck as bn
import numpy as np
arr = np.random.randn(10 ** 6)
arr[np.random.randint(0, 10 ** 6 - 1, 4 * 10 ** 5)] = np.nan
arr[np.random.randint(0, 10 ** 6 - 1, 4 * 10 ** 5)] = 0

Look for NaN and replace:

a = arr.copy()
%timeit np.putmask(a, np.isnan(a), -1)
>> 100 loops, best of 3: 3.31 ms per loop

a = arr.copy()
%timeit bn.replace(a, np.nan, -1)
>> 1000 loops, best of 3: 783 us per loop

Look for other values and replace:

a = arr.copy()
%timeit np.putmask(a, a == 0, -2)
>>  1000 loops, best of 3: 1.62 ms per loop

a = arr.copy()
%timeit bn.replace(a, 0, -2)
>>  100 loops, best of 3: 2.47 ms per loop

I thought maybe numpy.isnan was the difference, but that’s not the case:

a = arr.copy()
mask = np.isnan(a)
%timeit np.putmask(a, mask, -1)
>>  100 loops, best of 3: 2.29 ms per loop

I’ll have to look deeper at the Bottleneck/NumPy code, but as it stands, Bottleneck’s replace is useful for fillna, but probably not for replacing arbitrary values.
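Given these numbers, a small dispatch helper (hypothetical, not pandas API) might look like this:

import numpy as np
import bottleneck as bn

def fast_replace(arr, old, new):
    # use bottleneck only for the NaN case, where it wins;
    # bn.replace modifies arr in place
    if old != old:  # NaN is the only value not equal to itself
        bn.replace(arr, old, new)
    else:
        np.putmask(arr, arr == old, new)
    return arr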


How Quants Choose Technology

As a former buy-side-ish quant, I think we are terrible customers for builders and sellers of analytical tools. Quants are smart enough that proof-by-intimidation doesn’t work. What’s worse, quants are very particular and very peculiar in their needs. You built a factor attribution system? Well, what about rolling up into 5-day or 10-day returns? How about using Newey-West corrections to eliminate the effect of autocorrelation? Hey, let’s try taking our risk model, computing an ex-ante covariance matrix for our alpha factors, and then using that derived covariance matrix in a GLS regression. I’ve actually been on both sides of these questions, so I know how frustrating they can be, but also how important these things are.

On the other hand, quants are also great customers. Because what they need tends to be fairly advanced in both computational and mathematical terms, there are not a lot of tools out there that meet their main criteria for good software: reliability, flexibility, transparency, usability, shrink-wrapping, and quality of support. If you are a good technologist who also happens to understand the domain problems, or if you’re a quant who can also code, there is a lot of value you can add.

Quants tend not to be expert software engineers (there is an important distinction between programmer and software engineer), which means that NIH (Not-Invented-Here) syndrome is much less prevalent among quants. It’s not the architecture and design that rubs quant egos the wrong way, and there is no competitive tension between an internal development team and a software vendor. Instead, what drives quants to build software in-house generally tends to be the fact that no software package currently available meets all of the criteria listed above. Commercial packages like Palantir and ClariFi are inflexible and not transparent (really, they’re not even targeted toward quants at all). Traditional analytical software like Matlab is difficult to use as soon as you move away from the core matrix computations. R and other open-source software are big steps in the right direction, but R packages are scattered and unsupported, and the language isn’t nearly as readable (which goes to transparency and usability) as Python.

Out of the criteria listed above, reliability, quality of support, and shrink-wrapping tend to be solved problems. Transparency is also not too difficult, and is only complicated by the need to balance being open source (to customers) against protecting IP. This tends to be a business problem that can be ameliorated by rapid innovation and consulting relationships that provide more value to the customer than they would obtain by stopping a subscription and building in-house extensions to an older version of your software.

Usability is a more difficult problem in that it requires a good deal of domain knowledge. What techniques do quants generally like to use? What are sensible default arguments? What are the most common ways to chain together statistical tools? These are all questions that can only be answered by someone with solid quantitative finance experience, and the answers will only be useful to someone with solid skills in software engineering.

The final criterion, flexibility, is a fairly complex problem that requires careful consideration. There’s not enough room in a blog post to go into details, but two things are important here. One is being able to handle a large number of parameters. What defaults do you set? Do you use a separate configuration object? Do you use variable-parameter magic? How do you make sure users who need to do simple tasks aren’t drowning in a sea of parameters that they don’t need to set and shouldn’t have to read all the documentation for? The other important issue is how you structure a flexible framework so that it makes standard tasks very simple, but still leaves room for the user to leave the framework at any point, add their own customizations, and then come back into the framework.

It may be a bold claim that nothing meets the needs of quants, but I think the evidence supports my conclusion very well. It rests on two underlying claims: what exactly quants need in terms of technology, and whether currently available software meets those needs. Are those two claims valid? You be the judge.
