Google Analytics integration in pandas

We just released 0.10.0 of pandas. One of the new features in this release is integration with Google Analytics (GA). For those of you who need to download GA data and do custom analysis in pandas, this should make your life a little easier. This module relies on oauth2 to access data, so if you don't have it set up, jump to the end of this article for a brief setup guide.

The interface lives in the module pandas.io.ga and the main function is read_ga:

In [18]: import pandas.io.ga as ga

In [19]: df = ga.read_ga(['visits', 'avgTimeOnSite'], 
                         dimensions=['date', 'hour'], 
                         start_date='2012-10-25')
In [20]: df
Out[20]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1296 entries, (2012-10-25 00:00:00, 0) to (2012-12-17 00:00:00, 23)
Data columns:
visits           1296  non-null values
avgTimeOnSite    1296  non-null values
dtypes: float64(1), int64(1)

This query gets the number of visits and the average visit duration by date and hour since 2012-10-25 and returns the results in a DataFrame with 'visits' and 'avgTimeOnSite' as columns. The dimension variables 'date' and 'hour' have been set as the index. For a complete list of valid variable names, visit the Google Analytics developer guide.

But wait, the date and the hour of day are used as separate variables, and the index is a hierarchical index. What if you want to combine the two into actual timestamps so you can take advantage of the pandas time series API? Just like in read_csv, you can specify the parse_dates keyword:

In [25]: df = ga.read_ga(['visits', 'avgTimeOnSite'],
                         dimensions=['date', 'hour'],
                         start_date='2012-10-25',
                         parse_dates=[['date', 'hour']])
In [26]: df.head()
Out[26]:
                     visits  avgTimeOnSite
date_hour
2012-10-25 00:00:00       1             37
2012-10-25 01:00:00       1              0
2012-10-25 02:00:00       0              0
2012-10-25 03:00:00       0              0
2012-10-25 04:00:00       4              0

And now I've got a DatetimeIndex, so I can start doing resample, between_time, and all the other time series API goodness.
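
For example, here's a minimal sketch of the kind of thing the DatetimeIndex enables (using the resample syntax of this pandas era; newer versions spell it df.resample('D').sum()):

# Daily visit totals and a business-hours slice of the GA data above
daily_visits = df['visits'].resample('D', how='sum')
busy_hours = df.between_time('09:00', '17:00')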

If you have a LOT of data, you may not want to make a big request all at once, so there is a keyword called chunksize that will make read_ga return an iterator that yields DataFrames instead of a single DataFrame:

In [28]: it = ga.read_ga(['visits', 'avgTimeOnSite'], 
                         dimensions=['date', 'hour'], 
                         start_date='2012-10-25', 
                         parse_dates=[['date', 'hour']], 
                         chunksize=10)

In [29]: it.next()
Out[29]: 
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 00:00:00       1      37.000000
2012-10-25 01:00:00       1       0.000000
2012-10-25 02:00:00       0       0.000000
2012-10-25 03:00:00       0       0.000000
2012-10-25 04:00:00       4       0.000000
2012-10-25 05:00:00       0       0.000000
2012-10-25 06:00:00       0       0.000000
2012-10-25 07:00:00       3      30.666667
2012-10-25 08:00:00       0       0.000000
2012-10-25 09:00:00       4       0.000000

In [30]: it.next()
Out[30]: 
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 10:00:00       8      20.625000
2012-10-25 11:00:00       6      41.666667
2012-10-25 12:00:00       3     451.000000
2012-10-25 13:00:00       2      28.000000
2012-10-25 14:00:00       4     441.750000
2012-10-25 15:00:00       5      50.000000
2012-10-25 16:00:00       2     118.500000
2012-10-25 17:00:00       6      35.000000
2012-10-25 18:00:00       2       4.500000
2012-10-25 19:00:00       0       0.000000

Setup guide:

For those of you who don't have Google Analytics set up with oauth2, you need to do a few things first before you can start accessing your data.

First, go to the Google API console and create a new project. When you create the new project, make sure you turn on the "Analytics API" as a service you need. Next, click "API Access" and then the big button that says "Create OAuth 2.0 client ID". Make sure you select "Installed application" in the last step. Once the client ID has been created, you should see your client ID, secret, and redirect URIs (should be oob). Then click "Download JSON" on the right to download a "client_secrets.json" file. By default, the client secrets file should be in the same directory as pandas.io.ga, but you can point the secrets keyword in read_ga at the location of your secrets file.
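
For example, a minimal sketch of pointing read_ga at a secrets file stored elsewhere (the path below is hypothetical):

import pandas.io.ga as ga

# Point read_ga at a client_secrets.json that is not in the default
# pandas.io.ga directory (the path here is just an example).
df = ga.read_ga(['visits'],
                dimensions=['date'],
                start_date='2012-10-25',
                secrets='/path/to/client_secrets.json')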


Special characters as variable names

Don’t use them.

If you name a variable "s", "n", or "c" in your nosetest test suite and then jump into the frame in pdb to debug, any expression starting with one of those variables gets interpreted as the corresponding debugger command (step, next, continue) and you'll have to do everything over again.

"d", "u", and "l" (down, up, and list) are not quite as bad, but just as exasperating.


Initial Experiments with OpenMP

tl;dr – using OpenMP via cython.parallel.prange speeds up DataFrame.corr by roughly 42% over the single-threaded case (note: shorter bar is better)

So I started looking at the potential of using OpenMP to speed up various Cythonized operations in pandas. I picked the easiest starting point, which was using cython.parallel.prange.

The test case I used was a 10000-by-100 matrix.

Let’s use numpy.corrcoef to establish a baseline:

In [2]: arr = np.array(randn(100, 10000))

In [3]: %timeit np.corrcoef(arr)
10 loops, best of 3: 24.6 ms per loop

Order doesn’t matter much for np.corrcoef:

In [4]: arr = np.array(randn(100, 10000), order='F')

In [5]: %timeit np.corrcoef(arr)
10 loops, best of 3: 28 ms per loop

This represents the "best-case" scenario. With DataFrame, the correlation matrix computation needs to be done pairwise by column so missing data can be handled. As a result, it is significantly slower than np.corrcoef.
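
Conceptually, the pairwise NA-aware computation looks something like the following pure-NumPy sketch (the real pandas version, nancorr, lives in Cython; nancorr_py here is purely illustrative):

import numpy as np

def nancorr_py(mat):
    # Illustrative pairwise, NA-aware correlation: for each pair of columns,
    # drop rows where either value is NaN, then correlate what's left.
    n, k = mat.shape
    out = np.empty((k, k))
    for i in range(k):
        for j in range(i, k):
            mask = ~(np.isnan(mat[:, i]) | np.isnan(mat[:, j]))
            if mask.sum() > 1:
                c = np.corrcoef(mat[mask, i], mat[mask, j])[0, 1]
            else:
                c = np.nan
            out[i, j] = out[j, i] = c
    return out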

Let’s see just how much NA-handling costs:

In [6]: arr = randn(10000, 100)

In [7]: df = DataFrame(arr)

In [8]: %timeit df.corr()
1 loops, best of 3: 644 ms per loop

Because this computation is by column, the order of the underlying data matters. Here’s the runtime of DataFrame.corr using a Fortran-order ndarray:

In [9]: df = DataFrame(np.array(arr, order='Fortran'))

In [10]: %timeit df.corr()
1 loops, best of 3: 231 ms per loop

The speed-up here is very significant.

Up until this point, no OpenMP optimizations have been performed. Next, I modified the code in the nancorr function to use prange instead of range. I configured it to use 8 threads (I’m on a core-i7 ivy-bridge) with dynamic scheduling.

Using prange in the nancorr function in pandas/src/moments.pyx:

for xi in parallel.prange(K, nogil=True, num_threads=8, schedule='dynamic'):

Now let’s see a performance comparison:

In [2]: arr = randn(10000, 100)

In [3]: df = DataFrame(arr)

In [4]: %timeit df.corr()
1 loops, best of 3: 373 ms per loop

This was roughly a 42% speed-up from the C-order single-threaded case. I was basing my expectations on Mike Müller's talk on Cython/OpenMP at EuroSciPy 2012, so this didn't surprise me. Notice that the F-order single-threaded case is still significantly faster.

We can achieve a proportionate speed-up on the F-order array using the multithreaded version of nancorr:

In [5]: df = DataFrame(np.array(arr, order='F'))

In [6]: %timeit df.corr()
10 loops, best of 3: 122 ms per loop

This 8-thread configuration with dynamic scheduling was the best of the set I tried. I didn't get a chance to flatten the nested for-loop to create equal run-times for each thread, but with that change static scheduling might be more optimal.

With F-order and optimally configured OpenMP, we’ve closed the performance gap with NA-unaware np.corrcoef from ~26x to ~5x. The main contributor to this speed-up is having correctly ordered arrays, but on both C and Fortran orders, using prange represented a ~42% speed-up from the single-threaded case.

Hardware: 2012 MacBook Pro, 2.3 GHz Core i7 (Ivy Bridge), OS X 10.8.2, 16 GB RAM, SSD


Write the test first

It amazes me how a small change in order affects things. When I'm working on pandas, a lot of the work is fixing bugs reported on GitHub or Stack Overflow. For a while I was just jumping right into it: fix the bug, then add a test case, run the test suite, then push to GitHub.

This generally works well, but with one problem: are you sure your test case actually hits the bug? Usually I'm sure enough that I don't think about it, but once in a while I make a mistake and don't discover it until later on. After a while I started getting paranoid, especially when the test case is a little more involved. I want to stash the changes to revert to the original code, but that means it'll revert the test case as well. So oftentimes I'd have to copy-paste the test case somewhere else, stash, copy the test case back, test, stash pop, then retest.

The solution to this problem, as I discovered later on, was a small change in the order of things. You need to write the test case first to make sure it hits the bug correctly, then put in the fix and retest. It saves a ton of work and it gives you a lot more confidence in the process.


The Right Tools

I had the pleasure of assembling a new bed frame yesterday using a wrench plus a phillips drill bit instead of a screwdriver. As you can probably imagine, it was loads of fun.

        wrench + drill bit != screwdriver

About 2 hours into the 3-hour fun-fest, I began seriously contemplating suicide as I held the bit with one hand and turned it using the wrench with the other. I developed a trick for holding the bit in place with my thumb while turning, but it only worked if I was facing the right way in relation to the screw.

In situations like this, the obvious remedy is to go out and buy a freakin’ screwdriver. This is a solution very few people would argue against. Yet for some reason, in the world of software development and data analysis, people are often reluctant to buy, or even learn to use the right tools. Instead, packages upon band-aid packages of workarounds are built to lessen the pain just below the tolerance threshold. Productivity is lost, often without even being noticed in the first place.

On the other hand, when the tool is just right for the job, coding and data analysis can be fast, efficient, and fun. This is a big part of the reason why I really love using (and by extension working on) pandas. To me, Excel or Matlab are like the wrench + drill bit: they're not made for data science, and anything but the most basic of tasks is very cumbersome. R is like the manual screwdriver; with data.frame, zoo, xts, and a host of incredibly useful libraries, R is still the mainstay of data analysis. However, Python/pandas is like the cordless screwdriver that never runs out of battery and has a head that automatically morphs into the right shape when presented with a screw, a nail, or a hole that needs to be drilled.

Analyze All the Data!

I won’t go into too much detail about pandas (Wes does a much better job of that than me anyways), but it really comes down to performance + useful features + intuitive API. Performance in pandas is achieved by 1) being careful about not copying things if you don’t have to and 2) pushing critical paths down to Cython (e.g., a lot of vectorized operations). At a high-level, pandas stays useful and relevant by staying engaged with the community and looking at feature development with an eye on how data scientists think about and work with data. Things like the groupby engine, agile merging/reshaping/joining functions, and time-series functionality all grew out of experience actually solving problems that required those things. Finally, having an intuitive API is critical to getting a community of non-developers to adopt the library. I’ve heard things from non-computer science majors like “yeah I played around with pandas over the weekend and I found myself just guessing the right functions to call”.

Python/pandas is getting a lot of traction as a tool of choice when analyzing data. Along with essential tools like IPython, NumPy, SciPy, Cython, StatsModels and matplotlib, the scientific Python stack will eventually become the mainstay in a data scientist’s tool-belt.


Fast Replace with Bottleneck

Looking at the Bottleneck library for ideas to speed up pandas Series/DataFrame.replace, I saw a set of posted benchmarks with bottleneck's replace roughly 4 times as fast as an implementation using numpy.putmask (with numpy.isnan to create an intermediate masking array). Just to verify, I ran my own benchmarks. A bit of a surprise to me: the speed-up only occurs if the value to be replaced is NaN.

Here's the setup: a 1,000,000-element floating point array with a bunch of elements set to NaN and 0, which we will then replace:

import bottleneck as bn
import numpy as np
arr = np.random.randn(10**6)
arr[np.random.randint(0, 10**6 - 1, 4 * 10**5)] = np.nan
arr[np.random.randint(0, 10**6 - 1, 4 * 10**5)] = 0

Look for NaN and replace:

a = arr.copy()
%timeit np.putmask(a, np.isnan(a), -1)
>> 100 loops, best of 3: 3.31 ms per loop

a = arr.copy()
%timeit bn.replace(a, np.nan, -1)
>> 1000 loops, best of 3: 783 us per loop

Look for other values and replace:

a = arr.copy()
%timeit np.putmask(a, a == 0, -2)
>>  1000 loops, best of 3: 1.62 ms per loop

a = arr.copy()
%timeit bn.replace(a, 0, -2)
>>  100 loops, best of 3: 2.47 ms per loop

I thought maybe it was numpy.isnan that’s the difference, but that’s not the case:

a = arr.copy()
mask = np.isnan(a)
%timeit np.putmask(a, mask, -1)
>>  100 loops, best of 3: 2.29 ms per loop

I'll have to look deeper at the Bottleneck/NumPy code, but as it stands, bottleneck's replace is useful for fillna, but probably not for replacing arbitrary values.
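
Based on these numbers, a hypothetical helper (fast_replace is my own sketch, not part of either library) could dispatch to bottleneck only for the NaN case:

import numpy as np

try:
    import bottleneck as bn
    _HAVE_BN = True
except ImportError:
    _HAVE_BN = False

def fast_replace(arr, old, new):
    # Sketch only: bottleneck wins when replacing NaN; numpy.putmask
    # is the better bet for arbitrary values.
    if np.isnan(old):
        if _HAVE_BN:
            bn.replace(arr, old, new)            # in-place
        else:
            np.putmask(arr, np.isnan(arr), new)
    else:
        np.putmask(arr, arr == old, new)
    return arr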


How Quants Choose Technology

As a former buy-side-ish quant, I think we are terrible customers for builders and sellers of analytical tools. Quants are smart enough that proof-by-intimidation doesn’t work. What’s worse, quants are very particular and very peculiar in their needs. You built a factor attribution system? Well what about rolling up into 5 day or 10 day returns? How about using Newey-West corrections to eliminate the effect of auto-correlation? Hey let’s try taking our risk model, computing an ex-ante covariance matrix for our alpha factors, and then using that derived covariance matrix in a GLS regression. I’ve actually been on both sides of that question so I know how frustrating it can be, but also how important these things are.

On the other hand, quants are also great customers. Because what they need tends to be fairly advanced in terms of both computational and mathematical requirements, there are not a lot of tools out there that meet their main criteria for good software: reliability, flexibility, transparency, usability, shrink-wrapping, and quality of support. If you are a good technologist who also happens to understand the domain problems, or if you're a quant who can also code, there is a lot of value you can add.

Quants tend not to be expert software engineers (there is an important distinction between programmer and software engineer), which means that the NIH (Not-Invented-Here) syndrome is much less prevalent among quants. It's not the architecture and design that rubs quants' egos the wrong way, and there is no competitive tension between an internal development team and a software vendor. Instead, what drives quants to want to build software in-house generally tends to be the fact that no software package currently available meets all of the criteria listed above. Commercial packages like Palantir and ClariFi are inflexible and not transparent (really, they're not even targeted towards quants at all). Traditional analytical software like Matlab is difficult to use as soon as you move away from the core matrix computations. R and other open source software are big steps in the right direction, but R packages are scattered and unsupported, and R isn't nearly as readable (which goes to transparency and usability) as Python.

Out of the criteria listed above, reliability, quality of support, and shrink-wrapping tend to be solved problems. Transparency is also not too difficult and is only complicated by the need to balance being open source (to customers) against protecting IP. This tends to be a business problem that can be ameliorated by rapid innovation and consulting relationships that provide more value to the customer than they would obtain by stopping a subscription and building in-house extensions to an older version of your software. Usability is a more difficult problem in that it requires a good deal of domain knowledge. What techniques do quants generally like to use? What are sensible default arguments? What are the most common ways to chain together statistical tools? These are all questions that can only be answered by someone with solid quantitative finance experience, and the answers will only be useful to someone with solid skills in software engineering. The final criterion, flexibility, is a fairly complex problem that requires careful consideration. There's not enough room in a blog post to go into details, but two things are important here. One is being able to handle a large number of parameters. What defaults do you set? Do you use a separate configuration object? Do you use variable parameter magic? How do you make sure users who need to do simple tasks aren't drowning in a sea of parameters that they don't need to set and shouldn't have to stress about reading all the documentation for? The other important issue is how you structure a flexible framework so that it makes standard tasks very simple, but still leaves room for the user to leave the framework at any point, add their own customizations, and then come back into the framework.

It may be a bold claim that nothing meets the needs of quants, but I think the evidence supports my conclusion very well. It rests on two underlying claims: what exactly quants need in terms of technology, and whether currently available software meets those needs. Are these two claims valid? You be the judge.


Better Visualizations

  

(chart by Elizabeth Fosslien)

I ran across an interesting article on the Freakonomics blog about the life of a quantitative analyst. A lot of the charts are very amusing, but the meta-chart above caught my eye in particular.

My personal experience meshes well with the idea she presents in the chart: quantitative researchers spend way too much time on formatting. Part of it may be that quants tend to be OCD perfectionists, but more importantly, Excel really stinks as a data visualization tool.

This is what the default Excel table looks like:

This is what we want it to look like:

This is what a default line chart looks like:

This is what we want it to look like:

Configuring the table is generally easy, but configuring the chart is often quite painful. Formatting the table for this example took me about a minute to get the basics and another 4 minutes to get the right column width and align the column header towards the right while still using an indent to leave some space. Formatting the chart took something like 10 minutes total as I tried many iterations of axis scales and label orientations. If you're very proficient with Excel you can probably cut down on that time, but then again, how many quants are Excel experts? And that was just one chart and one table! When you are working with quantitative data, every presentation, no matter how small, yields at least several tables and charts. These configuration times really add up.

Why does Excel suck so badly? Because it does the exact opposite of all the basic data visualization principles outlined in Edward Tufte's seminal work Beautiful Evidence, one particularly important one being maximizing the ratio of relevant information to the amount of "ink". Default charts and tables in Excel are a jumbled mess that is always difficult to read. It's not all Microsoft's fault: when you paste data into Excel, it is difficult to infer how the user would want to visualize it, and these types of inferences often get very messy and may do more harm than good. However, a good quant research tool should help the researcher cut down on formatting time by automating the formatting process with sensible default configurations and a flexible interface to customize the formats.

For tables, this means (a short xlsxwriter sketch applying a few of these follows the list):

-NO GRIDLINES!
-Bold/Solid border around the outside
-Lighter border separating the column/row headers
-Bold the headers
-Fill the header cells with color
Number formatting:
    -Generic floating point can be 2-3 decimal places.
    -Share quantities should be integers
    -Portfolio weights should be at most 2 places
    -Prices should be 2 places
    -Currency rates up to 5 places
    -Large amounts should be scaled down by 1e3 or 1e6
-Cell alignment so digits align down each row
-Column widths and row heights are set properly
-If headers have multiple levels, merge at higher levels
Correlation matrix:
    -Diagonal should be 1 or omitted (or 100%).
    -Rest of the matrix should be 1 or 2 decimals.
-Table title centered over the table with merged cells
-Conditional formatting for max/min/errors.
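
Here's a rough sketch of what automating a few of the table rules could look like, using the xlsxwriter package (the library choice, file name, and data are my own, purely for illustration):

import xlsxwriter

wb = xlsxwriter.Workbook('report.xlsx')
ws = wb.add_worksheet()
ws.hide_gridlines(2)                                      # no gridlines

header = wb.add_format({'bold': True, 'bg_color': '#DCE6F1',
                        'bottom': 1, 'align': 'right'})   # bold, filled, light bottom border
price_fmt = wb.add_format({'num_format': '0.00'})         # prices to 2 places
weight_fmt = wb.add_format({'num_format': '0.0%'})        # portfolio weights as percentages

ws.write_row(0, 0, ['', 'price', 'weight'], header)
ws.write(1, 0, 'AAPL')
ws.write(1, 1, 519.33, price_fmt)
ws.write(1, 2, 0.042, weight_fmt)
ws.set_column(0, 2, 12)                                   # sensible column widths
wb.close()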

For generic charts, this means (a matplotlib sketch applying several of these follows the chart lists):

-No outside border
-No inside border
-White fill the chart area
-No fill plot area (inside area which is grey by default)
Axis:
    -Dashed and light gridlines for each axis
    -No or light axis lines
    -No or light axis tick marks
    -Axis number format with fewest places possible
    -Axis font size small but easily readable 
    -Label orientation doesn’t interfere with data
    -May need to move the intercept
Title:
    -Always have a title
    -Title font should be bolded or larger in size
Legend:
    -Legend optional. If only one data series, then no legend.
    -Legend should have white fill with light or no border
    -Legend goes where there is the least amount of data

For line charts:

-Line colors should be easily readable
-Line colors shouldn't be too bright
-Line weight should be heavier so lighter colors show up better
-No tick marks unless absolutely necessary

Other chart types:

Scatter plots should pay particular attention to scaling and to "dot" color and size
Bar charts should have light or no borders around the bars and use a light or no border between x-axis values
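
The chart rules translate fairly directly into matplotlib; here's a hedged sketch (the data and labels are made up) applying several of them: white background, no chart border, light dashed gridlines, small tick labels, a bold title, heavier lines, and a lightly framed legend.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)
fig, ax = plt.subplots(facecolor='white')
ax.plot(x, np.random.randn(100).cumsum(), lw=2, label='strategy')
ax.plot(x, np.random.randn(100).cumsum(), lw=2, label='benchmark')
for spine in ax.spines.values():
    spine.set_visible(False)                     # no outside/inside borders
ax.grid(True, linestyle='--', color='0.85')      # light dashed gridlines
ax.tick_params(labelsize=9, length=0)            # small labels, no tick marks
ax.set_title('Cumulative return', fontweight='bold')
leg = ax.legend(loc='best')                      # legend where there is the least data
leg.get_frame().set_facecolor('white')           # white fill, light border
leg.get_frame().set_edgecolor('0.85')
plt.show()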

There are hundreds of additional configuration properties you can set to make your data look beautiful (while staying informative!). The ones listed above are simply some of the most common ones that take up a lot of time to change from the default. In newer versions of Excel it is possible to change the defaults, but in a corporate setting where you have multiple workstations and/or need large-scale collaboration and teamwork, a local-only set of settings soon becomes very inconvenient.

There are lots of visualization tools out there, in the form of Tableau and such. However, a good visualization tool cannot stand alone. It is nothing without the ability to easily change and manipulate data, and any piping driven from the visualization side is both cumbersome and too dependent on the visualizer alone. Instead, the automation must be driven from the data analytics tools, which can then output something to Excel so a particularly particular quant has the option to put on a few finishing touches.


Multiperiod Attribution

As we saw in my previous post on performance attribution, the returns of a quantitative portfolio that is the result of a factor model can be decomposed via straightforward regression analysis of the asset returns on the alpha factors and/or risk factors. If we look at the left-hand side of the regression a little more carefully, it's easy to see that the portfolio contribution of each asset is simply the product of the asset return and the weight of that asset in the portfolio at the beginning of the return period. By regressing the contributions on the factors, we are assuming a set relationship between the weight of the asset in the portfolio and the combination of risk factors and alpha factors. Thus we see that this methodology implicitly assumes a fixed relationship between portfolio weights and the factors.

But what if we have to explain the returns of a portfolio over multiple rebalancing periods? Because the portfolio weights are adjusted during every rebalance, the relationship between the portfolio weights and the factors will change over time. Hence, we cannot assume that the total asset returns are due to a single set of portfolio weights. We must turn to something different.

Summing Up Single Period Attributions

The most obvious way to analyze multiperiod returns is to run a single-period attribution between each portfolio rebalance and aggregate the contributions. However, this method has many moving pieces. Depending on whether the trading level is reset in every period, compounding may or may not be valid. Some practitioners like to use geometric returns for easier aggregation, while others like to stick with simple returns for ease of interpretation and add various smoothing factors to combine returns. Moreover, because the single-period BHB analysis doesn't account for transaction-related items, the aggregated return generally does not equal the whole-period return.

The final nail in the coffin is that with this method we don't have a sense of our overall exposure to various factors within the period. For very stable portfolios whose factor exposures are relatively stable over time, this is not such a big deal. But for portfolios that turn over quickly or are constructed from high-turnover factors, this is a non-trivial problem. We could approximate the average exposures with a portfolio-value-weighted mean of the factor exposures, but in many instances a fund manager doesn't rebalance on a fixed and uniform schedule, so we would also have to take the number of days into account. This gets very messy.

Return Time-Series Attribution

An alternative to summing up the results from the cross-sectional regressions is to use a time-series regression. Instead of using a fundamental factor model paradigm, where we have known factor exposures in every period, we can use a macroeconomic factor model paradigm, where we have known factor returns but unknown factor exposures. Under this paradigm, we take the time series of overall portfolio returns and regress it on the time series of factor returns. Factor returns are usually based on the return of the factor portfolio. We can use the simple factor portfolio return, or we can estimate the factor return by taking the top and bottom quintiles of the factor portfolio.

Once we have the exposures, the factor contributions are straightforward to compute: take the cumulative return of each factor over the entire attribution window and multiply it by the estimated realized factor exposure. The noise component is then the difference between the cumulative portfolio return and the sum of all the factor contributions.
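
As a rough sketch of this procedure (plain OLS via NumPy and simple additive returns; port_ret and fact_ret are hypothetical daily return inputs):

import numpy as np

def ts_attribution(port_ret, fact_ret):
    # port_ret: (T,) portfolio returns, fact_ret: (T, K) factor returns.
    X = np.column_stack([np.ones(len(port_ret)), fact_ret])  # intercept + factor returns
    beta = np.linalg.lstsq(X, port_ret, rcond=None)[0]
    exposures = beta[1:]                             # estimated realized factor exposures
    contrib = exposures * fact_ret.sum(axis=0)       # exposure x cumulative factor return
    noise = port_ret.sum() - contrib.sum()           # unexplained (noise) component
    return exposures, contrib, noise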

Pitfalls

First and foremost is the sample size issue. The Barra risk model has something like 68 factors, and if we incorporate our own alpha model, the number of factors can easily get up to 80 or 90. Even if we are doing daily return attributions, it won't be possible to do monthly attributions with that many factors. The problem is even worse if our factors generally act over longer time horizons than just a day. For daily returns, it may be possible to regress out the style factors from the alpha factors and then decompose the returns for a month (~20 observations) on the residualized alpha factors. But even that means you have to have fewer than 20 factors. If you have multiple strategies built on the same factors, it may be possible to set up a panel regression that gets around this sample size problem, but it's unclear whether it's sensible to add that much complexity into the process when the cross-sectional method is available as a simple, straightforward alternative.

Another problem with the time-series method is autocorrelation. If we use daily returns, it is fairly reasonable to assume that we may be violating the no-autocorrelation assumption in OLS. This means we also have to perform additional testing using Durbin-Watson or Breusch-Godfrey tests. The additional testing is especially crucial here because of the sample size issue, since the presence of non-spherical disturbances can make a non-trivial impact on the efficiency of the OLS estimator.

The last problem with the time-series method is that we cannot guarantee that we can reconcile it with sub-period or single period attributions. This problem is a deal-breaker in many situations when we want to look at overall return decompositions and then drill down into sub-periods that had significant contributions.

Conclusion

The cross-sectional return decomposition methods are more appropriate for equities portfolios, because in most macro strategies the cross section tends to be very small. Decomposing a time series of returns to obtain loadings on various factors or assets is a more comprehensive view of the characteristics of your portfolio over a time window.

 


Performance Attribution Revisited

Analyzing the performance of an investment strategy is an integral part of the investment process. Systematic decomposition of investment performance began with a seminal study by Brinson, Hood, and Beebower (BHB), which broke a portfolio's returns into several parts that disentangled timing, stock selection, and benchmark performance. Many variations soon followed that allowed for similar decompositions of performance to disentangle country selection, industry selection, and miscellaneous other groupings. The goal of this type of decomposition is to help investors gauge manager skill and to help managers analyze where their competencies lie.

BHB Method

The original BHB analysis broke down performance in terms of timing and security selection for an active portfolio measured against a benchmark. Under BHB, the benchmark group return times the active weight is considered the asset allocation (timing) component, the benchmark weight times the active return is considered the stock selection component, and the remaining component of the active return is called the interaction component.
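
In symbols (the standard three-term BHB decomposition for group i, with w denoting weights, r group returns, and P and B the portfolio and benchmark):

$\text{allocation}_i = (w_i^P - w_i^B)\,r_i^B, \qquad \text{selection}_i = w_i^B\,(r_i^P - r_i^B), \qquad \text{interaction}_i = (w_i^P - w_i^B)\,(r_i^P - r_i^B)$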

Factor Decomposition

The fundamental process that drove the BHB analysis, and indeed performance attribution in general, is twofold: 1) separate out the parts of a portfolio's return that do not require skill (the benchmark) and 2) analyze the performance along various dimensions (allocation vs. selection). Both processes have evolved dramatically since their original inception.

With the advent of quantitative investment and factor models, performance attribution along factor dimensions also became a common occurrence. This added a great deal of complexity to the problem. We wish to measure the contribution from each factor, but simply taking the dot product of the returns and the weights won't suffice, because the factor portfolio return will include the effects of other factors as well. Instead, we make use of regression analysis to disentangle all the factors. Suppose we have a vector r of asset contributions to portfolio return and a matrix X of factor exposures; then we can calculate the orthogonalized factor returns according to this linear model:

$r = X\beta + \epsilon$

where we solve for β via OLS:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}r$

Once we have the factor returns, it's a simple step to take the dot product of the factor exposures with the factor returns to obtain the factor contributions to portfolio return.

Finally, the residuals of this regression are considered to be return contributions from market noise, which on average should equal zero if the assumptions behind OLS are met or if the weights correctly account for the heteroskedasticity.
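
A bare-bones sketch of this cross-sectional decomposition in NumPy (the function and variable names are mine; contrib is the vector of weight-times-return asset contributions and X the factor exposure matrix):

import numpy as np

def factor_attribution(contrib, X):
    # contrib: (N,) asset contributions to portfolio return, X: (N, K) factor exposures.
    f = np.linalg.lstsq(X, contrib, rcond=None)[0]   # orthogonalized factor returns via OLS
    factor_contrib = X.sum(axis=0) * f               # exposures dotted with factor returns, per factor
    noise = contrib.sum() - factor_contrib.sum()     # residual / market noise component
    return f, factor_contrib, noise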

Factor Decomposition and Risk

Because a factor risk model is also an indispensable part of the quantitative investing process, we must take into account risk factor exposures as well as alpha factor exposures. This helps the manager clarify and control risk along dimensions based on investing style, sector membership, and country membership. It also helps investors see whether the returns are truly coming from alpha factors or from accidental bets on risk factors that are presumed to carry no return over the long run.

Thus we must extend our factor decomposition to include the risk factors as well:

$r = X\beta + X_{\Omega}\beta_{\Omega} + \epsilon$

where the subscript Ω signifies the risk model factors.

This approach has one drawback: if the alpha factors are correlated with the risk factors, then we run into multicollinearity issues. In a paper from Barra, Menchero and Poduri suggest regressing the risk factors on the alpha or custom factors and using the residuals in the regression. However, this method is not useful for disentangling the parts of a manager's strategy that simply coincide with risk factors that do not carry sustainable returns.

If we wanted to be really strict and look at strategy alpha above and beyond not just the market but also in excess of the risk factors, then we may want to reverse the roles of the two types of factors and regress the alpha factors on the risk factors:

$X = X_{\Omega}\Gamma + E$

And the residuals from this regression will be used in our new factor decomposition model:

$r = \tilde{X}\beta + X_{\Omega}\beta_{\Omega} + \epsilon$

where

$\tilde{X} = X - X_{\Omega}\hat{\Gamma}$

is the matrix of residualized alpha exposures (the alpha exposures with the risk factor component stripped out).


The factor returns from the alpha factor part of the regression now contain only the return contributions that cannot be explained by the risk factors.

Optimization Constraints and Factor Decomposition

One could argue that because of the constraints in the optimization process, bets along the risk factors are inevitable, because we simply cannot achieve 100% of our ideal view portfolio. The short response is that from the investor's point of view, it doesn't matter what the constraints are; what matters is the final value added. But a more nuanced response should recognize that many constraints are client-mandated and are often customizable. Thus it is important to be cognizant of the magnitude of their impact.

A proper treatment of constraint decomposition and its relationship with performance attribution requires a dedicated post, but I'll make two suggestions. One way to get a rough sense of how much the constraints are changing the performance attribution is to use the returns of a view portfolio (or of a theoretical unconstrained portfolio) and look at the differences in factor contributions between that portfolio's theoretical returns and the actual portfolio returns (paying attention to t-costs here). Another way may be to include the "delta" portfolios associated with each constraint as additional factors in the decomposition regression.

Conclusions

We have progressed much since the days when all we could distinguish was market performance and portfolio performance. With BHB and its variants, we were able to break down performance due to timing, stock selection, industry selection, country selection, and passive benchmark positioning. As quantitative investing and factor models came to the fore, we moved on to using regression analysis to decompose returns amongst various factors. And finally, using a factor risk model and some simple modifications, we can also strip out the return contributions from bets along risk factor dimensions that don't carry returns.
