We just released pandas 0.10.0. One of the new features in this release is integration with Google Analytics (GA). For those of you who need to download GA data and do custom analysis in pandas, this should make your life a little easier. This module relies on OAuth 2.0 to access data, so if you don't have it set up, jump to the end of this article for a brief setup guide.
The interface lives in the module pandas.io.ga, and the main function is read_ga:
In [18]: import pandas.io.ga as ga

In [19]: df = ga.read_ga(['visits', 'avgTimeOnSite'],
                         dimensions=['date', 'hour'],
                         start_date='2012-10-25')

In [20]: df
Out[20]:
&lt;class 'pandas.core.frame.DataFrame'&gt;
MultiIndex: 1296 entries, (2012-10-25 00:00:00, 0) to (2012-12-17 00:00:00, 23)
Data columns:
visits           1296  non-null values
avgTimeOnSite    1296  non-null values
dtypes: float64(1), int64(1)
This query gets the number of visits and the average visit duration by date and hour since 2012-10-25 and returns the results in a DataFrame with 'visits' and 'avgTimeOnSite' as columns. The dimension variables 'date' and 'hour' have been set as the index. For a complete list of valid variable names, visit the Google Analytics developer guide.
But wait: the date and the hour of day are separate variables, and the index is a hierarchical index. What if you want to combine the two into actual timestamps so you can take advantage of the pandas time series API? Just like in read_csv, you can specify the parse_dates keyword:
In [25]: df = ga.read_ga(['visits', 'avgTimeOnSite'],
                         dimensions=['date', 'hour'],
                         start_date='2012-10-25',
                         parse_dates=[['date', 'hour']])

In [26]: df.head()
Out[26]:
                     visits  avgTimeOnSite
date_hour
2012-10-25 00:00:00       1             37
2012-10-25 01:00:00       1              0
2012-10-25 02:00:00       0              0
2012-10-25 03:00:00       0              0
2012-10-25 04:00:00       4              0
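Under the hood, parse_dates=[['date', 'hour']] simply combines the two dimension columns into a single timestamp per row. If you'd like to see the equivalent operation without a GA account, here is a sketch on synthetic data (the sample values are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for a GA result before date parsing
df = pd.DataFrame({
    'date': ['2012-10-25', '2012-10-25', '2012-10-25'],
    'hour': [0, 1, 2],
    'visits': [1, 1, 0],
})

# Combine the date and hour columns into one DatetimeIndex,
# mirroring what parse_dates=[['date', 'hour']] produces
df.index = pd.to_datetime(df.pop('date')) + pd.to_timedelta(df.pop('hour'), unit='h')
df.index.name = 'date_hour'
```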
And now I’ve got a DatetimeIndex, so I can start doing resample, between_time, and all the other time series API goodness.
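For instance, with a DatetimeIndex in place the usual time series operations just work. A quick sketch on synthetic hourly counts (no GA access needed; the values are made up):

```python
import pandas as pd

# Two days of synthetic hourly visit counts
df = pd.DataFrame({'visits': range(48)},
                  index=pd.date_range('2012-10-25', periods=48, freq='h'))

daily = df.resample('D').sum()                  # aggregate hourly visits to daily totals
work_hours = df.between_time('09:00', '17:00')  # keep only the 9am-5pm rows
```

Note that the method-chaining resample spelling above is the modern API; in 0.10-era pandas the same aggregation was written as df.resample('D', how='sum').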
If you have a LOT of data, you may not want to make one big request all at once, so there is a keyword called chunksize that makes read_ga return an iterator yielding DataFrames instead of a single DataFrame:
In [28]: it = ga.read_ga(['visits', 'avgTimeOnSite'],
                         dimensions=['date', 'hour'],
                         start_date='2012-10-25',
                         parse_dates=[['date', 'hour']],
                         chunksize=10)

In [29]: it.next()
Out[29]:
                     visits  avgTimeOnSite
date_hour
2012-10-25 00:00:00       1      37.000000
2012-10-25 01:00:00       1       0.000000
2012-10-25 02:00:00       0       0.000000
2012-10-25 03:00:00       0       0.000000
2012-10-25 04:00:00       4       0.000000
2012-10-25 05:00:00       0       0.000000
2012-10-25 06:00:00       0       0.000000
2012-10-25 07:00:00       3      30.666667
2012-10-25 08:00:00       0       0.000000
2012-10-25 09:00:00       4       0.000000

In [30]: it.next()
Out[30]:
                     visits  avgTimeOnSite
date_hour
2012-10-25 10:00:00       8      20.625000
2012-10-25 11:00:00       6      41.666667
2012-10-25 12:00:00       3     451.000000
2012-10-25 13:00:00       2      28.000000
2012-10-25 14:00:00       4     441.750000
2012-10-25 15:00:00       5      50.000000
2012-10-25 16:00:00       2     118.500000
2012-10-25 17:00:00       6      35.000000
2012-10-25 18:00:00       2       4.500000
2012-10-25 19:00:00       0       0.000000
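If you do eventually want everything in one frame, the chunks can be processed one at a time and stitched together with pd.concat. A sketch using a hypothetical stand-in generator (in place of the real ga.read_ga call, which needs network access):

```python
import pandas as pd

def chunked_frames(chunksize=10, periods=30):
    """Hypothetical stand-in for ga.read_ga(..., chunksize=10): yields DataFrames."""
    idx = pd.date_range('2012-10-25', periods=periods, freq='h')
    for start in range(0, periods, chunksize):
        yield pd.DataFrame({'visits': range(start, start + chunksize)},
                           index=idx[start:start + chunksize])

# Consume the iterator chunk by chunk, then combine the results
total = pd.concat(list(chunked_frames()))
```

(In Python 3, next(it) replaces the it.next() spelling shown in the session above.)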
Setup guide:
For those of you who don’t have Google Analytics set up with OAuth 2.0, you need to do a few things first before you can start accessing your data.
First, you need to go to the Google API console and create a new project. When you create the new project, make sure you turn on the “Analytics API” as a service you need. Next, click “API Access”, then click the big button that says “Create OAuth 2.0 client ID”, and make sure you select “Installed application” in the last step. Once the client ID has been created, you should see your client ID, secret, and redirect URIs (which should be “oob”). Then click “Download JSON” on the right to download a “client_secrets.json” file. By default, the client secrets file is expected to be in the same directory as pandas.io.ga, but you can point the secrets keyword in read_ga to the location of your secrets file.