Google Analytics integration in pandas

We just released 0.10.0 of pandas. One of the new features in this release is integration with Google Analytics (GA). For those of you who need to download GA data and do custom analysis in pandas, this should make your life a little easier. This modules relies on oauth2 to access data, so if you don’t have it setup, jump to the end of this article to see a brief setup guide.

The interface lives in the module pandas.io.ga and the main function is read_ga:

In [18]: import pandas.io.ga as ga

In [19]: df = ga.read_ga(['visits', 'avgTimeOnSite'], 
                         dimensions=['date', 'hour'], 
                         start_date='2012-10-25')
In [20]: df
Out[20]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1296 entries, (2012-10-25 00:00:00, 0) to (2012-12-17 00:00:00, 23)
Data columns:
visits           1296  non-null values
avgTimeOnSite    1296  non-null values
dtypes: float64(1), int64(1)

This query gets the number of visits and the average visit duration by date and hour since 2012-10-25 and returns the results in a DataFrame with ‘visits’ and ‘avgTimeOnSite’ as columns. The dimension variables ‘date’ and ‘hour’ have been set as the index. For a complete list of valid variable names, visit the google analytics developer guide.

But wait, the date and the hour of day are used as separate variables and the index is a hierarchical index. What if you want to combine the two into actual timestamps so you can take advantage of pandas timeseries API? Just like in read_csv, you can specify the parse_dates keyword:

In [25]: df = ga.read_ga(['visits', 'avgTimeOnSite'],
                         dimensions=['date', 'hour'],
                         start_date='2012-10-25',
                         parse_dates=[['date', 'hour']])
In [26]: df.head()
Out[26]:
visits avgTimeOnSite
date_hour
2012-10-25 00:00:00 1 37
2012-10-25 01:00:00 1 0
2012-10-25 02:00:00 0 0
2012-10-25 03:00:00 0 0
2012-10-25 04:00:00 4 0

And now I’ve got a DatetimeIndex so I can start doing resample, betwee_time and all the other timeseries API goodness.

If you have a LOT of data, you may not want to make a big request all at once, so there is a keyword called chunksize that will make read_ga return an iterator that yields DataFrames instead of a single DataFrame:

In [28]: it = ga.read_ga(['visits', 'avgTimeOnSite'], 
                         dimensions=['date', 'hour'], 
                         start_date='2012-10-25', 
                         parse_dates=[['date', 'hour']], 
                         chunksize=10)

In [29]: it.next()
Out[29]: 
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 00:00:00       1      37.000000
2012-10-25 01:00:00       1       0.000000
2012-10-25 02:00:00       0       0.000000
2012-10-25 03:00:00       0       0.000000
2012-10-25 04:00:00       4       0.000000
2012-10-25 05:00:00       0       0.000000
2012-10-25 06:00:00       0       0.000000
2012-10-25 07:00:00       3      30.666667
2012-10-25 08:00:00       0       0.000000
2012-10-25 09:00:00       4       0.000000

In [30]: it.next()
Out[30]: 
                     visits  avgTimeOnSite
date_hour                                 
2012-10-25 10:00:00       8      20.625000
2012-10-25 11:00:00       6      41.666667
2012-10-25 12:00:00       3     451.000000
2012-10-25 13:00:00       2      28.000000
2012-10-25 14:00:00       4     441.750000
2012-10-25 15:00:00       5      50.000000
2012-10-25 16:00:00       2     118.500000
2012-10-25 17:00:00       6      35.000000
2012-10-25 18:00:00       2       4.500000
2012-10-25 19:00:00       0       0.000000

Setup guide:

For those of you who don’t have Google Analytics setup with oauth2, you need to do a few things first before you can start accessing your data.

First, you need to go to the Google API console and create a new project. When you create the new project, make sure you turn on the “Analytics API” as a service you need. Next, click “API Access” and click the big button that says “Create OAuth2.0 client ID”. Make sure you select “Installed application” in the last step. Once the client ID has been created, you should see your client ID, secret, and redirect URIs (should be oob). Then click “Download JSON” on the right and you’ll download a “client_secrets.json”. By default, the client secrets should be in the same directory as pandas.io.ga but you can specify the secrets keyword in read_ga to the location of your secrets file.

Advertisements

About Chang She

Engineer @ Cloudera. Ex-cofounder/CTO @ DataPad. Builder of data tools. Recovering financial quant.
This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink.

27 Responses to Google Analytics integration in pandas

  1. Kyle says:

    Hmmm, did not work for me. I’ve probably missed a step somewhere:

    IOError: [Errno 13] Permission denied: ‘/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.10.1-py2.7-macosx-10.5-i386.egg/pandas/io/analytics.dat’

    Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fanalytics.readonly&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&response_type=code&client_id=259763097932.apps.googleusercontent.com&access_type=offline

    If your browser is on a different machine then exit and re-run this
    application with the command-line parameter

    –noauth_local_webserver

    • Chang She says:

      Hi Kyle,

      by default the token store is called analytics.dat and is in the pandas/io directory. What you probably need to do is use the ‘token_file_name’ keyword in the ‘read_ga’ function and specify a path that does not require sudo.

  2. Kyle says:

    Many thanks for the tip, Chang.

  3. Chang, Thanks for posting this information. Is there any other Pandas Google Analytics documentation?

  4. mgcdanny says:

    Hi – This is totally awesome. Can someone comment as how to use boolean logic in the filters and segments ? For example ga:country==United%20States,ga:country==Canada versus ga:country==United%20States;ga:country==Canada where the only difference is the use of comma or semi-colon to indicate OR and AND respectively.

    • Chang She says:

      read_ga passes in a simple list to the underlying google library being used and I don’t recall how to specify AND vs OR relations in that library. I made a pandas github issue here: https://github.com/pydata/pandas/issues/3505 happy to continue discussion on there if you have any insight or pointers.

      • Matt says:

        More than a little old, but did this get sorted out? I ask as I seem to be doing it wrong and was going to build a solution now, but thought I should first check if perhaps I’m simply just not forming ‘filters’ in the appropriate manner.

  5. Are there plans to make the Google Analytics Integration in Pandas for Python3 too? I tried running the Python 33 pandas-0.10.1 (64) and received this error:

    “C:\Python33\lib\site-packages\apiclient\discovery.py”, line 202
    except ValueError, e:
    ^
    I used Python/Pandas 2.7 and it works like a charm:)

    • Chang She says:

      I wonder if that’s the only compat issue. Definitely need to make it work with Python3. I think the issue is just figuring out a testing strategy here. Most likely I’ll just set things up on our windows jenkins box. I made a pandas github issue here: https://github.com/pydata/pandas/issues/3507 , please feel free to leave addition feedback on this issue.

  6. I was wondering if it was possible to see if the data I’m receiving is sampled? I don’t see any obvious places in ga.py where I could add containsSampledData which returns a boolean. Do you have any suggestions?
    Google documentation on containsSampledData – https://developers.google.com/analytics/devguides/reporting/core/v3/reference#sampling

    Thanks again for your responses and post.

    • Chang She says:

      Would love to get a PR on that. Made another github issue here: https://github.com/pydata/pandas/issues/3514
      I think the first place I would start is GDataReader.get_data. there’s the `_read` nested function which contains a call to execute the query object from google’s library. Pandas is looking for the results in the ‘rows’ attribute, I bet containsSampledData is also in there.

      • Paige says:

        I tried:
        from pandas.io.ga import GAnalytics
        gAnalytics = GAnalytics()
        z = gAnalytics.create_query(‘web id’, ‘pageviewsPerVisit’, ‘2013-01-13′,’2013-08-27’, ‘date’).execute().get(‘containsSampledData’)

        But I can’t seem to get segments to work. I get an error “Invalid value ‘ga:1234’ for segment.”. I thought I should be able to add segments to this query? Also, not 100% sure this code is working correctly since my results have all been “false.”

      • Chang She says:

        That looks like an exception from the underlying Google API? I haven’t looked at this code in a while. Segmented reading is part of the unit tests so I’d be surprised if there was a bug there in the pandas code, but I’d highly encourage you to go to github.com/pydata/pandas/issues/new and fill out a bug report. There are a lot of active pandas developers now so in case I can’t get to it this ensures it doesn’t get lost. Thanks!

    • Paige says:

      In case anyone else is interested I’ve filed https://github.com/pydata/pandas/issues/4693

      • Paige says:

        Sorry to clog the replies, but here’s what I think allows a user to see if the data is sampled based on a few test cases I’ve run. When I originally posted my code I had segment in ‘, but I should not have.

        working code:

        from pandas.io.ga import GAnalytics
        gAnalytics = GAnalytics()
        gAnalytics.create_query(‘web_id’, ‘pageviews’, ‘2013-08-12′,’2013-08-14’, ‘date’, segment=1234).execute().get(‘containsSampledData’)

  7. Andrew says:

    I’m new to Python but have used Google Analytics API extensively with R. When I use the ‘chunksize’ parameter how do you combine all the dataframes back into one dataframe from the iterator?

    Thanks.

  8. James says:

    HI. Im sorry, but Im a bit of a newbie.
    When I run the tests from the API page:https://developers.google.com/gdata/articles/python_client_lib#macintosh
    Everything works fine.

    However, when I do this within iPython notebook or just the Python interpreter:
    import pandas
    import pandas.io.ga as ga

    I get this error:
    —————————————————————————
    ImportError Traceback (most recent call last)
    in ()
    1 import pandas
    —-> 2 import pandas.io.ga as ga

    /Library/Python/2.7/site-packages/pandas-0.12.0-py2.7-macosx-10.7-intel.egg/pandas/io/ga.py in ()
    12 import pandas.lib as lib
    13 from pandas.io.date_converters import generic_parser
    —> 14 import pandas.io.auth as auth
    15 from pandas.util.decorators import Appender, Substitution
    16

    /Library/Python/2.7/site-packages/pandas-0.12.0-py2.7-macosx-10.7-intel.egg/pandas/io/auth.py in ()
    6 import httplib2
    7
    —-> 8 import apiclient.discovery as gapi
    9 import gflags
    10 import oauth2client.file as auth_file

    ImportError: No module named discovery

    I moved the client_secrets.json file to the io folder, but I keep getting the same error.
    Perhaps, because of my lack of experience, Im doing something simple that is wrong.
    Ive exhaustively searched stackoverflow and the web to try to understand, but I cant seem to get it to work.
    Thanks

    • Chang She says:

      Hi James, the apiclient should be part of the google client library that you install (see the setup instructions in the post). I haven’t kept up with their updates, so it is possible they’ve changed the package. If you’ve made sure that you have it installed on your setup, please make a github issue at http://www.github.com/pydata/pandas/issues/new and someone on the dev team will look into it as soon as they can.

    • Erik says:

      James, I just ran into this problem. What was causing it for me was pretty silly. I installed a package called `apiclient` using `pip install apiclient`. This is _not_ the google api client. To get the google api client, you have to `pip install google-api-python-client`. You might first have to remove the `apiclient` package though, since it will shadow the google apiclient package.

  9. Pingback: Intro to pandas data structures | spider's space

  10. Pingback: Live Streaming Κάλυψη Συνεδρείων

  11. Pingback: Google Analytics, IPython and pandas | quizzicol

  12. I am not getting any object returned?
    >>> y = ga.read_ga(token_file_name=client_secrets,
    account_name=account_name,
    property_name=property_name,
    profile_name=profile_name,
    dimensions=[‘date’, ‘hour’, ‘minute’],
    metrics=[‘pageviews’],
    start_date=start_date,
    end_date=end_date,
    index_col=0,
    parse_dates={‘datetime’: [‘date’, ‘hour’, ‘minute’]},
    date_parser=lambda x: datetime.strptime(x, ‘%Y%m%d %H %M’),
    max_results=max_results)
    >>> y
    Traceback (most recent call last):
    File “”, line 1, in
    y
    NameError: name ‘y’ is not defined

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s