The Right Tools

I had the pleasure of assembling a new bed frame yesterday using a wrench plus a Phillips drill bit instead of a screwdriver. As you can probably imagine, it was loads of fun.

wrench + drill bit != screwdriver

About 2 hours into the 3-hour fun-fest, I began serious contemplation of suicide as I held the bit with one hand and turned it using the wrench with the other. I started developing tricks for holding the bit in place with my thumb while turning, but they only worked if I was facing the right way in relation to the screw.

In situations like this, the obvious remedy is to go out and buy a freakin’ screwdriver. This is a solution very few people would argue against. Yet for some reason, in the world of software development and data analysis, people are often reluctant to buy, or even learn to use, the right tools. Instead, band-aid packages upon band-aid packages of workarounds are built to bring the pain down to just below the tolerance threshold. Productivity is lost, often without anyone even noticing.

On the other hand, when the tool is just right for the job, coding and data analysis can be fast, efficient, and fun. This is a big part of the reason why I really love using (and by extension working on) pandas. To me, Excel and Matlab are like the wrench + drill bit. They’re not made for data science, and anything but the most basic of tasks is very cumbersome to do. R is like the manual screwdriver; with data.frame, zoo, xts, and a host of incredibly useful libraries, R is still the mainstay of data analysis. However, Python/pandas is like the cordless screwdriver that never runs out of battery and has a head that automatically morphs into the right shape when presented with a screw, a nail, or a hole that needs to be drilled.

Analyze All the Data!

I won’t go into too much detail about pandas (Wes does a much better job of that than I do anyway), but it really comes down to performance + useful features + intuitive API. Performance in pandas is achieved by 1) being careful about not copying things if you don’t have to and 2) pushing critical paths down to Cython (e.g., a lot of vectorized operations). At a high level, pandas stays useful and relevant by staying engaged with the community and looking at feature development with an eye on how data scientists think about and work with data. Things like the groupby engine, agile merging/reshaping/joining functions, and time-series functionality all grew out of experience actually solving problems that required those things. Finally, having an intuitive API is critical to getting a community of non-developers to adopt the library. I’ve heard things from non-computer science majors like “yeah I played around with pandas over the weekend and I found myself just guessing the right functions to call”.
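To make that a bit more concrete, here is a rough sketch of the kind of thing I mean: a vectorized column operation, a groupby, a merge, and a weekly time-series resample. The data, the column names (`store`, `units`, `price`, `region`), and the numbers are all made up for illustration, so treat this as a toy example rather than a recipe.

```python
import pandas as pd

# Hypothetical sample data: a few sales records (all names and values are made up).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-08"]),
    "store": ["A", "B", "A", "B"],
    "units": [10, 4, 7, 5],
    "price": [2.0, 3.0, 2.0, 3.0],
})
# Hypothetical lookup table mapping each store to a region.
stores = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Vectorized arithmetic: operate on whole columns, no explicit Python loop.
sales["revenue"] = sales["units"] * sales["price"]

# groupby: total revenue per store.
revenue_by_store = sales.groupby("store")["revenue"].sum()

# merge: attach the region to each sale, then aggregate by region.
merged = sales.merge(stores, on="store", how="left")
revenue_by_region = merged.groupby("region")["revenue"].sum()

# Time series: index by date and resample into weekly totals.
weekly_revenue = sales.set_index("date")["revenue"].resample("W").sum()

print(revenue_by_store)
print(revenue_by_region)
print(weekly_revenue)
```

None of these calls needed a trip to the documentation to guess at, which is more or less the point about the API.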

Python/pandas is getting a lot of traction as a tool of choice when analyzing data. Along with essential tools like IPython, NumPy, SciPy, Cython, StatsModels and matplotlib, the scientific Python stack will eventually become the mainstay in a data scientist’s tool-belt.


About Chang She

Engineer @ Cloudera. Ex-cofounder/CTO @ DataPad. Builder of data tools. Recovering financial quant.