MTA Turnstile Data

Analyzing MTA Turnstile Data to Strike more Effectively

Start Date    End Date      Duration    Members
01/19/2016    01/25/2016    1 week      4


For this project, we used New York MTA turnstile data to determine which subway line is the most heavily used during the month.


The data came from the New York MTA turnstile data set: MTA-Turnstile-data

Tools Used

Use     Tool
Code    Python


Turnstiles report cumulative counts, so the number of people who went through each turnstile on a given day had to be computed from the difference between consecutive readings.
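The differencing step can be sketched as follows. This is a minimal illustration, not the script we wrote: the function name `daily_entries` and the `(turnstile_id, date, cumulative_count)` tuple layout are hypothetical simplifications of the real MTA file, and the sketch assumes one reading per turnstile per day, taken at the start of the day.

```python
from collections import defaultdict


def daily_entries(readings):
    """Turn cumulative counter readings into per-day totals.

    `readings` is a list of (turnstile_id, date, cumulative_count) tuples.
    This layout is an assumption for illustration, not the MTA file format.
    """
    by_turnstile = defaultdict(list)
    for turnstile_id, date, count in readings:
        by_turnstile[turnstile_id].append((date, count))

    totals = {}
    for turnstile_id, rows in by_turnstile.items():
        rows.sort()  # ISO dates sort chronologically as strings
        # A day's traffic is the next reading minus the current reading.
        for (date, count), (_, next_count) in zip(rows, rows[1:]):
            totals[(turnstile_id, date)] = next_count - count
    return totals


sample = [
    ("T1", "2016-01-19", 1000),
    ("T1", "2016-01-20", 1450),
    ("T1", "2016-01-21", 2100),
    ("T2", "2016-01-19", 50),
    ("T2", "2016-01-20", 80),
]
result = daily_entries(sample)
```

The last reading of each turnstile has no following reading, so it produces no daily total.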

Some turnstiles had negative cumulative counts. To compensate, we took the absolute value wherever the count was no lower than -10,000 and set the count to 0 everywhere else.
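The cleaning rule above could look roughly like this in Python. The function name `clean_count` is hypothetical, and the sketch makes no assumption about whether the rule is applied to raw readings or to daily differences; it simply works on any integer count.

```python
THRESHOLD = -10_000  # cutoff stated in the write-up


def clean_count(count, threshold=THRESHOLD):
    """Apply the cleaning rule to a single count.

    Counts at or above the threshold are replaced by their absolute value;
    counts below the threshold are treated as unusable and set to 0.
    """
    if count >= threshold:
        return abs(count)
    return 0


raw = [450, -120, -10_000, -250_000]
cleaned = [clean_count(c) for c in raw]
```

Positive counts pass through unchanged, since the absolute value of a positive number is itself.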

We had to estimate how many individuals were taking a subway line at a given time. To do this, we assumed that at every station the riders split evenly across the lines serving that station and across both directions.
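A small sketch of this even-split assumption is shown below. The station names, line names, rider totals, and the `split_riders` function are all made up for illustration; only the splitting rule comes from the text.

```python
from collections import defaultdict


def split_riders(station_totals, station_lines, directions=("N", "S")):
    """Split each station's riders evenly across its lines and directions."""
    estimate = defaultdict(float)
    for station, riders in station_totals.items():
        lines = station_lines[station]
        # Every (line, direction) pair at a station gets an equal share.
        share = riders / (len(lines) * len(directions))
        for line in lines:
            for direction in directions:
                estimate[(line, direction)] += share
    return dict(estimate)


# Hypothetical inputs: riders per station and the lines serving each station.
station_totals = {"A_STATION": 1200, "B_STATION": 400}
station_lines = {"A_STATION": ["1", "2", "3"], "B_STATION": ["1"]}

estimate = split_riders(station_totals, station_lines)

# Sum both directions to rank whole lines.
line_totals = defaultdict(float)
for (line, _direction), riders in estimate.items():
    line_totals[line] += riders
busiest_line = max(line_totals, key=line_totals.get)
```

With these made-up numbers, line "1" comes out busiest because it is the only line that collects riders from both stations.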


The data was flawed, and we did not use it over a period long enough for the errors to dissipate.


To make this project fun, we imagined we were a data science group attempting to get a contract with the New York MTA Labor Union. We were therefore looking for the busiest line so that the Union could strike more effectively.


I strongly encouraged our team to use classes, which I believed would have allowed a better analysis of our problem.


A conflict arose when I tried to get the group to write the Python script with classes, because the others in my group were not familiar with object-oriented programming.

What would I do differently

For this project, I would have taken a more systematic approach. In other words, I would have built separate functions, each computing one reliable piece of information about our problem, and combined them into an efficient analytical solution.