For our first project as Metis Data Science Bootcamp students, we were asked to perform an exploratory data analysis on NYC MTA turnstile data. We had to come up with a potential client who might find the analysis useful, along with a backstory for it.
On top of everything we're learning in these first weeks, it seems like an impossible task. But it ties in well with the curriculum and gives us the chance to put the skills we learned over the past two weeks into practice.
Disclaimer: I am new to machine learning and to blogging (this is my first post), so if you spot any mistakes, please let me know. All feedback is appreciated.
My project's backstory was built on the MTA's recent proposal inviting private construction companies to work on station accessibility upgrades and increase the number of stations that comply with the Americans with Disabilities Act (ADA).
A fictional elevator construction company, Lift Inc., is looking into joining the program. It wants a data analysis of subway traffic during the pandemic to plan its employees' work schedules. Assuming the higher-traffic stations would be prioritized for upgrades, the company wants to identify the busiest times of day and days of the week at those popular stations, so it can minimize its employees' exposure and risk during the pandemic.
The datasets used in this project are one year of MTA turnstile data (03/21/2020 to 03/20/2021), spanning from the start of NYC's work-from-home order to the most recent available week, and MTA station information data that records each station's ADA status.
I retrieved both datasets from the MTA website and loaded them into a single SQLite database as separate tables for local storage. With preliminary cleaning and merging done in SQL, the data were then accessed in Python with SQLAlchemy.
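The loading step can be sketched roughly as below. The table name, column names, and values here are stand-ins (the real tables come from the MTA downloads); an in-memory database is used so the sketch runs on its own, where the project used a database file on disk.

```python
import pandas as pd
import sqlalchemy

# Engine for the SQLite database; the real project points at a file
# (e.g. sqlite:///mta.db), in-memory here so the sketch is self-contained
engine = sqlalchemy.create_engine("sqlite:///:memory:")

# A tiny stand-in for the turnstile table (real data comes from the MTA site)
sample = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "34 ST-PENN STA"],
    "ENTRIES": [7465832, 7465901],
    "EXITS": [2553144, 2553200],
})
sample.to_sql("turnstile_data", engine, index=False, if_exists="replace")

# With preliminary cleaning/joins done in SQL, pull the result into pandas
turnstiles = pd.read_sql("SELECT * FROM turnstile_data;", engine)
```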
Data Cleaning and Preparation
After getting the data in place, we’re ready to go!
In the first stage of data cleaning, I standardized the column names, fixed mismatched data types, and dropped unnecessary rows and columns.
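A minimal sketch of that stage, using a one-row stand-in for the raw frame (the raw MTA file is known for a column name padded with trailing whitespace, which motivates the strip below; the exact values here are made up):

```python
import pandas as pd

# Hypothetical raw row; note the padded column name, a quirk of the raw file
df = pd.DataFrame({
    "C/A": ["A002"], "STATION": ["59 ST"],
    "DATE": ["03/21/2020"], "TIME": ["03:00:00"],
    "ENTRIES": ["7465832"], "EXITS      ": ["2553144"],
})

# Standardize column names: strip whitespace, uppercase
df.columns = df.columns.str.strip().str.upper()

# Fix mismatched dtypes: combine DATE + TIME into one datetime, counts to int
df["DATETIME"] = pd.to_datetime(df["DATE"] + " " + df["TIME"],
                                format="%m/%d/%Y %H:%M:%S")
df[["ENTRIES", "EXITS"]] = df[["ENTRIES", "EXITS"]].astype(int)

# Drop columns no longer needed
df = df.drop(columns=["DATE", "TIME"])
```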
While exploring the dataset, I made a few observations:
- Turnstile counts are cumulative
- Timestamps are not recorded at fixed intervals
After computing counts as the difference between consecutive readings at each turnstile, the initial analysis surfaced erroneous values such as negative counts and impossibly high counts. I took absolute values to fix the negative counts, and replaced the outliers with the mean of the counts immediately before and after at the same turnstile.
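In code, those two fixes look roughly like this for a single turnstile (the cumulative values and the outlier threshold are made up for illustration; the real analysis picked a cutoff from the data):

```python
import pandas as pd

# Cumulative entries at one hypothetical turnstile: includes a small
# backwards step (negative diff) and one absurd jump (outlier)
cum = pd.Series([1000, 1080, 1030, 101029, 101149, 101219])

# Counts are cumulative, so per-interval traffic is the difference
counts = cum.diff().dropna()

# Fix negative counts (e.g. counter resets) with absolute values
counts = counts.abs()

# Replace outliers with the mean of the counts just before and after
threshold = 10000  # assumed cutoff for this sketch
prev_, next_ = counts.shift(1), counts.shift(-1)
for i in counts[counts > threshold].index:
    counts.loc[i] = (prev_[i] + next_[i]) / 2
```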
Then, I obtained the total traffic count by summing up the entry and exit counts.
The data also contains many different timestamps with varying intervals, which makes computation awkward, so I decided to unify everything onto 4-hour intervals.
Lastly, the data were grouped by turnstile, station, and datetime to produce a table of each station's count at each datetime. As a final cleaning step, I dropped unnecessary columns before saving the result to a .csv file for further analysis.
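The interval unification and grouping can be sketched as follows, with a tiny hypothetical frame in place of the cleaned data (column names are assumptions):

```python
import pandas as pd

# Hypothetical cleaned records: per-turnstile counts at irregular timestamps
df = pd.DataFrame({
    "STATION": ["34 ST-PENN STA"] * 4,
    "UNIT": ["R001", "R001", "R002", "R002"],
    "DATETIME": pd.to_datetime([
        "2020-03-21 01:00", "2020-03-21 05:00",
        "2020-03-21 02:30", "2020-03-21 06:30",
    ]),
    "TRAFFIC": [40, 60, 25, 35],
})

# Snap irregular timestamps onto uniform 4-hour intervals
df["DATETIME"] = df["DATETIME"].dt.floor("4h")

# Sum the turnstile-level counts up to one row per station per interval
station_traffic = (
    df.groupby(["STATION", "DATETIME"], as_index=False)["TRAFFIC"].sum()
)

# Save for further analysis
station_traffic.to_csv("station_traffic.csv", index=False)
```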
For the second part of my analysis, I merged the station info data with the cleaned dataset to filter out the stations that are not ADA compliant and made visualizations on them as well.
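The merge-and-filter step might look like this; the frames and the `ADA` column name are stand-ins for the real cleaned data and MTA station info table:

```python
import pandas as pd

# Hypothetical frames standing in for the cleaned traffic data and the
# MTA station info table (station names and values are made up)
traffic = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "FLUSHING-MAIN", "23 ST"],
    "TRAFFIC": [9000, 7000, 4000],
})
station_info = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "FLUSHING-MAIN", "23 ST"],
    "ADA": [True, False, False],
})

# Merge the ADA status onto the traffic data, then keep non-ADA stations
merged = traffic.merge(station_info, on="STATION", how="left")
non_ada = merged[~merged["ADA"]]
```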
Based on ridership during the pandemic, I found the distribution of traffic across all subway stations to be heavily right-skewed. The top ten stations are outliers in the distribution, suggesting the MTA would prioritize them over the others.
Zooming in on the top ten stations, as in the bar plot below, further shows that Penn Station has notably more traffic than the rest.
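A sketch of how such a ranking and bar plot can be produced, using synthetic totals in place of the real per-station sums (station names, numbers, and figure styling are all assumptions):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical per-station totals; the real numbers come from the cleaned data
totals = pd.DataFrame({
    "STATION": [f"STATION {i}" for i in range(15)],
    "TRAFFIC": [100 + 50 * i for i in range(15)],
})

# Keep the ten busiest stations, busiest first
top_ten = totals.nlargest(10, "TRAFFIC")

fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(top_ten["STATION"], top_ten["TRAFFIC"])
ax.invert_yaxis()  # busiest station on top
ax.set_xlabel("Total traffic")
ax.set_title("Top Ten Stations by Total Traffic")
fig.savefig("top_ten_stations.png", bbox_inches="tight")
```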
Since the MTA is also looking to increase the number of ADA stations, I isolated the non-ADA stations and identified the top ten among them. The distribution shows that, compared with the original top ten, these stations have notably less traffic.
The weekly traffic pattern of the top ten stations is shown in the heatmap below. People travel by subway far more on weekdays than on weekends. Plotting time against day reveals another interesting fact: the morning rush-hour crowd is gone, and the stations are busier in the afternoon.
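A day-by-time heatmap of this kind can be built by pivoting the 4-hour-interval data, along these lines (synthetic random traffic stands in for the real counts; column names and styling are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical 4-hour-interval traffic for one station across one week
rng = pd.date_range("2021-03-01", "2021-03-07 23:59", freq="4h")
df = pd.DataFrame({
    "DATETIME": rng,
    "TRAFFIC": np.random.default_rng(0).integers(100, 1000, len(rng)),
})
df["DAY"] = df["DATETIME"].dt.day_name()
df["HOUR"] = df["DATETIME"].dt.hour

# Pivot to days x time-of-day, keep weekdays in calendar order
pivot = df.pivot_table(index="DAY", columns="HOUR",
                       values="TRAFFIC", aggfunc="sum")
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
pivot = pivot.reindex(order)

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(pivot, cmap="viridis", ax=ax)
ax.set_title("Traffic by Day and Time")
fig.savefig("day_time_heatmap.png", bbox_inches="tight")
```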
Analyzing the top ten non-ADA stations' traffic by day of week and time of day shows a pattern similar to that of the original top ten. The trend is likely due to work-from-home arrangements during the pandemic.
For overall accessibility upgrades across the system, the company should focus on the list of top ten popular stations. If the MTA instead looks to increase the number of ADA stations, the top ten popular non-ADA stations would be the candidates.
Given the traffic patterns of both lists of stations, I would recommend the company schedule work on weekends, in the early mornings, or on night shifts to best fit its needs.
Tools Used
- SQLite for creating the database and tables, and joining tables
- SQLAlchemy for accessing the SQL database from Python
- NumPy and Pandas for data manipulation
- Matplotlib and Seaborn for plotting
- Bokeh for interactive plots (future work)
If I had more time, I would take a deeper dive into the top ten stations and determine whether each sits in a denser residential or commercial area by analyzing trends in net entries and exits, as shown in the day vs. time heatmaps below for Penn Station and Flushing Main Street. Penn Station has more people arriving during the daytime, indicating a commercial area, major transit connection, or area with more human activity. In contrast, Flushing Main Street has more people leaving in the morning and returning in the evening, which suggests a denser residential or hotel area.
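The net-flow idea could start from something as simple as an entries-minus-exits column; the stations, hours, and counts below are invented purely to illustrate the sign convention:

```python
import pandas as pd

# Hypothetical morning/evening counts for two stations. Positive net flow
# means more people entering the subway (leaving the area, residential AM);
# negative means more people exiting (arriving in the area, commercial AM).
df = pd.DataFrame({
    "STATION": ["PENN", "PENN", "FLUSHING", "FLUSHING"],
    "HOUR": [8, 18, 8, 18],
    "ENTRIES": [200, 900, 800, 150],
    "EXITS": [950, 250, 180, 700],
})
df["NET"] = df["ENTRIES"] - df["EXITS"]
```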
Another idea is to do Geo-mapping to see which stations are in proximity to nursing facilities, hospitals, or areas with higher disability populations.
Lastly, interactive plots built with libraries such as Plotly and Bokeh would let the potential client investigate areas of interest in the data on their own.
Before proceeding with data cleaning, it is helpful to take some time to explore the original data and understand it. This helps smooth the process of analyzing the data and enables you to generate insights more efficiently.
Overall, it was a great experience to put our knowledge and skills into a real-world practical scenario. I hope this project was interesting and insightful for you.
You can find my project code here.