Data Analysis for Cyclistic
A Google Coursera Data Analytics Capstone Project
03: Data Cleaning and Manipulation
I began my data exploration with Excel. I discovered that the Quarter 2 and Quarter 3 files contained more than the 1,048,576 records that a single Excel file can handle. This was the first indication that my analysis would require a more powerful tool.
A possible solution is to use Excel’s Power Query to access the additional records.

The Power Query menu allows you to remove records from a dataset before importing the data into a spreadsheet. By removing the top 1,048,576 records, I was able to import and save a new spreadsheet with the overflow rows.
After saving the overflow data, I had six different Excel files. Worried that manually repeating the cleaning process six times could introduce errors into the dataset, I decided instead to combine the files into a single CSV document and clean it with Python.
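The combining step can be sketched with Python's csv module. This is a minimal sketch, assuming hypothetical file names (the real quarterly exports may be named differently); it keeps the shared header from the first file only.

```python
import csv
import glob

def combine_csvs(paths, out_path):
    """Concatenate CSV files that share a header, writing the header once."""
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        header_written = False
        for path in paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader)  # skip each file's header row
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)

# Hypothetical file names for the six exports.
combine_csvs(sorted(glob.glob("Divvy_Trips_2019_Q*.csv")), "combined_trips.csv")
```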
Cleaning with Python
Python's csv module is my friend. Using Python scripts, I cleaned and manipulated the data in the following ways:
- Combined the quarterly datasets into a single CSV document.
- Standardized column names.
- Removed the superfluous BikeID column.
- Trimmed whitespace from every column with Python's strip() method.
- Converted data stored in the trip_length column from strings to integers.
- Converted data stored in the birthyear column from floats to integers.
- Created a new column named day_of_week that used the weekday() method to retrieve the weekday from the start_time column.
- Created a new column that calculated trip duration from the start_time and end_time columns, and then compared this column to the already existing trip_length column to find any inconsistencies (happily, there were none).
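A condensed sketch of the per-row cleaning steps above. The column names match the sample table, but the timestamp format ("%d/%m/%Y %H:%M:%S") and the exact raw values are assumptions about the export:

```python
import csv
from datetime import datetime

TS_FORMAT = "%d/%m/%Y %H:%M:%S"  # assumed day/month ordering, per the sample rows

def clean_row(row):
    """Clean one DictReader row: normalize names, fix types, derive columns."""
    # Standardize column names and trim whitespace from every field.
    row = {key.strip().lower(): value.strip() for key, value in row.items()}
    row.pop("bikeid", None)  # drop the superfluous BikeID column
    # "2,102" -> 2102 (quoted thousands separators removed before conversion)
    row["trip_length"] = int(row["trip_length"].replace(",", ""))
    if row.get("birthyear"):
        row["birthyear"] = int(float(row["birthyear"]))  # "1984.0" -> 1984
    start = datetime.strptime(row["start_time"], TS_FORMAT)
    end = datetime.strptime(row["end_time"], TS_FORMAT)
    row["day_of_week"] = start.strftime("%A")  # e.g. "Tuesday"
    # Cross-check the recorded trip_length against the computed duration.
    assert int((end - start).total_seconds()) == row["trip_length"]
    return row
```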
The Python scripts I used were created with the help of Claude AI. They can be found here.
One challenge that arose during data cleaning was that all trip lengths over 1,000 seconds were recorded as quoted strings to prevent their commas from being interpreted as delimiters. This made it impossible to perform calculations on the trip_length field. Further complicating matters, my initial scripts mishandled these quoted fields, so my first attempts at converting the strings to integers were unsuccessful. Claude AI was able to write code to overcome this obstacle, but this highlighted a gap in my knowledge that I will need to return to later.
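The fix turns out to be small: the csv reader keeps a quoted field like "2,102" intact as a single value, so the remaining step is just stripping the comma before converting. A minimal sketch with an in-memory sample row:

```python
import csv
import io

# One sample row with a quoted trip_length containing a thousands separator.
sample = 'trip_id,trip_length\n22082928,"2,102"\n'

reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    # The quoted field arrives as the string "2,102"; remove the comma,
    # then convert to an integer so calculations become possible.
    trip_length = int(row["trip_length"].replace(",", ""))
    print(trip_length)  # -> 2102
```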
Sample Rows
Here is a sample from my cleaned table. You can also download it as a CSV.
| trip_id | start_time | end_time | trip_length | day_of_week | from_station_id | from_station_name | to_station_id | to_station_name | usertype | gender | birthyear |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 22082920 | 19/03/2019 | 19/03/2019 | 472 | Tuesday | 654 | Racine Ave & Washington Blvd (*) | 174 | Canal St & Madison St | Subscriber | Female | 1970 |
| 22082921 | 19/03/2019 | 19/03/2019 | 560 | Tuesday | 638 | Clinton St & Jackson Blvd (*) | 212 | Wells St & Hubbard St | Subscriber | Male | 1981 |
| 22082922 | 19/03/2019 | 19/03/2019 | 206 | Tuesday | 77 | Clinton St & Madison St | 18 | Wacker Dr & Washington St | Subscriber | Male | 1962 |
| 22082923 | 19/03/2019 | 19/03/2019 | 182 | Tuesday | 99 | Lake Shore Dr & Ohio St | 635 | Fairbanks St & Superior St (*) | Subscriber | Female | 1985 |
| 22082924 | 19/03/2019 | 19/03/2019 | 105 | Tuesday | 302 | Sheffield Ave & Wrightwood Ave | 67 | Sheffield Ave & Fullerton Ave | Subscriber | Male | 1984 |
| 22082925 | 19/03/2019 | 19/03/2019 | 373 | Tuesday | 192 | Canal St & Adams St | 71 | Morgan St & Lake St | Subscriber | Female | 1989 |
| 22082926 | 19/03/2019 | 19/03/2019 | 350 | Tuesday | 247 | Shore Dr & 55th St | 425 | Harper Ave & 59th St | Subscriber | Female | 1989 |
| 22082927 | 19/03/2019 | 19/03/2019 | 632 | Tuesday | 174 | Canal St & Madison St | 283 | LaSalle St & Jackson Blvd | Subscriber | Male | 1963 |
| 22082928 | 19/03/2019 | 19/03/2019 | 2102 | Tuesday | 201 | Indiana Ave & 40th St | 174 | Canal St & Madison St | Subscriber | Male | 1984 |
| 22082929 | 19/03/2019 | 19/03/2019 | 164 | Tuesday | 402 | Shields Ave & 31st St | 403 | Wentworth Ave & 33rd St | Subscriber | Male | 1991 |
Data Integrity Questions
While cleaning and exploring the data, I was able to answer some questions about the data's integrity, but a few questions remained.
- The number of trip starts matched the number of trip ends. This is how it should be, but it was satisfying to confirm.
- Every trip has a value recorded for usertype, from_station_name, to_station_name, start_time, and end_time. There are zero null entries for any of these fields.
- There were no trips under a minute. How come? There were 28,899 trips under two minutes (26,702 by subscribers and 2,197 by casual riders).
- The dataset includes 1,848 trips over 24 hours, and even 418 trips over a week long! The longest trip in the dataset is 123 days. Presumably some of these trips are errors, although there is no objective way to precisely define a "real trip" by its trip length.
- I've decided that it is easier and safer to deal with outliers by filtering them out of my queries rather than deleting their entries from the dataset.
- The gender and birthyear were recorded for 99.8% of subscriber trips. This makes sense because that data is collected when the membership is created. However, gender and birthyear were also recorded for 39.5% of casual rider trips. How would Cyclistic have gathered this data? And why couldn't they collect it for the remaining 60.5%?
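Filtering outliers at query time, rather than deleting rows, can be sketched as a simple predicate. The one-minute and 24-hour thresholds below are illustrative choices drawn from the findings above, not fixed rules:

```python
# Illustrative bounds: keep trips between one minute and 24 hours,
# leaving the underlying dataset untouched.
MIN_SECONDS = 60
MAX_SECONDS = 24 * 60 * 60

def is_plausible_trip(trip_length):
    """Return True if a trip length (in seconds) falls within the bounds."""
    return MIN_SECONDS <= trip_length <= MAX_SECONDS

# Example: the last two values are a sub-minute trip and the 123-day outlier.
trips = [105, 472, 2102, 30, 10_627_200]
plausible = [t for t in trips if is_plausible_trip(t)]
print(plausible)  # -> [105, 472, 2102]
```

Because the filter lives in the query rather than the data, the thresholds can be revisited later without re-running the cleaning pipeline.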
Go To Analysis.
