Data Analysis for Cyclistic 

 

A Google Coursera Data Analytics Capstone Project

03: Data Cleaning and Manipulation

I began my data exploration in Excel, where I discovered that the Quarter 2 and Quarter 3 files each contained more than the 1,048,576 rows a single Excel worksheet can hold. This was the first indication that my analysis would require a more powerful tool.

A possible solution is to use Excel’s Power Query to access the additional records.

[Screenshot: the Power Query row-removal menu (powerquery_example_1.png)]

The Power Query editor lets you remove rows from a dataset before loading it into a worksheet. By removing the top 1,048,576 rows, I was able to import the overflow rows and save them as a new spreadsheet.

After saving the overflow data, I had six different Excel files. I was worried that manually repeating the cleaning process six times could introduce errors into the dataset, so I decided instead to combine everything into a single CSV file and clean it with Python.
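As a sketch of that merge step, a short script using only the standard csv module can concatenate several exports that share a header, writing the header just once. (The function name and file names below are my own illustration, not the actual scripts or export names.)

```python
import csv

def combine_csvs(paths, out_path):
    """Concatenate CSV files that share a header into one file,
    writing the header only once (from the first file)."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        header_written = False
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)  # every file repeats the header row
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)  # stream the data rows through
```

Called with the six exported files in order, e.g. `combine_csvs(["q1.csv", "q2a.csv", ...], "trips_2019.csv")`, this produces one CSV with a single header row.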

Cleaning with Python
Python's CSV module is my friend. Using Python scripts, I was able to clean and manipulate the data in the following ways:
  • Combined the quarterly datasets into a single CSV file.
  • Standardized column names.
  • Removed the superfluous BikeID column.
  • Trimmed stray whitespace from every field with Python's strip() method.
  • Converted data stored in the trip_length column from strings to integers.
  • Converted data stored in the birthyear column from floats to integers.
  • Created a new column named day_of_week by deriving the weekday name from the start_time column.
  • Created a new column that calculated trip duration from the start_time and end_time columns, then compared it to the existing trip_length column to check for inconsistencies (happily, there were none).
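The two derived-column steps at the end of that list can be sketched roughly as follows, assuming timestamps in the form 2019-03-19 10:45:00 (the real export format may differ):

```python
from datetime import datetime

def derive_fields(row):
    """Given a row dict with start_time/end_time strings, return the
    weekday name and the trip duration in whole seconds.
    Assumes '%Y-%m-%d %H:%M:%S' timestamps -- an illustrative guess,
    not necessarily the format of the actual export."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(row["start_time"], fmt)
    end = datetime.strptime(row["end_time"], fmt)
    day_of_week = start.strftime("%A")                    # e.g. 'Tuesday'
    duration_seconds = int((end - start).total_seconds())  # compare to trip_length
    return day_of_week, duration_seconds
```

Comparing `duration_seconds` against the existing trip_length value for every row is what confirmed the two columns agreed.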

The Python scripts I used were created with the help of Claude AI. They can be found here.

One challenge that arose during data cleaning was that all trip lengths over 1,000 seconds had been exported as quoted strings (e.g. "2,102") so that their grouping commas would not be read as delimiters. Left as strings, the trip_length field could not be used in calculations. Further complicating matters, Python's CSV module strips the protective quotes but leaves the comma inside the value, and int() cannot parse a string like "2,102", so my initial conversion scripts failed. Claude AI was able to write code to overcome this obstacle, but the episode highlighted a gap in my knowledge that I will need to return to later.
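In other words, csv.reader handles the quoting correctly and returns the value as the string "2,102"; the grouping comma then has to be stripped before int() can convert it. A minimal illustration:

```python
import csv
import io

# A quoted field protects its comma from being read as a delimiter;
# csv.reader un-quotes it and returns the string '2,102' intact.
sample = 'trip_id,trip_length\n22082928,"2,102"\n'
reader = csv.reader(io.StringIO(sample))
next(reader)        # skip the header row
row = next(reader)  # ['22082928', '2,102']

# int('2,102') would raise ValueError, so remove the commas first.
trip_length = int(row[1].replace(",", ""))
```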

Sample Rows

Here is a sample from my cleaned table (trip_length is in seconds). You can also download it as a CSV.

| trip_id | start_time | end_time | trip_length | day_of_week | from_station_id | from_station_name | to_station_id | to_station_name | usertype | gender | birthyear |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 22082920 | 19/03/2019 | 19/03/2019 | 472 | Tuesday | 654 | Racine Ave & Washington Blvd (*) | 174 | Canal St & Madison St | Subscriber | Female | 1970 |
| 22082921 | 19/03/2019 | 19/03/2019 | 560 | Tuesday | 638 | Clinton St & Jackson Blvd (*) | 212 | Wells St & Hubbard St | Subscriber | Male | 1981 |
| 22082922 | 19/03/2019 | 19/03/2019 | 206 | Tuesday | 77 | Clinton St & Madison St | 18 | Wacker Dr & Washington St | Subscriber | Male | 1962 |
| 22082923 | 19/03/2019 | 19/03/2019 | 182 | Tuesday | 99 | Lake Shore Dr & Ohio St | 635 | Fairbanks St & Superior St (*) | Subscriber | Female | 1985 |
| 22082924 | 19/03/2019 | 19/03/2019 | 105 | Tuesday | 302 | Sheffield Ave & Wrightwood Ave | 67 | Sheffield Ave & Fullerton Ave | Subscriber | Male | 1984 |
| 22082925 | 19/03/2019 | 19/03/2019 | 373 | Tuesday | 192 | Canal St & Adams St | 71 | Morgan St & Lake St | Subscriber | Female | 1989 |
| 22082926 | 19/03/2019 | 19/03/2019 | 350 | Tuesday | 247 | Shore Dr & 55th St | 425 | Harper Ave & 59th St | Subscriber | Female | 1989 |
| 22082927 | 19/03/2019 | 19/03/2019 | 632 | Tuesday | 174 | Canal St & Madison St | 283 | LaSalle St & Jackson Blvd | Subscriber | Male | 1963 |
| 22082928 | 19/03/2019 | 19/03/2019 | 2102 | Tuesday | 201 | Indiana Ave & 40th St | 174 | Canal St & Madison St | Subscriber | Male | 1984 |
| 22082929 | 19/03/2019 | 19/03/2019 | 164 | Tuesday | 402 | Shields Ave & 31st St | 403 | Wentworth Ave & 33rd St | Subscriber | Male | 1991 |

Data Integrity Questions

While cleaning and exploring the data, I was able to answer some questions about the data's integrity, but a few questions remained.

  • The number of trip starts matched the number of trip ends. This is how it should be, but it was satisfying to confirm.

  • Every trip has a value recorded for usertype, from_station_name, to_station_name, start_time, and end_time. There are zero null entries for any of these fields.

  • There were no trips under one minute. Why not? There were, however, 28,899 trips under two minutes (26,702 by subscribers and 2,197 by casual riders).

  • The dataset includes 1,848 trips over 24 hours, and even 418 trips longer than a week! The longest trip in the dataset is 123 days. Presumably some of these trips are errors, although there is no objective way to draw a precise line between a "real trip" and an error based on trip length alone.

    • I've decided that it is easier and safer to deal with outliers by filtering them out of my queries rather than deleting their entries from the dataset.​

  • Gender and birthyear were recorded for 99.8% of subscriber trips. This makes sense, because that data is collected when a membership is created. However, gender and birthyear were also recorded for 39.5% of casual rider trips. How would Cyclistic have gathered this data? And why couldn't it collect the data for the remaining 60.5%?
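As a sketch of the filter-rather-than-delete approach above, a summary query can simply skip outliers while leaving the file untouched. (The function name and the 24-hour cutoff are my illustration; any analysis could choose its own threshold.)

```python
import csv

DAY_SECONDS = 24 * 60 * 60  # 86,400 seconds

def mean_trip_length(path, max_seconds=DAY_SECONDS):
    """Average trip_length from the cleaned CSV, ignoring trips longer
    than max_seconds instead of deleting them from the dataset."""
    total = count = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            length = int(row["trip_length"])
            if length <= max_seconds:  # filter outliers at query time
                total += length
                count += 1
    return total / count if count else 0.0
```

Because the outliers stay in the file, a later query can always revisit them with a different cutoff.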

Go To Analysis.
