Data Analysis for Cyclistic 

 

A Google Coursera Data Analytics Capstone Project

03: Data Cleaning and Manipulation

I began my data exploration in Excel, where I discovered that the Quarter 2 and Quarter 3 files each contained more than the 1,048,576 rows a single Excel worksheet can hold. This was the first indication that my analysis would require a more powerful tool.

A possible solution is to use Excel’s Power Query to access the additional records.

[Screenshot: the Power Query row-removal menu (powerquery_example_1.png)]

The Power Query editor lets you remove rows from a dataset before loading it into a worksheet. By removing the top 1,048,576 rows, I was able to import the overflow rows and save them as a new spreadsheet.

After saving the overflow data, I had six different Excel files. I was worried that manually repeating the cleaning process six times could introduce errors into the dataset, so I decided instead to combine everything into a single CSV file and clean it with Python.
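As a sketch of that merge step, a short script using only the standard csv module can concatenate several exports that share a header, writing the header just once. (The function name and file names below are my own illustration, not the actual scripts or export names.)

```python
import csv

def combine_csvs(paths, out_path):
    """Concatenate CSV files that share a header into one file,
    writing the header only once (from the first file)."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        header_written = False
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)  # every file repeats the header row
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)  # stream the data rows through
```

Called with the six exported files in order, e.g. `combine_csvs(["q1.csv", "q2a.csv", ...], "trips_2019.csv")`, this produces one CSV with a single header row.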

Cleaning with Python
Python's CSV module is my friend. Using Python scripts, I was able to clean and manipulate the data in the following ways:
  • Combined the quarterly datasets into a single CSV file.
  • Standardized column names.
  • Removed the superfluous BikeID column.
  • Trimmed stray whitespace from every field with Python's strip() method.
  • Converted data stored in the trip_length column from strings to integers.
  • Converted data stored in the birthyear column from floats to integers.
  • Created a new column named day_of_week by deriving the weekday name from the start_time column.
  • Created a new column that calculated trip duration from the start_time and end_time columns, then compared it to the existing trip_length column to check for inconsistencies (happily, there were none).
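The two derived-column steps at the end of that list can be sketched roughly as follows, assuming timestamps in the form 2019-03-19 10:45:00 (the real export format may differ):

```python
from datetime import datetime

def derive_fields(row):
    """Given a row dict with start_time/end_time strings, return the
    weekday name and the trip duration in whole seconds.
    Assumes '%Y-%m-%d %H:%M:%S' timestamps -- an illustrative guess,
    not necessarily the format of the actual export."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(row["start_time"], fmt)
    end = datetime.strptime(row["end_time"], fmt)
    day_of_week = start.strftime("%A")                    # e.g. 'Tuesday'
    duration_seconds = int((end - start).total_seconds())  # compare to trip_length
    return day_of_week, duration_seconds
```

Comparing `duration_seconds` against the existing trip_length value for every row is what confirmed the two columns agreed.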

The Python scripts I used were created with the help of Claude AI. They can be found here.

One challenge that arose during data cleaning was that all trip lengths over 1,000 seconds had been exported as quoted strings (e.g. "2,102") so that their grouping commas would not be read as delimiters. Left as strings, the trip_length field could not be used in calculations. Further complicating matters, Python's CSV module strips the protective quotes but leaves the comma inside the value, and int() cannot parse a string like "2,102", so my initial conversion scripts failed. Claude AI was able to write code to overcome this obstacle, but the episode highlighted a gap in my knowledge that I will need to return to later.
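In other words, csv.reader handles the quoting correctly and returns the value as the string "2,102"; the grouping comma then has to be stripped before int() can convert it. A minimal illustration:

```python
import csv
import io

# A quoted field protects its comma from being read as a delimiter;
# csv.reader un-quotes it and returns the string '2,102' intact.
sample = 'trip_id,trip_length\n22082928,"2,102"\n'
reader = csv.reader(io.StringIO(sample))
next(reader)        # skip the header row
row = next(reader)  # ['22082928', '2,102']

# int('2,102') would raise ValueError, so remove the commas first.
trip_length = int(row[1].replace(",", ""))
```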

Sample Rows

Here is a sample from my cleaned table (trip_length is in seconds). You can also download it as a CSV.

| trip_id | start_time | end_time | trip_length | day_of_week | from_station_id | from_station_name | to_station_id | to_station_name | usertype | gender | birthyear |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 22082920 | 19/03/2019 | 19/03/2019 | 472 | Tuesday | 654 | Racine Ave & Washington Blvd (*) | 174 | Canal St & Madison St | Subscriber | Female | 1970 |
| 22082921 | 19/03/2019 | 19/03/2019 | 560 | Tuesday | 638 | Clinton St & Jackson Blvd (*) | 212 | Wells St & Hubbard St | Subscriber | Male | 1981 |
| 22082922 | 19/03/2019 | 19/03/2019 | 206 | Tuesday | 77 | Clinton St & Madison St | 18 | Wacker Dr & Washington St | Subscriber | Male | 1962 |
| 22082923 | 19/03/2019 | 19/03/2019 | 182 | Tuesday | 99 | Lake Shore Dr & Ohio St | 635 | Fairbanks St & Superior St (*) | Subscriber | Female | 1985 |
| 22082924 | 19/03/2019 | 19/03/2019 | 105 | Tuesday | 302 | Sheffield Ave & Wrightwood Ave | 67 | Sheffield Ave & Fullerton Ave | Subscriber | Male | 1984 |
| 22082925 | 19/03/2019 | 19/03/2019 | 373 | Tuesday | 192 | Canal St & Adams St | 71 | Morgan St & Lake St | Subscriber | Female | 1989 |
| 22082926 | 19/03/2019 | 19/03/2019 | 350 | Tuesday | 247 | Shore Dr & 55th St | 425 | Harper Ave & 59th St | Subscriber | Female | 1989 |
| 22082927 | 19/03/2019 | 19/03/2019 | 632 | Tuesday | 174 | Canal St & Madison St | 283 | LaSalle St & Jackson Blvd | Subscriber | Male | 1963 |
| 22082928 | 19/03/2019 | 19/03/2019 | 2102 | Tuesday | 201 | Indiana Ave & 40th St | 174 | Canal St & Madison St | Subscriber | Male | 1984 |
| 22082929 | 19/03/2019 | 19/03/2019 | 164 | Tuesday | 402 | Shields Ave & 31st St | 403 | Wentworth Ave & 33rd St | Subscriber | Male | 1991 |

Data Integrity Questions

While cleaning and exploring the data, I was able to answer some questions about the data's integrity, but a few questions remained.

  • The number of trip starts matched the number of trip ends. This is how it should be, but it was satisfying to confirm.

  • Every trip has a value recorded for usertype, from_station_name, to_station_name, start_time, and end_time. There are zero null entries for any of these fields.

  • There were no trips under one minute. Why not? There were, however, 28,899 trips under two minutes (26,702 by subscribers and 2,197 by casual riders).

  • The dataset includes 1,848 trips over 24 hours, and even 418 trips longer than a week! The longest trip in the dataset is 123 days. Presumably some of these trips are errors, although there is no objective way to draw a precise line between a "real trip" and an error based on trip length alone.

    • I've decided that it is easier and safer to deal with outliers by filtering them out of my queries rather than deleting their entries from the dataset.​

  • Gender and birthyear were recorded for 99.8% of subscriber trips. This makes sense, because that data is collected when a membership is created. However, gender and birthyear were also recorded for 39.5% of casual rider trips. How would Cyclistic have gathered this data? And why couldn't it collect the data for the remaining 60.5%?
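As a sketch of the filter-rather-than-delete approach above, a summary query can simply skip outliers while leaving the file untouched. (The function name and the 24-hour cutoff are my illustration; any analysis could choose its own threshold.)

```python
import csv

DAY_SECONDS = 24 * 60 * 60  # 86,400 seconds

def mean_trip_length(path, max_seconds=DAY_SECONDS):
    """Average trip_length from the cleaned CSV, ignoring trips longer
    than max_seconds instead of deleting them from the dataset."""
    total = count = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            length = int(row["trip_length"])
            if length <= max_seconds:  # filter outliers at query time
                total += length
                count += 1
    return total / count if count else 0.0
```

Because the outliers stay in the file, a later query can always revisit them with a different cutoff.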

Go To Analysis.
