Data Analysis for Cyclistic
A Google Coursera Data Analytics Capstone Project
04
Analysis
With my dataset containing 3,818,004 records and twelve fields, I decided to create a Postgres database in pgAdmin4 so I could perform my analysis with SQL.
The code to create my database can be viewed here.
The goal of my analysis was to find ways that annual members ('Subscribers') and casual riders ('Customers') use Cyclistic bikes similarly and differently. I looked at seven variables:
-
Trip lengths
-
Popular days of the week
-
Months/Seasons
-
Time of day of trips
-
Starting and ending stations
-
User age
-
User gender
I began with some simple queries to aggregate the data. First, I wanted to find the mean trip lengths and the absolute trip counts for each user type.
/* Mean trip length of each usertype */
SELECT
usertype,
(floor(avg(trip_length) / 86400)) || 'd ' ||
(floor((avg(trip_length) % 86400) / 3600)) || 'h ' ||
(floor((avg(trip_length) % 3600) / 60)) || 'm ' ||
(floor(avg(trip_length) % 60)) || 's ' AS avg_trip
FROM
bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;
/* Count trips taken by each usertype */
SELECT
usertype,
COUNT(trip_id)
FROM
bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;
These queries return the following:
Mean trip length:
Annual Members: 12m 55s
Casual Riders: 39m 25s
Total trips taken:
Annual Members: 2,936,866
Casual Riders: 879,289
So while casual riders take longer trips than annual members (over 3x as long on average), annual members as a category take over 3x as many trips.
Next, I wanted to find which days were most popular with which types of users.
SELECT usertype, day_of_week, COUNT(trip_id) AS num_of_trips
FROM bikeshare_data
GROUP BY usertype, day_of_week
ORDER BY num_of_trips DESC;
This query returns the following:
usertype | day_of_week | num_of_trips |
|---|---|---|
Subscriber | Tuesday | 497025 |
Subscriber | Wednesday | 494277 |
Subscriber | Thursday | 486915 |
Subscriber | Monday | 458780 |
Subscriber | Friday | 456966 |
Subscriber | Saturday | 287163 |
Subscriber | Sunday | 256241 |
Customer | Saturday | 208056 |
Customer | Sunday | 170179 |
Customer | Friday | 121141 |
Customer | Monday | 101489 |
Customer | Thursday | 101372 |
Customer | Wednesday | 89745 |
Customer | Tuesday | 88655 |
The results make it clear that annual members typically use Cyclistic bikes during the week, while casual riders take most of their rides on the weekends. This suggests that annual members disproportionately use Cyclistic bikes for their work commute compared to casual riders.
I also built a visualization in Tableau to see how annual and casual trip rates changed with the seasons.

The summer and early fall months are the most popular for both annual members and casual riders. But during the winter months, Cyclistic's bikes are used almost exclusively by their annual members.
Next, I looked at what time of day users started their trips. You can view the SQL code for this query on GitHub.
This query finds when a given type of user is most likely to be using a Cyclistic bike.
Annual members start
23.3% of their trips before 9:00am,
31.2% of their trips between 9:00am and 4:00pm,
45.5% of their trips after 4:00pm.
Casual riders start
7.2% of their trips before 9:00am,
49.6% of their trips between 9:00am and 4:00pm,
43.2% of their trips after 4:00pm.
While both annual members and causal riders take many evening trips, casual members take a greater percentage of their trips during midday, and annual members take a much greater percentage of their trips in the morning.
Next, I wanted to find the bike stations that were popular with both annual members and casual users. One way to approach this question was by finding the station that had the smallest difference in subscriber and casual users. I used this code to do this.
This query returns the following. The total_count column is the sum of the count of annual member and casual rider trips that began at that station, and the rider_count_difference is the difference between those two counts (a smaller number means the two rider types used that station a similar number of times).
station_name | total_count | rider_count_difference |
|---|---|---|
Central Park Ave & Bloomingdale Ave | 2072 | 78 |
Calumet Ave & 21st St | 3973 | 119 |
Michigan Ave & Pearson St | 10897 | 231 |
Lincoln Park Conservatory | 3390 | 328 |
Lake Park Ave & 35th St | 2446 | 376 |
Michigan Ave & Ida B Wells Dr | 8496 | 384 |
Fort Dearborn Dr & 31st St | 7425 | 443 |
Lakefront Trail & Bryn Mawr Ave | 9659 | 703 |
Buckingham Fountain | 6261 | 971 |
Cannon Dr & Fullerton Ave | 6135 | 1079 |
Lakefront Trail & Wilson Ave | 5924 | 1144 |
Wentworth Ave & Cermak Rd (Temp) | 3740 | 1200 |
Broadway & Granville Ave | 4406 | 1212 |
Marine Dr & Ainslie St | 5172 | 1256 |
Walsh Park | 3498 | 1344 |
Lake Park Ave & 53rd St | 4161 | 1403 |
Sheridan Rd & Loyola Ave | 3744 | 1474 |
Museum of Science and Industry | 3664 | 1608 |
Broadway & Thorndale Ave | 4393 | 1651 |
Lake Shore Dr & Ohio St | 14442 | 1684 |
The above table includes the twenty stations with the smallest difference between the number of annual members and the number of casual users who start their rides at that station, so long as the station meets the 1,000 ride minimum threshold.
Since my team will be creating a marketing strategy targeting casual riders, I am especially interested in finding which stations they use the most and how many annual members also use those stations. This query finds that.
station_name | total_casual_uses | total_annual_member_uses |
|---|---|---|
Streeter Dr & Grand Ave | 106208 | 29758 |
Lake Shore Dr & Monroe St | 78476 | 21132 |
Millennium Park | 43498 | 24714 |
Michigan Ave & Oak St | 42776 | 28122 |
Shedd Aquarium | 41234 | 11630 |
Lake Shore Dr & North Blvd | 37904 | 31040 |
Theater on the Lake | 30054 | 33948 |
Dusable Harbor | 25092 | 9214 |
Michigan Ave & Washington St | 24456 | 50936 |
Adler Planetarium | 23856 | 9614 |
Michigan Ave & 8th St | 19050 | 10538 |
Montrose Harbor | 16376 | 10810 |
Indiana Ave & Roosevelt Rd | 16058 | 25844 |
Columbus Dr & Randolph St | 15644 | 62740 |
Field Museum | 15308 | 4804 |
McClurg Ct & Illinois St | 13954 | 26172 |
Michigan Ave & Jackson Blvd | 13834 | 17248 |
Clark St & Lincoln Ave | 13430 | 25900 |
Clark St & Armitage Ave | 13216 | 29762 |
Lake Shore Dr & Ohio St | 12758 | 16126 |
I also wanted to know how important it was to target certain stations. I wrote queries to find what percentage of all casual trips start at the twenty most popular starting stations. I found that 34.3% of all casual trips began at one of the twenty most popular stations, despite those stations being only 3.1% of all 640 stations.
Finally, I took a look at age and gender differences between the user types. The validity of this step of the analysis depends on the reliability of the age and gender data for casual riders, and I am less confident in the reliability of this data than for the other variables. One reason for this is that although we know the gender for 99.23% of annual members, we only know the gender for 39.08% of casual riders, and I can't rule out that there was a selection bias when the data was gathered.
Nevertheless, when we query the age and gender data, we get some interesting results. For example, when we calculate the mean age and mode age for each user type, we find the following:
Mean Age
Annual Members: 35
Casual Riders: 31
Mode Age
Annual Members: 27
Casual Riders: 25
Annual members trend older than casual riders. This is true both across their populations (mean) and for their most frequent users (mode).
We can also find the gender breakdown of each user type:
25% of annual member trips are taken by women while 75% are taken by men.
38% of casual riders trips are taken by women while 62% are taken by men.
While men are overrepresented in both categories, annual member trips are even more likely to be taken by men than casual trips are.
Unfortunately, I cannot find the user type breakdown of each gender. This is because I have gender data for nearly all member trips but less than half of casual trips, and the results would be biased towards a larger membership count as a result.
It is also important to remember that the records in the dataset represent individual trips, not individual customers. The data only shows how many trips were taken by users of the respective genders, it does not amount to a breakdown of the demographic makeup up the customer population. (For all we know given our data, the customer population could be an even split between men and women, but the men take many more trips than the women.)
Go To Key Findings.
