top of page

Data Analysis for Cyclistic 

 

A Google Coursera Data Analytics Capstone Project

04

Analysis

With my dataset containing 3,818,004 records and twelve fields, I decided to create a Postgres database in pgAdmin4 so I could perform my analysis with SQL.

The code to create my database can be viewed here.

The goal of my analysis was to find ways that annual members ('Subscribers') and casual riders ('Customers') use Cyclistic bikes similarly and differently. I looked at seven variables:

  1. Trip lengths

  2. Popular days of the week

  3. Months/Seasons

  4. Time of day of trips

  5. Starting and ending stations

  6. User age

  7. User gender

I began with some simple queries to aggregate the data. First, I wanted to find the mean trip lengths and the absolute trip counts for each user type.

/* Mean trip length of each usertype */

SELECT
    usertype,
    (floor(avg(trip_length) / 86400)) || 'd ' ||
    (floor((avg(trip_length) % 86400) / 3600)) || 'h ' ||
    (floor((avg(trip_length) % 3600) / 60)) || 'm ' ||
    (floor(avg(trip_length) % 60)) || 's ' AS avg_trip
FROM
    bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;

/* Count trips taken by each usertype */
SELECT

    usertype,

    COUNT(trip_id)
FROM

    bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;

These queries return the following:

Mean trip length:

Annual Members: 12m 55s

Casual Riders: 39m 25s

Total trips taken:

Annual Members: 2,936,866

Casual Riders: 879,289

So while casual riders take longer trips than annual members (over 3x as long on average), annual members as a category take over 3x as many trips.

Next, I wanted to find which days were most popular with which types of users.

SELECT usertype, day_of_week, COUNT(trip_id) AS num_of_trips
FROM bikeshare_data
GROUP BY usertype, day_of_week
ORDER BY num_of_trips DESC;

This query returns the following:

usertype
day_of_week
num_of_trips
Subscriber
Tuesday
497025
Subscriber
Wednesday
494277
Subscriber
Thursday
486915
Subscriber
Monday
458780
Subscriber
Friday
456966
Subscriber
Saturday
287163
Subscriber
Sunday
256241
Customer
Saturday
208056
Customer
Sunday
170179
Customer
Friday
121141
Customer
Monday
101489
Customer
Thursday
101372
Customer
Wednesday
89745
Customer
Tuesday
88655

The results make it clear that annual members typically use Cyclistic bikes during the week, while casual riders take most of their rides on the weekends. This suggests that annual members disproportionately use Cyclistic bikes for their work commute compared to casual riders.

I also built a visualization in Tableau to see how annual and casual trip rates changed with the seasons.

Cyclistic_months_by_usertype.png

The summer and early fall months are the most popular for both annual members and casual riders. But during the winter months, Cyclistic's bikes are used almost exclusively by their annual members.

Next, I looked at what time of day users started their trips. You can view the SQL code for this query on GitHub.

This query finds when a given type of user is most likely to be using a Cyclistic bike.

Annual members start

23.3% of their trips before 9:00am,

31.2% of their trips between 9:00am and 4:00pm,

45.5% of their trips after 4:00pm.

Casual riders start

7.2% of their trips before 9:00am,

49.6% of their trips between 9:00am and 4:00pm,

43.2% of their trips after 4:00pm.

While both annual members and causal riders take many evening trips, casual members take a greater percentage of their trips during midday, and annual members take a much greater percentage of their trips in the morning.

Next, I wanted to find the bike stations that were popular with both annual members and casual users. One way to approach this question was by finding the station that had the smallest difference in subscriber and casual users. I used this code to do this.

This query returns the following. The total_count column is the sum of the count of annual member and casual rider trips that began at that station, and the rider_count_difference is the difference between those two counts (a smaller number means the two rider types used that station a similar number of times). 

station_name
total_count
rider_count_difference
Central Park Ave & Bloomingdale Ave
2072
78
Calumet Ave & 21st St
3973
119
Michigan Ave & Pearson St
10897
231
Lincoln Park Conservatory
3390
328
Lake Park Ave & 35th St
2446
376
Michigan Ave & Ida B Wells Dr
8496
384
Fort Dearborn Dr & 31st St
7425
443
Lakefront Trail & Bryn Mawr Ave
9659
703
Buckingham Fountain
6261
971
Cannon Dr & Fullerton Ave
6135
1079
Lakefront Trail & Wilson Ave
5924
1144
Wentworth Ave & Cermak Rd (Temp)
3740
1200
Broadway & Granville Ave
4406
1212
Marine Dr & Ainslie St
5172
1256
Walsh Park
3498
1344
Lake Park Ave & 53rd St
4161
1403
Sheridan Rd & Loyola Ave
3744
1474
Museum of Science and Industry
3664
1608
Broadway & Thorndale Ave
4393
1651
Lake Shore Dr & Ohio St
14442
1684

The above table includes the twenty stations with the smallest difference between the number of annual members and the number of casual users who start their rides at that station, so long as the station meets the 1,000 ride minimum threshold.

Since my team will be creating a marketing strategy targeting casual riders, I am especially interested in finding which stations they use the most and how many annual members also use those stations. This query finds that.

station_name
total_casual_uses
total_annual_member_uses
Streeter Dr & Grand Ave
106208
29758
Lake Shore Dr & Monroe St
78476
21132
Millennium Park
43498
24714
Michigan Ave & Oak St
42776
28122
Shedd Aquarium
41234
11630
Lake Shore Dr & North Blvd
37904
31040
Theater on the Lake
30054
33948
Dusable Harbor
25092
9214
Michigan Ave & Washington St
24456
50936
Adler Planetarium
23856
9614
Michigan Ave & 8th St
19050
10538
Montrose Harbor
16376
10810
Indiana Ave & Roosevelt Rd
16058
25844
Columbus Dr & Randolph St
15644
62740
Field Museum
15308
4804
McClurg Ct & Illinois St
13954
26172
Michigan Ave & Jackson Blvd
13834
17248
Clark St & Lincoln Ave
13430
25900
Clark St & Armitage Ave
13216
29762
Lake Shore Dr & Ohio St
12758
16126

I also wanted to know how important it was to target certain stations. I wrote queries to find what percentage of all casual trips start at the twenty most popular starting stations. I found that 34.3% of all casual trips began at one of the twenty most popular stations, despite those stations being only 3.1% of all 640 stations. 

Finally, I took a look at age and gender differences between the user types. The validity of this step of the analysis depends on the reliability of the age and gender data for casual riders, and I am less confident in the reliability of this data than for the other variables. One reason for this is that although we know the gender for 99.23% of annual members, we only know the gender for 39.08% of casual riders, and I can't rule out that there was a selection bias when the data was gathered.

Nevertheless, when we query the age and gender data, we get some interesting results. For example, when we calculate the mean age and mode age for each user type, we find the following:

Mean Age

Annual Members: 35

Casual Riders: 31

Mode Age

Annual Members: 27

Casual Riders: 25

Annual members trend older than casual riders. This is true both across their populations (mean) and for their most frequent users (mode). 

We can also find the gender breakdown of each user type:

25% of annual member trips are taken by women while 75% are taken by men.

38% of casual riders trips are taken by women while 62% are taken by men.

While men are overrepresented in both categories, annual member trips are even more likely to be taken by men than casual trips are.

Unfortunately, I cannot find the user type breakdown of each gender. This is because I have gender data for nearly all member trips but less than half of casual trips, and the results would be biased towards a larger membership count as a result. 

It is also important to remember that the records in the dataset represent individual trips, not individual customers. The data only shows how many trips were taken by users of the respective genders, it does not amount to a breakdown of the demographic makeup up the customer population. (For all we know given our data, the customer population could be an even split between men and women, but the men take many more trips than the women.)

Go To Key Findings.

bottom of page