top of page

Data Analysis for Cyclistic 

 

A Google Coursera Data Analytics Capstone Project

04

Analysis

With my dataset containing 3,818,004 records and twelve fields, I decided to create a Postgres database in pgAdmin4 so I could perform my analysis with SQL.

The code to create my database can be viewed here.

The goal of my analysis was to find ways that annual members ('Subscribers') and casual riders ('Customers') use Cyclistic bikes similarly and differently. I looked at seven variables:

  1. Trip lengths

  2. Popular days of the week

  3. Months/Seasons

  4. Time of day of trips

  5. Starting and ending stations

  6. User age

  7. User gender

I began with some simple queries to aggregate the data. First, I wanted to find the mean trip lengths and the absolute trip counts for each user type.

/* Mean trip length of each usertype */

SELECT
    usertype,
    (floor(avg(trip_length) / 86400)) || 'd ' ||
    (floor((avg(trip_length) % 86400) / 3600)) || 'h ' ||
    (floor((avg(trip_length) % 3600) / 60)) || 'm ' ||
    (floor(avg(trip_length) % 60)) || 's ' AS avg_trip
FROM
    bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;

/* Count trips taken by each usertype */
SELECT

    usertype,

    COUNT(trip_id)
FROM

    bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;

These queries return the following:

Mean trip length:

Annual Members: 12m 55s

Casual Riders: 39m 25s

Total trips taken:

Annual Members: 2,936,866

Casual Riders: 879,289

So while casual riders take longer trips than annual members (over 3x as long on average), annual members as a category take over 3x as many trips.

Next, I wanted to find which days were most popular with which types of users.

SELECT usertype, day_of_week, COUNT(trip_id) AS num_of_trips
FROM bikeshare_data
GROUP BY usertype, day_of_week
ORDER BY num_of_trips DESC;

This query returns the following:

usertype
day_of_week
num_of_trips
Customer
Wednesday
89745
Customer
Tuesday
88655

The results make it clear that annual members typically use Cyclistic bikes during the week, while casual riders take most of their rides on the weekends. This suggests that annual members disproportionately use Cyclistic bikes for their work commute compared to casual riders.

I also built a visualization in Tableau to see how annual and casual trip rates changed with the seasons.

Cyclistic_months_by_usertype.png

The summer and early fall months are the most popular for both annual members and casual riders. But during the winter months, Cyclistic's bikes are used almost exclusively by their annual members.

Next, I looked at what time of day users started their trips. You can view the SQL code for this query on GitHub.

This query finds when a given type of user is most likely to be using a Cyclistic bike.

Annual members start

23.3% of their trips before 9:00am,

31.2% of their trips between 9:00am and 4:00pm,

45.5% of their trips after 4:00pm.

Casual riders start

7.2% of their trips before 9:00am,

49.6% of their trips between 9:00am and 4:00pm,

43.2% of their trips after 4:00pm.

While both annual members and causal riders take many evening trips, casual members take a greater percentage of their trips during midday, and annual members take a much greater percentage of their trips in the morning.

Next, I wanted to find the bike stations that were popular with both annual members and casual users. One way to approach this question was by finding the station that had the smallest difference in subscriber and casual users. I used this code to do this.

This query returns the following. The total_count column is the sum of the count of annual member and casual rider trips that began at that station, and the rider_count_difference is the difference between those two counts (a smaller number means the two rider types used that station a similar number of times). 

station_name
total_count
rider_count_difference
Broadway & Granville Ave
4406
1212
Marine Dr & Ainslie St
5172
1256
Walsh Park
3498
1344
Lake Park Ave & 53rd St
4161
1403
Sheridan Rd & Loyola Ave
3744
1474
Museum of Science and Industry
3664
1608
Broadway & Thorndale Ave
4393
1651
Lake Shore Dr & Ohio St
14442
1684

The above table includes the twenty stations with the smallest difference between the number of annual members and the number of casual users who start their rides at that station, so long as the station meets the 1,000 ride minimum threshold.

Since my team will be creating a marketing strategy targeting casual riders, I am especially interested in finding which stations they use the most and how many annual members also use those stations. This query finds that.

station_name
total_casual_uses
total_annual_member_uses
Indiana Ave & Roosevelt Rd
16058
25844
Columbus Dr & Randolph St
15644
62740
Field Museum
15308
4804
McClurg Ct & Illinois St
13954
26172
Michigan Ave & Jackson Blvd
13834
17248
Clark St & Lincoln Ave
13430
25900
Clark St & Armitage Ave
13216
29762
Lake Shore Dr & Ohio St
12758
16126

I also wanted to know how important it was to target certain stations. I wrote queries to find what percentage of all casual trips start at the twenty most popular starting stations. I found that 34.3% of all casual trips began at one of the twenty most popular stations, despite those stations being only 3.1% of all 640 stations. 

Finally, I took a look at age and gender differences between the user types. The validity of this step of the analysis depends on the reliability of the age and gender data for casual riders, and I am less confident in the reliability of this data than for the other variables. One reason for this is that although we know the gender for 99.23% of annual members, we only know the gender for 39.08% of casual riders, and I can't rule out that there was a selection bias when the data was gathered.

Nevertheless, when we query the age and gender data, we get some interesting results. For example, when we calculate the mean age and mode age for each user type, we find the following:

Mean Age

Annual Members: 35

Casual Riders: 31

Mode Age

Annual Members: 27

Casual Riders: 25

Annual members trend older than casual riders. This is true both across their populations (mean) and for their most frequent users (mode). 

We can also find the gender breakdown of each user type:

25% of annual member trips are taken by women while 75% are taken by men.

38% of casual riders trips are taken by women while 62% are taken by men.

While men are overrepresented in both categories, annual member trips are even more likely to be taken by men than casual trips are.

Unfortunately, I cannot find the user type breakdown of each gender. This is because I have gender data for nearly all member trips but less than half of casual trips, and the results would be biased towards a larger membership count as a result. 

It is also important to remember that the records in the dataset represent individual trips, not individual customers. The data only shows how many trips were taken by users of the respective genders, it does not amount to a breakdown of the demographic makeup up the customer population. (For all we know given our data, the customer population could be an even split between men and women, but the men take many more trips than the women.)

Go To Key Findings.

bottom of page