Data Analysis for Cyclistic

A Google Coursera Data Analytics Capstone Project

Analysis

With my dataset containing 3,818,004 records and twelve fields, I decided to create a Postgres database in pgAdmin4 so I could perform my analysis with SQL.

The code to create my database can be viewed here.

The goal of my analysis was to find ways that annual members ('Subscribers') and casual riders ('Customers') use Cyclistic bikes similarly and differently. I looked at seven variables:

Trip lengths
Popular days of the week
Months/Seasons
Time of day of trips
Starting and ending stations
User age
User gender

I began with some simple queries to aggregate the data. First, I wanted to find the mean trip lengths and the absolute trip counts for each user type.

/* Mean trip length of each usertype */

SELECT
usertype,
(floor(avg(trip_length) / 86400)) || 'd ' ||
(floor((avg(trip_length) % 86400) / 3600)) || 'h ' ||
(floor((avg(trip_length) % 3600) / 60)) || 'm ' ||
(floor(avg(trip_length) % 60)) || 's ' AS avg_trip
FROM
bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;

/* Count trips taken by each usertype */
SELECT

usertype,

COUNT(trip_id)
FROM

bikeshare_data
WHERE trip_length < 86400 /* exclude 24hr+ trips */
GROUP BY usertype;

These queries return the following:

Mean trip length:

Annual Members: 12m 55s

Casual Riders: 39m 25s

Total trips taken:

Annual Members: 2,936,866

Casual Riders: 879,289

So while casual riders take longer trips than annual members (over 3x as long on average), annual members as a category take over 3x as many trips.

Next, I wanted to find which days were most popular with which types of users.

SELECT usertype, day_of_week, COUNT(trip_id) AS num_of_trips
FROM bikeshare_data
GROUP BY usertype, day_of_week
ORDER BY num_of_trips DESC;

This query returns the following:

usertype	day_of_week	num_of_trips
Subscriber	Tuesday	497025
Subscriber	Wednesday	494277
Subscriber	Thursday	486915
Subscriber	Monday	458780
Subscriber	Friday	456966
Subscriber	Saturday	287163
Subscriber	Sunday	256241
Customer	Saturday	208056
Customer	Sunday	170179
Customer	Friday	121141
Customer	Monday	101489
Customer	Thursday	101372
Customer	Wednesday	89745
Customer	Tuesday	88655

The results make it clear that annual members typically use Cyclistic bikes during the week, while casual riders take most of their rides on the weekends. This suggests that annual members disproportionately use Cyclistic bikes for their work commute compared to casual riders.

I also built a visualization in Tableau to see how annual and casual trip rates changed with the seasons.

The summer and early fall months are the most popular for both annual members and casual riders. But during the winter months, Cyclistic's bikes are used almost exclusively by their annual members.

Next, I looked at what time of day users started their trips. You can view the SQL code for this query on GitHub.

This query finds when a given type of user is most likely to be using a Cyclistic bike.

Annual members start

23.3% of their trips before 9:00am,

31.2% of their trips between 9:00am and 4:00pm,

45.5% of their trips after 4:00pm.

Casual riders start

7.2% of their trips before 9:00am,

49.6% of their trips between 9:00am and 4:00pm,

43.2% of their trips after 4:00pm.

While both annual members and causal riders take many evening trips, casual members take a greater percentage of their trips during midday, and annual members take a much greater percentage of their trips in the morning.

Next, I wanted to find the bike stations that were popular with both annual members and casual users. One way to approach this question was by finding the station that had the smallest difference in subscriber and casual users. I used this code to do this.

This query returns the following. The total_count column is the sum of the count of annual member and casual rider trips that began at that station, and the rider_count_difference is the difference between those two counts (a smaller number means the two rider types used that station a similar number of times).

station_name	total_count	rider_count_difference
Central Park Ave & Bloomingdale Ave	2072	78
Calumet Ave & 21st St	3973	119
Michigan Ave & Pearson St	10897	231
Lincoln Park Conservatory	3390	328
Lake Park Ave & 35th St	2446	376
Michigan Ave & Ida B Wells Dr	8496	384
Fort Dearborn Dr & 31st St	7425	443
Lakefront Trail & Bryn Mawr Ave	9659	703
Buckingham Fountain	6261	971
Cannon Dr & Fullerton Ave	6135	1079
Lakefront Trail & Wilson Ave	5924	1144
Wentworth Ave & Cermak Rd (Temp)	3740	1200
Broadway & Granville Ave	4406	1212
Marine Dr & Ainslie St	5172	1256
Walsh Park	3498	1344
Lake Park Ave & 53rd St	4161	1403
Sheridan Rd & Loyola Ave	3744	1474
Museum of Science and Industry	3664	1608
Broadway & Thorndale Ave	4393	1651
Lake Shore Dr & Ohio St	14442	1684

The above table includes the twenty stations with the smallest difference between the number of annual members and the number of casual users who start their rides at that station, so long as the station meets the 1,000 ride minimum threshold.

Since my team will be creating a marketing strategy targeting casual riders, I am especially interested in finding which stations they use the most and how many annual members also use those stations. This query finds that.

station_name	total_casual_uses	total_annual_member_uses
Streeter Dr & Grand Ave	106208	29758
Lake Shore Dr & Monroe St	78476	21132
Millennium Park	43498	24714
Michigan Ave & Oak St	42776	28122
Shedd Aquarium	41234	11630
Lake Shore Dr & North Blvd	37904	31040
Theater on the Lake	30054	33948
Dusable Harbor	25092	9214
Michigan Ave & Washington St	24456	50936
Adler Planetarium	23856	9614
Michigan Ave & 8th St	19050	10538
Montrose Harbor	16376	10810
Indiana Ave & Roosevelt Rd	16058	25844
Columbus Dr & Randolph St	15644	62740
Field Museum	15308	4804
McClurg Ct & Illinois St	13954	26172
Michigan Ave & Jackson Blvd	13834	17248
Clark St & Lincoln Ave	13430	25900
Clark St & Armitage Ave	13216	29762
Lake Shore Dr & Ohio St	12758	16126

I also wanted to know how important it was to target certain stations. I wrote queries to find what percentage of all casual trips start at the twenty most popular starting stations. I found that 34.3% of all casual trips began at one of the twenty most popular stations, despite those stations being only 3.1% of all 640 stations.

Finally, I took a look at age and gender differences between the user types. The validity of this step of the analysis depends on the reliability of the age and gender data for casual riders, and I am less confident in the reliability of this data than for the other variables. One reason for this is that although we know the gender for 99.23% of annual members, we only know the gender for 39.08% of casual riders, and I can't rule out that there was a selection bias when the data was gathered.

Nevertheless, when we query the age and gender data, we get some interesting results. For example, when we calculate the mean age and mode age for each user type, we find the following:

Mean Age

Annual Members: 35

Casual Riders: 31

Mode Age

Annual Members: 27

Casual Riders: 25

Annual members trend older than casual riders. This is true both across their populations (mean) and for their most frequent users (mode).

We can also find the gender breakdown of each user type:

25% of annual member trips are taken by women while 75% are taken by men.

38% of casual riders trips are taken by women while 62% are taken by men.

While men are overrepresented in both categories, annual member trips are even more likely to be taken by men than casual trips are.

Unfortunately, I cannot find the user type breakdown of each gender. This is because I have gender data for nearly all member trips but less than half of casual trips, and the results would be biased towards a larger membership count as a result.

It is also important to remember that the records in the dataset represent individual trips, not individual customers. The data only shows how many trips were taken by users of the respective genders, it does not amount to a breakdown of the demographic makeup up the customer population. (For all we know given our data, the customer population could be an even split between men and women, but the men take many more trips than the women.)

Go To Key Findings.