Note: This blog post is my thorough output of the Cyclistic Marketing Case Study as a part of Google Capstone Project. I already did this case study before (check here), and that time, I used Google Sheets only. Now, I tried to incorporate it by using SQL and any other data analytic tools to make it more comprehensive.
Moreover, this post may contain affiliate links that at no additional cost to you, I may earn a small commission. Thank you for your support.
Business Tasks
Cyclistic Company defined casual riders are those customers who purchase either single-ride passes or full-day ride passes, while they defined Cyclistic members are those customers who purchased annual memberships. Concluded by financial analysts, the company has identified that its Cyclistic members contribute more significantly to the company's profits than casual riders. Maam Lily Moreno, a marketing director and also a stakeholder in this project, has decided to shift the focus of the marketing strategy away from acquiring new customers. Instead, the goal is to convert casual riders into Cyclistic members. As part of the company’s marketing team, she seeks to understand the differences between casual riders and Cyclistic members and how these differences can impact the design of the new marketing campaign.
Enroll Now (Financial Aid Available): Google Data Analytics Professional Certificate
Related Article: Google Data Analytics Certification Review -Coursera Journey
Description of Data Sources Used
The company provides a historical trip dataset that is every month, starting from December 2022 to November 2023. You can verify the Cyclistic trip data here. The data is organized by Motivate International Inc. which has a data license agreement. Listed below are the 12 .csv files that are used to analyze the data.
- 202212-divvy-tripdata
- 202301-divvy-tripdata
- 202302-divvy-tripdata
- 202303-divvy-tripdata
- 202304-divvy-tripdata
- 202305-divvy-tripdata
- 202306-divvy-tripdata
- 202307-divvy-tripdata
- 202308-divvy-tripdata
- 202309-divvy-tripdata
- 202310-divvy-tripdata
- 202311-divvy-tripdata
Every file contains trip data and is structured in a long form. Part of the data license agreement is to prohibit the use of riders’ personally identifiable information for data privacy so, it cannot access the credit card numbers of every rider. The following are the columns that are indicated in the 12 CSV files.
- ride_id
- rideable_type
- started_at
- ended_at
- start_station_name
- start_station_id
- end_station_name
- end_station_id
- start_lat
- start_lng
- end_lat
- end_lng
- member_casual
The data sources maintain ROCCC.
- Reliable- The datasets contain accurate and unbiased information on Cyclistic’s transactions every day.
- Original- The datasets is public and the source is based on the transactions in Chicago and has a license to use the data.
- Comprehensive- There is a column that differentiates between casual riders and members riders.
- Current- The latest public dataset available is December 2023 and the scope of this analysis one year.
- Cited- The data is filed by Motivate International Inc. which has a data license agreement. Allow us to use the datasets in public.
Even though data about the station name and ID are not fully complete and mostly blanks, it does not affect getting information about the difference between the two. Hence, the remaining data is enough to achieve our business task and have a great contribution to making a new plan for the marketing campaign.
Documentation of Data Cleaning and Data Manipulation
Microsoft Excel: Initial Data Cleaning and Manipulation
As part of the Prepare phase, I downloaded all the 12 zip files and unzipped them. I opened it and converted it from CSV to XLS files. I created folders for the CSV and XLS files. For each XLS file, I did the following:
- Created a column named ride_length, in which it subtracts the ended_at and started_at (=d2-c2) and formatted as hh:mm:ss.
- Created a column named day_of_week, in which determines the days of the week assigned by their numbers, and formatted as Numbers with no decimals.
- Then converted all the files into CSV files for uploading in Big Query.
Station name will be used in joins in Big Query.
Since my account in BigQuery was in trial mode, file sizes were limited to 100 MB, so some of the updated CSV files could not be uploaded. To pursue it, I modified again the CSV files:
- Opened new Excel file that is used for the station name, station id, and its longitude and latitude,
- Copied the start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, and end_lng columns from each CSV files.
- Pasted it in the Excel Files and used Remove Duplicates to delete all the duplications in the Excel and remains the unique data
- Saved it into CSV files and named as Station_name
- Deleted the start_station_name, start_station_id, end_station_name, and end_station_id column in each CSV file that I’d copied to reduce the file size.
- In that case, it would have a two sets of CSV files one for the monthly data and the other for the station names.
- All the CSV files were uploaded to BigQuery.
During the uploading of BigQuery, Some of the CSV files regarding to the datatrip did not continue because of the errors. I found out that the cause of the error is that time indicated in ended_at was earlier than the started_at. I did the following:
- Modified the ride_length in each CSV files by using an absolute to remove the negative symbol i.e. =abs(d2-c2)
- Saved it and uploaded it again.
BigQuery: Further Data Cleaning or Manipulating using Structured Query Language
Using the BigQuery SQL, I unite all the monthly data into one database using the UNION command. I also named the day_of_week column to the actual days of the week. See the query below:
--Objective: To clean or manipulate the data
--To unite all monthly data
With raw_data as
(SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202212-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202301-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202302-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202303-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202304-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202305-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202306-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202307-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202308-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202309-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202310-divvy-tripdata`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202311-divvy-tripdata`
)
--Naming day_of_week into actual days
(select *
, case
when day_of_week = 1 then 'Sunday'
when day_of_week = 2 then 'Monday'
when day_of_week = 3 then 'Tuesday'
when day_of_week = 4 then 'Wednesday'
when day_of_week = 5 then 'Thursday'
when day_of_week = 6 then 'Friday'
when day_of_week = 7 then 'Saturday'
else null end as Weekdays
from raw_data
)
I also unite all the station names. The processed station names were also be used for analysis. See the query below:
--To unite all the station name
with raw_data_station_name as (
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.Stations_name-p1`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.Stations_name-p2`
union all
SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.Stations_name-p3`
)
--to get the start station name
,start0 as
(select
"Start" as Point
, start_lat
, start_lng
, start_station_name
, start_station_id
from raw_data_station_name)
, start1 as
(select * from start0
group by 1,2,3,4,5
)
, end0 as
(select
"End" as point
, end_lat
, end_lng
, end_station_name
, end_station_id
from raw_data_station_name)
, end1 as
(select * from end0
group by 1,2,3,4,5
)
--Uniting all the stations (both starting points and ending points)
,uniting as
(select * from end1
union all
select * from start1
order by 1 desc
)
select point
, end_lat as latittude
, end_lng as longitude
, end_station_id as station_id
, end_station_name as station_name
from uniting
group by 1,2,3,4,5
All the queries indicated above were saved as a view:
- Cyclistic_Clean_Data
- Cyclistic_Station_Data
Summary of Analysis
- Mean in a Year- 00:15:21
- Mean as to per Month
- January- 00:10:57
- February- 00:11:51
- March- 00:11:35
- April- 00:14:52
- May- 00:16:28
- June- 00:16:49
- July- 00:17:41
- August- 00:17:00
- September- 00:16:00
- October- 00:14:02
- November- 00:12:20
- December- 00:11:15
- Mean as to Customer Type
- Casual - 00:21:10
- Member - 00:12:03
- Mean as to Rideable Type
- Classic Bike: 00:17:09
- Docked Bike: 01:06:30
- Electric Bike: 00:12:18
- Maximum Casual Ride Length- 24:00:00 (1 day)
- Number of Rider per Type per Day
- Casual
- Sunday- 333,585
- Monday- 234,719
- Tuesday- 246,821
- Wednesday- 247,864
- Thursday- 271,151
- Friday- 309,570
- Saturday- 408,691
- Member
- Sunday- 402,607
- Monday- 491,868
- Tuesday- 575,396
- Wednesday- 578,840
- Thursday- 588,144
- Friday- 521,939
- Saturday- 466,415
- Top 10 Starting Point Places where Casual Riders Go
- Streeter Dr & Grand Avenu
- DuSable Lake Shore Dr & Monroe St.
- DuSable Lake Shore Dr & North Blvd
- Michigan Ave & Oak St.
- Theater on the Lake
- Millenium Park
- Kingsbury St, & Kinzie
- Halsted St.& Fulton St.
- N. Carpenter St. & WLake St.
- N. Green St. & W Lake St.
- Top 10 Ending Point Places where Casual Riders Go
- Streeter Dr & Grand Avenu
- DuSable Lake Shore Dr & Monroe St.
- DuSable Lake Shore Dr & North Blvd
- Michigan Ave & Oak St.
- Theater on the Lake
- Millenium Park
- Kingsbury St, & Kinzie
- Halsted St.& Fulton St.
- N. Carpenter St. & WLake St.
- N. Green St. & W Lake St.
Key Findings
Monthly Mean Usage of Cyclistic Customers
Number of Users per Ride Type per Day
Start and End Location of a Casual Riders
The Geo maps reveal that a significant number of casual riders initiate their bike rentals near Lake Michigan, specifically around Grand Avenue, and conclude their rides in the same vicinity. This pattern suggests that casual riders are increasingly choosing to spend their weekends in this area.
Recommendations
- Implement a comprehensive marketing campaign featuring discounts and promotions during the months of June, July, and August, particularly on weekends, as this is when the majority of casual riders tend to use our services.
- To reduce marketing expenses, I recommend utilizing banner advertisements or tarpaulins placed near Michigan Lake. Distributing flyers is not advisable.
- Identify whether casual riders are tourists or residents. Non-tourists represent a potential pool of membership candidates, so focusing on this demographic may increase our chances of converting them into members.
- It is essential to establish a reasonable pricing structure that differentiates between and member riders.
Google Data Analytics: 8-Course Series in Data Analytics
0 Comments