Unsupervised Model
An unsupervised model is a type of machine learning model where the algorithm learns patterns, structures, or relationships within the data without explicit supervision or labeled responses. Unlike supervised learning, where the model learns from input-output pairs, unsupervised learning deals with unlabeled data, seeking to uncover hidden patterns or groupings within the data.
K-Means
K-Means is a popular unsupervised machine learning algorithm used for clustering data points into groups or clusters based on similarities in their feature space. It’s a simple and efficient algorithm that aims to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean, serving as the cluster’s “centroid.”
K-Means has various applications, including customer segmentation, document clustering, image segmentation, and anomaly detection, among others.
Understand the raw data
Data source: google public dataset – bigquery-public-data.london_bicycles.cycle_hire
Data schema:
Field name | Type | Mode |
---|---|---|
rental_id | INTEGER | REQUIRED |
duration | INTEGER | NULLABLE |
duration_ms | INTEGER | NULLABLE |
bike_id | INTEGER | NULLABLE |
bike_model | STRING | NULLABLE |
end_date | TIMESTAMP | NULLABLE |
end_station_id | INTEGER | NULLABLE |
end_station_name | STRING | NULLABLE |
start_date | TIMESTAMP | NULLABLE |
start_station_id | INTEGER | NULLABLE |
start_station_name | STRING | NULLABLE |
end_station_logical_terminal | INTEGER | NULLABLE |
start_station_logical_terminal | INTEGER | NULLABLE |
end_station_priority_id | INTEGER | NULLABLE |
This data contains the number of hires of London’s Santander Cycle Hire Scheme from 2011 to present. Data includes start and stop timestamps, station names and ride duration. It contains 83,434,866 number of rows and around 9.57 GB in total.
Transform the raw data
There are many reasons to make a clustering. In this scenario, I want to find out which stations are more naturally in common. According to the understanding of the dataset, we can extract some new features to make the clustering such as
- Duration of rentals
- Number of trips per day
- Distance from city center
However, none of them exists in the raw data, we need to make a transformation, extracting the useful information and remove the less related ones.
WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday ) SELECT * FROM stationstats ORDER BY distance_from_city_center ASC
Then you will see the result like this. The transformed table contains 1592 rows.
Row | station_name | isweekday | duration | num_trips | distance_from_city_center |
---|---|---|---|---|---|
1 | Borough Road, Elephant & Castle | weekday | 1109.9321621151503 | 5749 | 0.12623965466425408 |
2 | Borough Road, Elephant & Castle | weekend | 2125.09582863585 | 1774 | 0.12623965466425408 |
3 | Webber Street , Southwark | weekday | 795.43961945680519 | 6517 | 0.16402063786209384 |
4 | Webber Street , Southwark | weekend | 938.35701050030923 | 1619 | 0.16402063786209384 |
5 | Great Suffolk Street, The Borough | weekday | 802.53029223093392 | 8418 | 0.19366718830977991 |
6 | Great Suffolk Street, The Borough | weekend | 1018.3102766798421 | 2024 | 0.19366718830977991 |
7 | LSBU (Borough Road), Elephant & Castle | weekday | 1228.1748438893846 | 5605 | 0.25790299799917293 |
8 | LSBU (Borough Road), Elephant & Castle | weekend | 1564.7163362952836 | 1463 | 0.25790299799917293 |
9 | Harper Road, The Borough | weekday | 968.18268315889657 | 3153 | 0.30630610581879264 |
10 | Harper Road, The Borough | weekend | 1443.6650246305419 | 1015 | 0.30630610581879 |
Create the Clustering Model
Based on the transformed result, we can start building the clustering model now.
CREATE OR REPLACE MODEL `london_bike.station_clusters` OPTIONS(model_type='kmeans', num_clusters=4) AS WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday) SELECT * EXCEPT(station_name, isweekday) FROM stationstats
Model Evaluation
Evaluation results can help you to interpret and understand the different clusters. For example, centroid 3 shows a busy city station that is close to the city center. Centroid 2 shows the second city station which is less busy and used for longer duration rentals. Centroid 1 shows a less busy city station, with shorter duration rentals. Centroid 4 shows a suburban station with trips that are longer.
Prediction
WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday ) SELECT * EXCEPT(nearest_centroids_distance) FROM ML.PREDICT( MODEL `instruction-415216.london_bike.station_clusters`, ( SELECT * FROM stationstats WHERE REGEXP_CONTAINS(station_name, 'Kennington')))
The result will look like this
Row | CENTROID_ID | station_name | isweekday | duration | num_trips | distance_from_city_center |
---|---|---|---|---|---|---|
1 | 1 | Cleaver Street, Kennington | weekend | 1239.4168096054887 | 1166 | 1.4967922765165333 |
2 | 1 | Kennington Road , Vauxhall | weekend | 1720.5983193277316 | 2975 | 0.89156462775600731 |
3 | 1 | Kennington Cross, Kennington | weekend | 1160.4822153258531 | 4189 | 1.4625875338501981 |
4 | 1 | Kennington Oval, Oval | weekend | 1503.6192468619249 | 2868 | 2.0831341271372983 |
5 | 1 | Cotton Garden Estate, Kennington | weekday | 1572.9192546583852 | 5313 | 1.1170336205776936 |
6 | 4 | Kennington Station, Kennington | weekend | 3579.2857142857138 | 1848 | 1.298667735696762 |
7 | 1 | Kennington Lane Rail Bridge, Vauxhall | weekend | 1812.2177901874747 | 5014 | 2.1750328347653012 |
8 | 1 | Doddington Grove, Kennington | weekend | 1972.4096385542168 | 2158 | 1.4681405273793819 |
9 | 3 | Kennington Cross, Kennington | weekday | 911.52387777705371 | 15349 | 1.4625875338501981 |
10 | 1 | Kennington Road Post Office, Oval | weekend | 2396.9247910863505 | 1795 | 1.84603345094448 |
11 | 1 | Kennington Oval, Oval | weekday | 1272.5028595196009 | 9617 | 2.0831341271372983 |
12 | 1 | Cleaver Street, Kennington | weekday | 1228.8052773027123 | 4093 | 1.4967922765165333 |
13 | 1 | Kennington Station, Kennington | weekday | 1689.5876923076924 | 4875 | 1.29866773569676 |