Make a Clustering model with BigQuery ML

Unsupervised Model

An unsupervised model is a type of machine learning model where the algorithm learns patterns, structures, or relationships within the data without explicit supervision or labeled responses. Unlike supervised learning, where the model learns from input-output pairs, unsupervised learning deals with unlabeled data, seeking to uncover hidden patterns or groupings within the data.

K-Means

K-Means is a popular unsupervised machine learning algorithm used for clustering data points into groups or clusters based on similarities in their feature space. It’s a simple and efficient algorithm that aims to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean, serving as the cluster’s “centroid.”

K-Means has various applications, including customer segmentation, document clustering, image segmentation, and anomaly detection, among others.

Understand the raw data

Data source: google public dataset – bigquery-public-data.london_bicycles.cycle_hire

Data schema:

Field name	Type	Mode
rental_id	INTEGER	REQUIRED
duration	INTEGER	NULLABLE
duration_ms	INTEGER	NULLABLE
bike_id	INTEGER	NULLABLE
bike_model	STRING	NULLABLE
end_date	TIMESTAMP	NULLABLE
end_station_id	INTEGER	NULLABLE
end_station_name	STRING	NULLABLE
start_date	TIMESTAMP	NULLABLE
start_station_id	INTEGER	NULLABLE
start_station_name	STRING	NULLABLE
end_station_logical_terminal	INTEGER	NULLABLE
start_station_logical_terminal	INTEGER	NULLABLE
end_station_priority_id	INTEGER	NULLABLE

This data contains the number of hires of London’s Santander Cycle Hire Scheme from 2011 to present. Data includes start and stop timestamps, station names and ride duration. It contains 83,434,866 number of rows and around 9.57 GB in total.

Transform the raw data

There are many reasons to make a clustering. In this scenario, I want to find out which stations are more naturally in common. According to the understanding of the dataset, we can extract some new features to make the clustering such as

Duration of rentals
Number of trips per day
Distance from city center

However, none of them exists in the raw data, we need to make a transformation, extracting the useful information and remove the less related ones.

WITH
hs AS (
SELECT
  h.start_station_name AS station_name,
  IF
  (EXTRACT(DAYOFWEEK
    FROM
      h.start_date) = 1
    OR EXTRACT(DAYOFWEEK
    FROM
      h.start_date) = 7,
    "weekend",
    "weekday") AS isweekday,
  h.duration,
  ST_DISTANCE(ST_GEOGPOINT(s.longitude,
      s.latitude),
    ST_GEOGPOINT(-0.1,
      51.5))/1000 AS distance_from_city_center
FROM
  `bigquery-public-data.london_bicycles.cycle_hire` AS h
JOIN
  `bigquery-public-data.london_bicycles.cycle_stations` AS s
ON
  h.start_station_id = s.id
WHERE
  h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
  AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
stationstats AS (
SELECT
  station_name,
  isweekday,
  AVG(duration) AS duration,
  COUNT(duration) AS num_trips,
  MAX(distance_from_city_center) AS distance_from_city_center
FROM
  hs
GROUP BY
  station_name, isweekday )
SELECT
*
FROM
stationstats
ORDER BY
distance_from_city_center ASC

Then you will see the result like this. The transformed table contains 1592 rows.

Row	station_name	isweekday	duration	num_trips	distance_from_city_center
1	Borough Road, Elephant & Castle	weekday	1109.9321621151503	5749	0.12623965466425408
2	Borough Road, Elephant & Castle	weekend	2125.09582863585	1774	0.12623965466425408
3	Webber Street , Southwark	weekday	795.43961945680519	6517	0.16402063786209384
4	Webber Street , Southwark	weekend	938.35701050030923	1619	0.16402063786209384
5	Great Suffolk Street, The Borough	weekday	802.53029223093392	8418	0.19366718830977991
6	Great Suffolk Street, The Borough	weekend	1018.3102766798421	2024	0.19366718830977991
7	LSBU (Borough Road), Elephant & Castle	weekday	1228.1748438893846	5605	0.25790299799917293
8	LSBU (Borough Road), Elephant & Castle	weekend	1564.7163362952836	1463	0.25790299799917293
9	Harper Road, The Borough	weekday	968.18268315889657	3153	0.30630610581879264
10	Harper Road, The Borough	weekend	1443.6650246305419	1015	0.30630610581879

Create the Clustering Model

Based on the transformed result, we can start building the clustering model now.

CREATE OR REPLACE MODEL `london_bike.station_clusters`
  OPTIONS(model_type='kmeans', 
          num_clusters=4) AS
WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
  IF
    (EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 1
      OR EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 7,
      "weekend",
      "weekday") AS isweekday,
    h.duration,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name, isweekday)
SELECT
  * EXCEPT(station_name, isweekday)
FROM
  stationstats

Model Evaluation

Evaluation results can help you to interpret and understand the different clusters. For example, centroid 3 shows a busy city station that is close to the city center. Centroid 2 shows the second city station which is less busy and used for longer duration rentals. Centroid 1 shows a less busy city station, with shorter duration rentals. Centroid 4 shows a suburban station with trips that are longer.

Prediction

WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF
    (EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 1
      OR EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 7,
      "weekend",
      "weekday") AS isweekday,
    h.duration,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name, isweekday )
SELECT
  * EXCEPT(nearest_centroids_distance)
FROM
  ML.PREDICT( MODEL `instruction-415216.london_bike.station_clusters`,
    (
    SELECT
      *
    FROM
      stationstats
    WHERE
      REGEXP_CONTAINS(station_name, 'Kennington')))

The result will look like this

Row	CENTROID_ID	station_name	isweekday	duration	num_trips	distance_from_city_center
1	1	Cleaver Street, Kennington	weekend	1239.4168096054887	1166	1.4967922765165333
2	1	Kennington Road , Vauxhall	weekend	1720.5983193277316	2975	0.89156462775600731
3	1	Kennington Cross, Kennington	weekend	1160.4822153258531	4189	1.4625875338501981
4	1	Kennington Oval, Oval	weekend	1503.6192468619249	2868	2.0831341271372983
5	1	Cotton Garden Estate, Kennington	weekday	1572.9192546583852	5313	1.1170336205776936
6	4	Kennington Station, Kennington	weekend	3579.2857142857138	1848	1.298667735696762
7	1	Kennington Lane Rail Bridge, Vauxhall	weekend	1812.2177901874747	5014	2.1750328347653012
8	1	Doddington Grove, Kennington	weekend	1972.4096385542168	2158	1.4681405273793819
9	3	Kennington Cross, Kennington	weekday	911.52387777705371	15349	1.4625875338501981
10	1	Kennington Road Post Office, Oval	weekend	2396.9247910863505	1795	1.84603345094448
11	1	Kennington Oval, Oval	weekday	1272.5028595196009	9617	2.0831341271372983
12	1	Cleaver Street, Kennington	weekday	1228.8052773027123	4093	1.4967922765165333
13	1	Kennington Station, Kennington	weekday	1689.5876923076924	4875	1.29866773569676

Share on Social Media