Skip to content

Make a Clustering model with BigQuery ML

Unsupervised Model

An unsupervised model is a type of machine learning model where the algorithm learns patterns, structures, or relationships within the data without explicit supervision or labeled responses. Unlike supervised learning, where the model learns from input-output pairs, unsupervised learning deals with unlabeled data, seeking to uncover hidden patterns or groupings within the data.

K-Means

K-Means is a popular unsupervised machine learning algorithm used for clustering data points into groups or clusters based on similarities in their feature space. It’s a simple and efficient algorithm that aims to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean, serving as the cluster’s “centroid.”

K-Means has various applications, including customer segmentation, document clustering, image segmentation, and anomaly detection, among others.

Understand the raw data

Data source: google public dataset – bigquery-public-data.london_bicycles.cycle_hire

Data schema:

Field nameTypeMode
rental_idINTEGERREQUIRED
durationINTEGERNULLABLE
duration_msINTEGERNULLABLE
bike_idINTEGERNULLABLE
bike_modelSTRINGNULLABLE
end_dateTIMESTAMPNULLABLE
end_station_idINTEGERNULLABLE
end_station_nameSTRINGNULLABLE
start_dateTIMESTAMPNULLABLE
start_station_idINTEGERNULLABLE
start_station_nameSTRINGNULLABLE
end_station_logical_terminalINTEGERNULLABLE
start_station_logical_terminalINTEGERNULLABLE
end_station_priority_idINTEGERNULLABLE

This data contains the number of hires of London’s Santander Cycle Hire Scheme from 2011 to present. Data includes start and stop timestamps, station names and ride duration. It contains 83,434,866 number of rows and around 9.57 GB in total.

Transform the raw data

There are many reasons to make a clustering. In this scenario, I want to find out which stations are more naturally in common. According to the understanding of the dataset, we can extract some new features to make the clustering such as

  • Duration of rentals
  • Number of trips per day
  • Distance from city center

However, none of them exists in the raw data, we need to make a transformation, extracting the useful information and remove the less related ones.

WITH
hs AS (
SELECT
  h.start_station_name AS station_name,
  IF
  (EXTRACT(DAYOFWEEK
    FROM
      h.start_date) = 1
    OR EXTRACT(DAYOFWEEK
    FROM
      h.start_date) = 7,
    "weekend",
    "weekday") AS isweekday,
  h.duration,
  ST_DISTANCE(ST_GEOGPOINT(s.longitude,
      s.latitude),
    ST_GEOGPOINT(-0.1,
      51.5))/1000 AS distance_from_city_center
FROM
  `bigquery-public-data.london_bicycles.cycle_hire` AS h
JOIN
  `bigquery-public-data.london_bicycles.cycle_stations` AS s
ON
  h.start_station_id = s.id
WHERE
  h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
  AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
stationstats AS (
SELECT
  station_name,
  isweekday,
  AVG(duration) AS duration,
  COUNT(duration) AS num_trips,
  MAX(distance_from_city_center) AS distance_from_city_center
FROM
  hs
GROUP BY
  station_name, isweekday )
SELECT
*
FROM
stationstats
ORDER BY
distance_from_city_center ASC

Then you will see the result like this. The transformed table contains 1592 rows.

Rowstation_nameisweekdaydurationnum_tripsdistance_from_city_center
1Borough Road, Elephant & Castleweekday1109.932162115150357490.12623965466425408
2Borough Road, Elephant & Castleweekend2125.0958286358517740.12623965466425408
3Webber Street , Southwarkweekday795.4396194568051965170.16402063786209384
4Webber Street , Southwarkweekend938.3570105003092316190.16402063786209384
5Great Suffolk Street, The Boroughweekday802.5302922309339284180.19366718830977991
6Great Suffolk Street, The Boroughweekend1018.310276679842120240.19366718830977991
7LSBU (Borough Road), Elephant & Castleweekday1228.174843889384656050.25790299799917293
8LSBU (Borough Road), Elephant & Castleweekend1564.716336295283614630.25790299799917293
9Harper Road, The Boroughweekday968.1826831588965731530.30630610581879264
10Harper Road, The Boroughweekend1443.665024630541910150.30630610581879

Create the Clustering Model

Based on the transformed result, we can start building the clustering model now.

CREATE OR REPLACE MODEL `london_bike.station_clusters`
  OPTIONS(model_type='kmeans', 
          num_clusters=4) AS
WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
  IF
    (EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 1
      OR EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 7,
      "weekend",
      "weekday") AS isweekday,
    h.duration,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name, isweekday)
SELECT
  * EXCEPT(station_name, isweekday)
FROM
  stationstats
 

Model Evaluation

Evaluation results can help you to interpret and understand the different clusters. For example, centroid 3 shows a busy city station that is close to the city center. Centroid 2 shows the second city station which is less busy and used for longer duration rentals. Centroid 1 shows a less busy city station, with shorter duration rentals. Centroid 4 shows a suburban station with trips that are longer.

Prediction

WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF
    (EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 1
      OR EXTRACT(DAYOFWEEK
      FROM
        h.start_date) = 7,
      "weekend",
      "weekday") AS isweekday,
    h.duration,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name, isweekday )
SELECT
  * EXCEPT(nearest_centroids_distance)
FROM
  ML.PREDICT( MODEL `instruction-415216.london_bike.station_clusters`,
    (
    SELECT
      *
    FROM
      stationstats
    WHERE
      REGEXP_CONTAINS(station_name, 'Kennington')))

The result will look like this

RowCENTROID_IDstation_nameisweekdaydurationnum_tripsdistance_from_city_center
11Cleaver Street, Kenningtonweekend1239.416809605488711661.4967922765165333
21Kennington Road , Vauxhallweekend1720.598319327731629750.89156462775600731
31Kennington Cross, Kenningtonweekend1160.482215325853141891.4625875338501981
41Kennington Oval, Ovalweekend1503.619246861924928682.0831341271372983
51Cotton Garden Estate, Kenningtonweekday1572.919254658385253131.1170336205776936
64Kennington Station, Kenningtonweekend3579.285714285713818481.298667735696762
71Kennington Lane Rail Bridge, Vauxhallweekend1812.217790187474750142.1750328347653012
81Doddington Grove, Kenningtonweekend1972.409638554216821581.4681405273793819
93Kennington Cross, Kenningtonweekday911.52387777705371153491.4625875338501981
101Kennington Road Post Office, Ovalweekend2396.924791086350517951.84603345094448
111Kennington Oval, Ovalweekday1272.502859519600996172.0831341271372983
121Cleaver Street, Kenningtonweekday1228.805277302712340931.4967922765165333
131Kennington Station, Kenningtonweekday1689.587692307692448751.29866773569676