Skip to content

Make a Time Series Forecasting model with BigQuery ML

image 6
  1. What is Time Series Data?
  2. What is ARIMA model?
  3. Overview the dataset
  4. Prepare the training data
  5. Create the ARIMA model
  6. Evaluate the model
  7. Make a Prediction

What is Time Series Data ?


Time series data is a type of data where observations are collected, recorded, or measured at successive points in time, typically at regular intervals. Time series data can be decomposed into trend, seasonality, cyclicality, and irregularity components. Understanding and modelling these components are essential for accurate forecasting and analysis of time series data.

  1. Trend:
    • The long-term movement or directionality of the data over time. It captures the overall tendency of the data to increase, decrease, or remain stable.
    • Trends can be upward (increasing), downward (decreasing), or horizontal (no significant change).
  2. Seasonality:
    • The repetitive and predictable patterns that occur at regular intervals within the data, often corresponding to specific time periods such as days, weeks, months, or seasons.
    • Seasonal patterns can be daily, weekly, monthly, quarterly, or annual, depending on the nature of the data and the context.
  3. Cyclicality:
    • Similar to seasonality, cyclicality involves repetitive patterns; however, these patterns do not have fixed periods and may occur over longer time spans.
    • Cycles typically represent fluctuations around the trend that are not of fixed frequency or duration.
  4. Irregularity (or Residual):
    • The random and unpredictable fluctuations or noise in the data that cannot be attributed to the trend, seasonality, or cyclicality.
    • Irregular components represent the unexplained variation in the data and are often caused by random events, outliers, or measurement errors.

What is ARIMA model ?

ARIMA (AutoRegressive Integrated Moving Average) is a popular statistical method used for time series forecasting. It’s a flexible and powerful approach that can capture a wide range of temporal patterns in data. ARIMA models are especially useful when dealing with non-stationary time series data, where the statistical properties of the series change over time.

ARIMA models consist of three main components:

  1. AutoRegressive (AR) Component:
    • The autoregressive component represents the relationship between an observation and a number of lagged observations (i.e., previous time steps).
    • The term “autoregressive” refers to the idea that the value at any given time is linearly related to its own past values.
    • In an AR(p) model, the current value of the time series is modeled as a linear combination of its past p values.
  2. Integrated (I) Component:
    • The integrated component refers to the differencing of raw observations (subtracting an observation from its previous observation) in order to make the time series stationary.
    • Stationarity implies that the statistical properties of the time series (such as mean and variance) remain constant over time.
    • The degree of differencing required to make the series stationary determines the “order of integration” (denoted by the ‘d’ parameter). If no differencing is necessary, ‘d’ is set to 0.
  3. Moving Average (MA) Component:
    • The moving average component models the relationship between an observation and a residual error from a moving average model applied to lagged observations.
    • Unlike the autoregressive component, which incorporates past values of the series, the moving average component incorporates past forecast errors.
    • In an MA(q) model, the current value of the time series is modeled as a linear combination of past q forecast errors.

Overview the Dataset

SELECT
   bike_id,
   EXTRACT (DATE FROM TIMESTAMP(start_date)) AS start_date,
   start_station_name,
   end_station_name,
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
  WHERE start_date BETWEEN '2016-01-01' AND '2017-01-01'
  LIMIT 5
Rowbike_idstart_datestart_station_nameend_station_name
195462016-09-03St. Martin’s Street, West EndCrosswall, Tower
284552016-08-31Hyde Park Corner, Hyde ParkPark Lane, Mayfair
3125402016-08-31Wardour Street, SohoOld Quebec Street, Marylebone
418012016-09-03George Place Mews, MaryleboneNorthdown Street, King’s Cross
578902016-09-04Bishop’s Bridge Road West, BayswaterJubilee Gardens, South Bank
SELECT
   EXTRACT (DATE FROM TIMESTAMP(start_date)) AS start_date,
   start_station_name,
  COUNT(*) AS total_trips
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
  WHERE start_date BETWEEN '2016-01-01' AND '2017-01-01'
  GROUP BY
    start_station_name, start_date
LIMIT 5
Rowstart_datestart_station_nametotal_trips
12016-09-05Bermondsey Street, Bermondsey66
22016-09-03Exhibition Road Museums, South Kensington65
32016-09-04Good’s Way, King’s Cross46
42016-09-01Hammersmith Town Hall, Hammersmith25
52016-09-01Charles II Street, West End53


Prepare the training data

  • Create a dataset called ‘london_bike’
  • Extract the related data and stored in a table called ‘training_data’
SELECT
   EXTRACT (DATE FROM TIMESTAMP(start_date)) AS start_date,
   start_station_name,
  COUNT(*) AS total_trips
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
  WHERE start_station_name LIKE '%Kensington%'
  GROUP BY
    start_station_name, start_date
  HAVING   start_date BETWEEN '2016-01-01' AND '2017-01-01'

You will see the training data like this

Rowstart_datestart_station_nametotal_trips
12016-05-31Argyll Road, Kensington3
22016-11-04Argyll Road, Kensington4
32017-01-01Argyll Road, Kensington4
42016-03-08Argyll Road, Kensington5
52016-11-29Argyll Road, Kensington5
62016-02-05Argyll Road, Kensington6

If we group the total trip by each date and visualise it, the graph will look like this:

Create the ARIMA model

CREATE OR REPLACE MODEL london_bike.arima_model
  OPTIONS(
    MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='start_date',
    TIME_SERIES_DATA_COL='total_trips',
    TIME_SERIES_ID_COL='start_station_name'
  ) AS
  SELECT
    start_date,
    start_station_name,
    total_trips
  FROM
    london_bike.training_data

Evaluate the ARIMA model

SELECT
  *
FROM
  ML.EVALUATE(MODEL london_bike.arima_model)
Rowstart_station_namenon_seasonal_pnon_seasonal_dnon_seasonal_qhas_driftlog_likelihoodAICvarianceseasonal_periods
1Abingdon Villas, Kensington015false-1056.62413164167812125.248263283356218.297407336004053WEEKLY
2Argyll Road, Kensington015false-1158.68929236860372329.378584737207332.014399603264621WEEKLY
3Barons Court Station, West Kensington015false-1210.18256294648472432.365125892969443.44947290938WEEKLY
4Bevington Road West, North Kensington213false-610.98651112270341233.973022245406832.723647125570743WEEKLY
5Bevington Road, North Kensington311false-468.18303823240092946.3660764648018410.548901269687196WEEKLY

There are five models trained, one for each of the stations in the training data.

  • The first four columns (non_seasonal_{p,d,q} and has_drift) define the ARIMA model.
  • The next three metrics (log_likelihoodAIC, and variance) are relevant to the ARIMA model fitting process.

The fitting process determines the best ARIMA model by using the auto.ARIMA algorithm, one for each time series. Of these metrics, AIC is typically the go-to metric to evaluate how well a time series model fits the data while penalising overly complex models.

Make a Prediction

For example, let’s predict the next 15 days trip at a confidence level of 90%

  • The horizon defines the number of future time steps for which predictions will be generated.
  • The confidence level indicates the level of confidence associated with the predictions.
SELECT * FROM ML.FORECAST(MODEL london_bike.arima_model, 
                  STRUCT(15 AS horizon, 0.9 AS confidence_level))

You will see result like this

start_station_nameforecast_timestampforecast_valuestandard_errorconfidence_levelprediction_interval_lower_boundprediction_interval_upper_boundconfidence_interval_lower_boundconfidence_interval_upper_bound
Abingdon Villas, Kensington2017-01-10 00:00:00 UTC16.9373302814011386.52081779233596670.96.223162531854370627.6514980309479066.223162531854370627.651498030947906
Argyll Road, Kensington2017-01-10 00:00:00 UTC11.0313110517873847.73668799249624680.9-1.680617975724114623.743240079298882-1.680617975724114623.743240079298882
Barons Court Station, West Kensington2017-01-10 00:00:00 UTC28.2752446232115138.41263819472109730.914.45268130294473442.09780794347829414.45268130294473442.097807943478294
Bevington Road West, North Kensington2017-01-10 00:00:00 UTC8.787767090380987.92297784175263690.9-4.230249418568291721.805783599330251-4.230249418568291721.805783599330251
Bevington Road, North Kensington2017-01-10 00:00:00 UTC3.045372380010453.75859700237356040.9-3.13027001160863269.2210147716295339-3.13027001160863269.221014771

The standard error measures the variability or dispersion of the forecast errors. It indicates the typical deviation of the observed values from the forecasted values. A smaller standard error implies that the forecast is more accurate and reliable.

The prediction interval represents a range within which the true future value is expected to fall with a certain level of confidence. The lower bound and upper bound of the prediction interval define the endpoints of this range.

The confidence interval also represents a range within which the true future value is expected to fall with a certain level of confidence. The lower bound and upper bound of the confidence interval define the endpoints of this range.

The Predicted Trips and Actual Trips