Skip to content

Detect Anomaly with Time Series Model (ARIMA PLUS XREG) in BigQuery ML

Anomaly detection in time-series data is vital for maintaining data integrity, ensuring system reliability, and enabling proactive decision-making. By identifying unusual patterns, anomalies act as an early warning system, alerting to potential issues before they escalate. They also facilitate performance monitoring, fault diagnosis, and fraud detection across various domains.

ARIMA Plus model has a great performance for anomaly detection in time-series data. It can effectively capture seasonal patterns in time-series data, enabling the detection of anomalies that occur within specific time periods or follow recurring patterns. Meanwhile, it provides accurate forecasts by identifying deviations from expected behaviour.

  1. Prepare the training data
  2. Create a ARIMA PLUS XREG model
  3. Use the model to detect anomalies

Prepare the training data

  1. Create a dataset called air_quality
  2. Extract the daily data from pm25, wind_speed, temperature table, union them together, order by date, and stored in a new table called Florida_air_quality_daily
CREATE TABLE `air_quality.Florida_air_quality_daily`
AS
WITH
  pm25_daily AS (
    SELECT
      avg(arithmetic_mean) AS pm25, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`
    WHERE
      state_name = 'Florida'
      AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass'
    GROUP BY date_local
  ),
  wind_speed_daily AS (
    SELECT
      avg(arithmetic_mean) AS wind_speed, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`
    WHERE
      state_name = 'Hawaii' AND parameter_name = 'Wind Speed - Resultant'
    GROUP BY date_local
  ),
  temperature_daily AS (
    SELECT
      avg(first_max_value) AS temperature, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`
    WHERE
      state_name = 'Florida' AND parameter_name = 'Outdoor Temperature'
    GROUP BY date_local
  )

SELECT
  pm25_daily.date AS date, pm25, wind_speed, temperature
FROM pm25_daily
JOIN wind_speed_daily USING (date)
JOIN temperature_daily USING (date)
ORDER BY date

Then you will see the result like this, the new table contains 8939 records.

Rowdatepm25wind_speedtemperature
11991-02-023.6665.80909182.0
21991-02-063.7274.5580.0
31991-02-0915.65743.69583370.0
41991-02-138.206697.237576.0
51991-02-167.591462.69583356.0

Create a ARIMA PLUS XREG model

ARIMA PLUS XREG is a forecasting model that combines elements of ARIMA (AutoRegressive Integrated Moving Average) models with exogenous variables (XREG).

ARIMA (AutoRegressive Integrated Moving Average): ARIMA models are a class of statistical models used for analyzing and forecasting time series data. They involve three key components: AutoRegressive (AR) terms, Integrated (I) terms, and Moving Average (MA) terms. AR terms represent the correlation between the current value and its past values, MA terms capture the relationship between the current value and the residual errors from a moving average model applied to lagged observations, and I terms represent differencing of the time series data to achieve stationarity.

XREG (Exogenous Variables): These are external variables that are believed to have a causal effect on the time series being analysed. Unlike the endogenous variables in the time series itself, exogenous variables are not modelled as part of the ARIMA process but are instead incorporated directly into the forecasting model. These variables can provide additional information to improve the accuracy of forecasts.

When you combine ARIMA with exogenous variables (XREG), you create a more flexible model that can capture both the internal dynamics of the time series data (modelled by ARIMA) and the influence of external factors (captured by the exogenous variables).

CREATE OR REPLACE MODEL `air_quality.ARIMA_PLUS_XREG_model`
  OPTIONS (
    model_type = 'ARIMA_PLUS_XREG',
    auto_arima=TRUE,
    time_series_data_col = 'temperature',
    time_series_timestamp_col = 'date'
    )
AS
SELECT
  *
FROM
  `air_quality.Florida_air_quality_daily`;

In the evaluation, you will see result like this.

If you are not familiar with the meaning of each column, please read this article

Make a Time Series Forecasting model with BigQuery ML

Non-seasonal P Non-seasonal D Non-seasonal Q Has driftHas spikes and dipsHas holiday effectHas step changesLog likelihoodAICVarianceSeasonal period
213FalseTrueFalseTrue-23,166.47246,344.9453.433Weekly, Yearly
213FalseTrueFalseTrue-23,179.29946,372.5993.441Weekly, Yearly
312FalseTrueFalseTrue-23,187.54146,387.0813.446Weekly, Yearly
112FalseTrueFalseTrue-23,215.26146,438.5233.463Weekly, Yearly
Swipe to see all columns

Use the model to detect anomalies

SELECT
  *
FROM
  ML.DETECT_ANOMALIES (
   MODEL `instruction-415216.air_quality.ARIMA_PLUS_XREG_model`,
   STRUCT(0.7 AS anomaly_prob_threshold)
  )
ORDER BY
  date ASC;

STRUCT(0.7 AS anomaly_prob_threshold): it defines the threshold for anomaly probability. In this case, anomalies are detected when the probability of being an anomaly is greater than or equal to 0.7.


You will observe results similar to this one, wherein Raw 8 has been flagged as an anomaly due to its anomaly probability of 0.96, surpassing the threshold value of 0.7.

Rowdatetemperatureis_anomalylower_boundupper_boundanomaly_probability
11991-02-02 00:00:00 UTC82.0false80.0778252487859783.922174751214030.0
21991-02-03 00:00:00 UTCnullnullnullnullnull
31991-02-04 00:00:00 UTCnullnullnullnullnull
41991-02-05 00:00:00 UTCnullnullnullnullnull
51991-02-06 00:00:00 UTC80.0false79.70613932517159183.5504888275996510.6202221021141151
61991-02-07 00:00:00 UTCnullnullnullnullnull
71991-02-08 00:00:00 UTCnullnullnullnullnull
81991-02-09 00:00:00 UTC70.0true71.9732404207216875.817589923149740.9647422106196

You can apply the model to new data and detect the anomalies,too.