Anomaly detection in time-series data is vital for maintaining data integrity, ensuring system reliability, and enabling proactive decision-making. By identifying unusual patterns, anomalies act as an early warning system, alerting to potential issues before they escalate. They also facilitate performance monitoring, fault diagnosis, and fraud detection across various domains.
ARIMA Plus model has a great performance for anomaly detection in time-series data. It can effectively capture seasonal patterns in time-series data, enabling the detection of anomalies that occur within specific time periods or follow recurring patterns. Meanwhile, it provides accurate forecasts by identifying deviations from expected behaviour.
Prepare the training data
- Create a dataset called air_quality
- Extract the daily data from
pm25, wind_speed, temperature
table, union them together, order by date, and stored in a new table called Florida_air_quality_daily
CREATE TABLE `air_quality.Florida_air_quality_daily` AS WITH pm25_daily AS ( SELECT avg(arithmetic_mean) AS pm25, date_local AS date FROM `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary` WHERE state_name = 'Florida' AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass' GROUP BY date_local ), wind_speed_daily AS ( SELECT avg(arithmetic_mean) AS wind_speed, date_local AS date FROM `bigquery-public-data.epa_historical_air_quality.wind_daily_summary` WHERE state_name = 'Hawaii' AND parameter_name = 'Wind Speed - Resultant' GROUP BY date_local ), temperature_daily AS ( SELECT avg(first_max_value) AS temperature, date_local AS date FROM `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary` WHERE state_name = 'Florida' AND parameter_name = 'Outdoor Temperature' GROUP BY date_local ) SELECT pm25_daily.date AS date, pm25, wind_speed, temperature FROM pm25_daily JOIN wind_speed_daily USING (date) JOIN temperature_daily USING (date) ORDER BY date
Then you will see the result like this, the new table contains 8939 records.
Row | date | pm25 | wind_speed | temperature |
---|---|---|---|---|
1 | 1991-02-02 | 3.666 | 5.809091 | 82.0 |
2 | 1991-02-06 | 3.727 | 4.55 | 80.0 |
3 | 1991-02-09 | 15.6574 | 3.695833 | 70.0 |
4 | 1991-02-13 | 8.20669 | 7.2375 | 76.0 |
5 | 1991-02-16 | 7.59146 | 2.695833 | 56.0 |
Create a ARIMA PLUS XREG model
ARIMA PLUS XREG is a forecasting model that combines elements of ARIMA (AutoRegressive Integrated Moving Average) models with exogenous variables (XREG).
ARIMA (AutoRegressive Integrated Moving Average): ARIMA models are a class of statistical models used for analyzing and forecasting time series data. They involve three key components: AutoRegressive (AR) terms, Integrated (I) terms, and Moving Average (MA) terms. AR terms represent the correlation between the current value and its past values, MA terms capture the relationship between the current value and the residual errors from a moving average model applied to lagged observations, and I terms represent differencing of the time series data to achieve stationarity.
XREG (Exogenous Variables): These are external variables that are believed to have a causal effect on the time series being analysed. Unlike the endogenous variables in the time series itself, exogenous variables are not modelled as part of the ARIMA process but are instead incorporated directly into the forecasting model. These variables can provide additional information to improve the accuracy of forecasts.
When you combine ARIMA with exogenous variables (XREG), you create a more flexible model that can capture both the internal dynamics of the time series data (modelled by ARIMA) and the influence of external factors (captured by the exogenous variables).
CREATE OR REPLACE MODEL `air_quality.ARIMA_PLUS_XREG_model` OPTIONS ( model_type = 'ARIMA_PLUS_XREG', auto_arima=TRUE, time_series_data_col = 'temperature', time_series_timestamp_col = 'date' ) AS SELECT * FROM `air_quality.Florida_air_quality_daily`;
In the evaluation, you will see result like this.
If you are not familiar with the meaning of each column, please read this article
Make a Time Series Forecasting model with BigQuery ML
Non-seasonal P | Non-seasonal D | Non-seasonal Q | Has drift | Has spikes and dips | Has holiday effect | Has step changes | Log likelihood | AIC | Variance | Seasonal period |
---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 3 | False | True | False | True | -23,166.472 | 46,344.945 | 3.433 | Weekly, Yearly |
2 | 1 | 3 | False | True | False | True | -23,179.299 | 46,372.599 | 3.441 | Weekly, Yearly |
3 | 1 | 2 | False | True | False | True | -23,187.541 | 46,387.081 | 3.446 | Weekly, Yearly |
1 | 1 | 2 | False | True | False | True | -23,215.261 | 46,438.523 | 3.463 | Weekly, Yearly |
Use the model to detect anomalies
SELECT * FROM ML.DETECT_ANOMALIES ( MODEL `instruction-415216.air_quality.ARIMA_PLUS_XREG_model`, STRUCT(0.7 AS anomaly_prob_threshold) ) ORDER BY date ASC;
STRUCT(0.7 AS anomaly_prob_threshold)
: it defines the threshold for anomaly probability. In this case, anomalies are detected when the probability of being an anomaly is greater than or equal to 0.7.
You will observe results similar to this one, wherein Raw 8 has been flagged as an anomaly due to its anomaly probability of 0.96, surpassing the threshold value of 0.7.
Row | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability |
---|---|---|---|---|---|---|
1 | 1991-02-02 00:00:00 UTC | 82.0 | false | 80.07782524878597 | 83.92217475121403 | 0.0 |
2 | 1991-02-03 00:00:00 UTC | null | null | null | null | null |
3 | 1991-02-04 00:00:00 UTC | null | null | null | null | null |
4 | 1991-02-05 00:00:00 UTC | null | null | null | null | null |
5 | 1991-02-06 00:00:00 UTC | 80.0 | false | 79.706139325171591 | 83.550488827599651 | 0.6202221021141151 |
6 | 1991-02-07 00:00:00 UTC | null | null | null | null | null |
7 | 1991-02-08 00:00:00 UTC | null | null | null | null | null |
8 | 1991-02-09 00:00:00 UTC | 70.0 | true | 71.97324042072168 | 75.81758992314974 | 0.9647422106196 |
You can apply the model to new data and detect the anomalies,too.