Create a basic Regression model with BigQuery ML

What is Regression model?

Regression models are statistical tools used to understand how one or more independent variables influence a dependent variable. Their main job is predicting continuous outcomes, like house prices or sales forecasts.

These models help us see how changes in independent variables affect the dependent one, holding other factors constant. For example, in real estate, we can predict house prices by considering variables like location and size.

Regression models come in various forms such as linear regression, multiple regression, polynomial regression, logistic regression, etc., each suited for different types of data and relationships between variables.

Today, we will start from the simplest one. Let’s create a basic regression model with BigQuery ML.

Create the dataset

In the Google Cloud console, create a dataset under your project_id. Later we will store the model in the dataset. For example, the dataset I created is called “ga_sample”

2. Add the Dataset

Click the ADD button, search the source “bigquery-public-data” and search for the dataset “google_analytics_sample”. Click the view the dataset button and then give a star.

Create a a basic Regression model

Data Source: bigquery-public-data.google_analytics_sample.ga_sessions_*

Data Schema:

Field name	Type	Mode
visitorId	INTEGER	NULLABLE
visitNumber	INTEGER	NULLABLE
visitId	INTEGER	NULLABLE
visitStartTime	INTEGER	NULLABLE
date	STRING	NULLABLE
totals	RECORD	NULLABLE
trafficSource	RECORD	NULLABLE
device	RECORD	NULLABLE
geoNetwork	RECORD	NULLABLE
customDimensions	RECORD	REPEATED
hits	RECORD	REPEATED
fullVisitorId	STRING	NULLABLE
userId	STRING	NULLABLE
clientId	STRING	NULLABLE
channelGrouping	STRING	NULLABLE
socialEngagementType	STRING	NULLABLE

Schema of the Dataset

The SQL Statement

CREATE OR REPLACE MODEL `ga_sample.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170531'
LIMIT 100000;

Let me break down the SQL code and explain line by line.

Create a model called sample_model under the dataset ga_sample
Declare the model type as logistic regression
IF(totals.transactions IS NULL, 0, 1) AS label : it creates a column called label and assign a value of 1 if the totals.transactions column is not NULL (indicating a transaction occurred), and 0 otherwise. This column will be the target variable for our logistic regression model.
IFNULL(device.operatingSystem, "") AS os: it creates a column called os, which represents the operating system used by the device. If device.operatingSystem is NULL, then it assigns an empty string (“”)
device.isMobile AS is_mobile: it creates a column called is_mobile, indicating whether the device is a mobile device (1 for true, 0 for false).
IFNULL(geoNetwork.country, "") AS country: it creates a column called country, representing the country of the user. Just like the os column, if geoNetwork.country is NULL, it assigns an empty string (“”).
IFNULL(totals.pageviews, 0) AS pageviews: it creates a column called pageviews, representing the number of pageviews. If totals.pageviews is NULL, it assigns a value of 0.
The WHERE clause filters the data to include only sessions that occurred between 2016/01/01 and 2017/05/31. The _TABLE_SUFFIX pseudo column represents the date suffix of the table name.
And at last, limits the number of rows returned by the query to 100,000.

After running the SQL statement, we will get a sample_model stored in the dataset ga_sample

Explore the Training detail

Loss

The loss function quantifies how well the model’s predictions match the actual target values during training. It represents the error between the predicted output and the true output. The goal of training a model is to minimise this loss function.

Duration

Training duration refers to the amount of time it takes to train a model on a dataset.

Learn rate

The learning rate is a hyperparameter that controls the size of the steps taken during the optimisation process to update the model parameters. It determines how quickly or slowly the model learns from the training data.

Explore the Model Performance

Share on Social Media