Skip to content

Create a basic Regression model with BigQuery ML

What is Regression model?

Regression models are statistical tools used to understand how one or more independent variables influence a dependent variable. Their main job is predicting continuous outcomes, like house prices or sales forecasts.

These models help us see how changes in independent variables affect the dependent one, holding other factors constant. For example, in real estate, we can predict house prices by considering variables like location and size.

Regression models come in various forms such as linear regression, multiple regression, polynomial regression, logistic regression, etc., each suited for different types of data and relationships between variables.

Today, we will start from the simplest one. Let’s create a basic regression model with BigQuery ML.

Create the dataset

  1. In the Google Cloud console, create a dataset under your project_id. Later we will store the model in the dataset. For example, the dataset I created is called “ga_sample”
create a dataset in bigquery console
create a dataset in bigquery console

2. Add the Dataset

Click the ADD button, search the source “bigquery-public-data” and search for the dataset “google_analytics_sample”. Click the view the dataset button and then give a star.

Create a a basic Regression model

Data Source: bigquery-public-data.google_analytics_sample.ga_sessions_*

Data Schema:

Field nameTypeMode
visitorIdINTEGERNULLABLE
visitNumberINTEGERNULLABLE
visitIdINTEGERNULLABLE
visitStartTimeINTEGERNULLABLE
dateSTRINGNULLABLE
totalsRECORDNULLABLE
trafficSourceRECORDNULLABLE
deviceRECORDNULLABLE
geoNetworkRECORDNULLABLE
customDimensionsRECORDREPEATED
hitsRECORDREPEATED
fullVisitorIdSTRINGNULLABLE
userIdSTRINGNULLABLE
clientIdSTRINGNULLABLE
channelGroupingSTRINGNULLABLE
socialEngagementTypeSTRINGNULLABLE
Schema of the Dataset

The SQL Statement

CREATE OR REPLACE MODEL `ga_sample.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170531'
LIMIT 100000;

Let me break down the SQL code and explain line by line.

  • Create a model called sample_model under the dataset ga_sample
  • Declare the model type as logistic regression
  • IF(totals.transactions IS NULL, 0, 1) AS label : it creates a column called label and assign a value of 1 if the totals.transactions column is not NULL (indicating a transaction occurred), and 0 otherwise. This column will be the target variable for our logistic regression model.
  • IFNULL(device.operatingSystem, "") AS os: it creates a column called os, which represents the operating system used by the device. If device.operatingSystem is NULL, then it assigns an empty string (“”)
  • device.isMobile AS is_mobile: it creates a column called is_mobile, indicating whether the device is a mobile device (1 for true, 0 for false).
  • IFNULL(geoNetwork.country, "") AS country: it creates a column called country, representing the country of the user. Just like the os column, if geoNetwork.country is NULL, it assigns an empty string (“”).
  • IFNULL(totals.pageviews, 0) AS pageviews: it creates a column called pageviews, representing the number of pageviews. If totals.pageviews is NULL, it assigns a value of 0.
  • The WHERE clause filters the data to include only sessions that occurred between 2016/01/01 and 2017/05/31. The _TABLE_SUFFIX pseudo column represents the date suffix of the table name.
  • And at last, limits the number of rows returned by the query to 100,000.

After running the SQL statement, we will get a sample_model stored in the dataset ga_sample

Explore the Training detail

Loss

The loss function quantifies how well the model’s predictions match the actual target values during training. It represents the error between the predicted output and the true output. The goal of training a model is to minimise this loss function.

Duration

Training duration refers to the amount of time it takes to train a model on a dataset.

Learn rate

The learning rate is a hyperparameter that controls the size of the steps taken during the optimisation process to update the model parameters. It determines how quickly or slowly the model learns from the training data.

Explore the Model Performance