What is Regression model?
Regression models are statistical tools used to understand how one or more independent variables influence a dependent variable. Their main job is predicting continuous outcomes, like house prices or sales forecasts.
These models help us see how changes in independent variables affect the dependent one, holding other factors constant. For example, in real estate, we can predict house prices by considering variables like location and size.
Regression models come in various forms such as linear regression, multiple regression, polynomial regression, logistic regression, etc., each suited for different types of data and relationships between variables.
Today, we will start from the simplest one. Let’s create a basic regression model with BigQuery ML.
Create the dataset
- In the Google Cloud console, create a dataset under your project_id. Later we will store the model in the dataset. For example, the dataset I created is called “ga_sample”
2. Add the Dataset
Click the ADD button, search the source “bigquery-public-data” and search for the dataset “google_analytics_sample”. Click the view the dataset button and then give a star.
Create a a basic Regression model
Data Source: bigquery-public-data.google_analytics_sample.ga_sessions_*
Data Schema:
Field name | Type | Mode |
---|---|---|
visitorId | INTEGER | NULLABLE |
visitNumber | INTEGER | NULLABLE |
visitId | INTEGER | NULLABLE |
visitStartTime | INTEGER | NULLABLE |
date | STRING | NULLABLE |
totals | RECORD | NULLABLE |
trafficSource | RECORD | NULLABLE |
device | RECORD | NULLABLE |
geoNetwork | RECORD | NULLABLE |
customDimensions | RECORD | REPEATED |
hits | RECORD | REPEATED |
fullVisitorId | STRING | NULLABLE |
userId | STRING | NULLABLE |
clientId | STRING | NULLABLE |
channelGrouping | STRING | NULLABLE |
socialEngagementType | STRING | NULLABLE |
The SQL Statement
CREATE OR REPLACE MODEL `ga_sample.sample_model` OPTIONS(model_type='logistic_reg') AS SELECT IF(totals.transactions IS NULL, 0, 1) AS label, IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(geoNetwork.country, "") AS country, IFNULL(totals.pageviews, 0) AS pageviews FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170531' LIMIT 100000;
Let me break down the SQL code and explain line by line.
- Create a model called sample_model under the dataset ga_sample
- Declare the model type as logistic regression
IF(totals.transactions IS NULL, 0, 1) AS label
: it creates a column called label and assign a value of 1 if thetotals.transactions
column is not NULL (indicating a transaction occurred), and 0 otherwise. This column will be the target variable for our logistic regression model.IFNULL(device.operatingSystem, "") AS os
: it creates a column called os, which represents the operating system used by the device. Ifdevice.operatingSystem
is NULL, then it assigns an empty string (“”)device.isMobile AS is_mobile
: it creates a column called is_mobile, indicating whether the device is a mobile device (1
for true,0
for false).IFNULL(geoNetwork.country, "") AS country
: it creates a column called country, representing the country of the user. Just like the os column, ifgeoNetwork.country
is NULL, it assigns an empty string (“”).IFNULL(totals.pageviews, 0) AS pageviews
: it creates a column calledpageviews
, representing the number of pageviews. Iftotals.pageviews
is NULL, it assigns a value of 0.- The WHERE clause filters the data to include only sessions that occurred between 2016/01/01 and 2017/05/31. The
_TABLE_SUFFIX
pseudo column represents the date suffix of the table name. - And at last, limits the number of rows returned by the query to 100,000.
After running the SQL statement, we will get a sample_model stored in the dataset ga_sample
Explore the Training detail
Loss
The loss function quantifies how well the model’s predictions match the actual target values during training. It represents the error between the predicted output and the true output. The goal of training a model is to minimise this loss function.
Duration
Training duration refers to the amount of time it takes to train a model on a dataset.
Learn rate
The learning rate is a hyperparameter that controls the size of the steps taken during the optimisation process to update the model parameters. It determines how quickly or slowly the model learns from the training data.