In the tutorial “Create a basic Regression model with BigQuery ML” we have discussed how to create a dataset and how to create a regression model, today we will go deeper, to learn how to evaluate a model and make predictions with SQL statement.
Evaluate a model
SELECT * FROM ml.EVALUATE(MODEL `ga_sample.sample_model`, ( SELECT IF(totals.transactions IS NULL, 0, 1) AS label, IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(geoNetwork.country, "") AS country, IFNULL(totals.pageviews, 0) AS pageviews FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20170601' AND '20170801'));
Here we use ml.EVALUATE()
function. It applies the model to the evaluation dataset to make predictions and compares these predictions to the actual target values in the evaluation dataset. Based on this comparison, it calculates various performance metrics to evaluate the model’s effectiveness, such as accuracy, precision, recall, F1 score, and more, depending on the type of machine learning task.
Typically, we can divide dataset into training and testing subsets using a split ratio of 80:20. In this particular instance, we perform the split based on timestamps.
Training data: the data between 2016/01/01 and 2017/05/31 is training data
Testing data: the data between 2017/06/01 and 2017/08/01 is testing data
Then we will see result like this:
precision | recall | accuracy | f1_score | log_loss | roc_auc |
---|---|---|---|---|---|
0.45098039215686275 | 0.10247524752475247 | 0.98503037420439887 | 0.16700282371924163 | 0.046486256280837281 | 0.98130169830169833 |
Make Predictions
1. Predict purchases per country
SELECT country, SUM(predicted_label) as total_predicted_purchases FROM ml.PREDICT(MODEL `ga_sample.sample_model`, ( SELECT IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(totals.pageviews, 0) AS pageviews, IFNULL(geoNetwork.country, "") AS country FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170801')) GROUP BY country ORDER BY total_predicted_purchases DESC LIMIT 10;
Here we use ML.PREDICT
() function, which is a powerful tool for deploying machine learning models in BigQuery. It allows you to easily apply the trained models to new data and obtain predictions without having to write complex prediction code.
In this case, I want to predict the top 10 countries with the highest total predicted purchases.
We will get result like this:
Row | country | total_predicted_purchases |
---|---|---|
1 | United States | 3700 |
2 | Canada | 101 |
3 | Venezuela | 66 |
4 | Japan | 35 |
5 | Taiwan | 27 |
6 | Australia | 26 |
7 | India | 20 |
8 | Singapore | 19 |
9 | Hong Kong | 19 |
10 | Mexico | 15 |
2. Predict purchases per user
SELECT fullVisitorId, SUM(predicted_label) as total_predicted_purchases FROM ml.PREDICT(MODEL `ga_sample.sample_model`, ( SELECT IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(totals.pageviews, 0) AS pageviews, IFNULL(geoNetwork.country, "") AS country, fullVisitorId FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801')) GROUP BY fullVisitorId ORDER BY total_predicted_purchases DESC LIMIT 10;
In this case, I want to predict the top 10 users with the highest total predicted purchases.
We will get result like this:
Row | fullVisitorId | total_predicted_purchases |
---|---|---|
1 | 9417857471295131045 | 4 |
2 | 112288330928895942 | 2 |
3 | 2969418676126258798 | 2 |
4 | 2105122376016897629 | 2 |
5 | 1280993661204347450 | 2 |
6 | 8388931032955052746 | 2 |
7 | 7420300501523012460 | 2 |
8 | 0376394056092189113 | 2 |
9 | 8064625150033508396 | 2 |
10 | 806992249032686650 | 2 |