Web Sales Prediction
Libraries
library(tidymodels)
library(vip)
Feature Engineering
Feature scaling will be performed in order to bring the numerical attributes onto similar scales. Since most machine learning algorithms prefer to work with numbers, the customer segment categories will also be converted from text to numbers.
m_res <- recipe(sales_per_visit ~ ., data = train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

m_res
Recipe
Inputs:
role #variables
outcome 1
predictor 7
Operations:
Centering and scaling for all_numeric_predictors()
Dummy variables from all_nominal_predictors()
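To make the dummy-variable step concrete, the same encoding that step_dummy() performs can be reproduced in base R. The segment labels below are made up for illustration and are not the levels in the project data:

```r
# Hypothetical segment labels; step_dummy() creates one indicator column
# per non-reference level, exactly like model.matrix() does below.
seg <- factor(c("new", "loyal", "lapsed", "loyal"))
dummies <- model.matrix(~ seg)[, -1, drop = FALSE]
dummies
```

The reference level (here "lapsed", the first alphabetically) gets no column; a row of zeros identifies it.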
Model
Three models will be trained and evaluated to see which of them has the best potential to generalize well on the test set. These models are linear regression, a decision tree, and a random forest.
lr_mdl <- linear_reg() |>
  set_engine("lm")

dt_mdl <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("regression")

rf_mdl <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("regression")
Workflow
Creating a workflow pipeline that combines the feature engineering recipe and the model specification. This will be done for each model.
lr_wf <- workflow() |>
  add_recipe(m_res) |>
  add_model(lr_mdl)

dt_wf <- workflow() |>
  add_recipe(m_res) |>
  add_model(dt_mdl)

rf_wf <- workflow() |>
  add_recipe(m_res) |>
  add_model(rf_mdl)
Cross Validation
To better evaluate model performance, the training set will be divided further into a smaller training set and a validation set; the models will then be trained on the smaller training set and evaluated on the validation set. There will be 10 separate folds (train set & validation set), and the split will be done using stratified sampling based on the newly created customer segment variable, so that each fold is a random sample close to the population (the entire training set).
set.seed(1122)

v_folds <- vfold_cv(train, v = 10, strata = "customer_segment")
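As a quick sanity check on why strata matters, the base-R sketch below assigns balanced fold ids within each level of a toy segment variable (hypothetical levels, not the project data) and compares segment proportions in one analysis set against the whole sample:

```r
set.seed(1122)
seg <- factor(sample(c("A", "B", "C"), 500, replace = TRUE,
                     prob = c(0.6, 0.3, 0.1)))

# Assign fold ids 1..10 within each segment, mimicking strata = "customer_segment"
fold_id <- ave(seq_along(seg), seg,
               FUN = function(i) sample(rep(1:10, length.out = length(i))))

# Segment proportions stay close to the population in the analysis set
prop.table(table(seg))
prop.table(table(seg[fold_id != 1]))
```

With stratified assignment, each fold removes roughly one tenth of every segment, so the proportions barely move; a plain random split can leave the smallest segment under-represented.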
Linear Regression
doParallel::registerDoParallel()

lr_fit <- lr_wf |>
  fit_resamples(resamples = v_folds,
                metrics = metric_set(rmse, rsq, mae),
                control = control_resamples(save_pred = TRUE, verbose = FALSE))

collect_metrics(lr_fit) |> cross_valid_tibble(model = "default")
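cross_valid_tibble() is a helper defined elsewhere in the report and is not shown in this section, so the sketch below is only a guess at its behavior (assumed: tag the aggregated metrics with a model label and round for display):

```r
# Assumed behavior only: label collect_metrics() output and round the mean.
cross_valid_tibble <- function(metrics, model = "default") {
  metrics$model <- model
  metrics$mean <- round(metrics$mean, 2)
  metrics[, c("model", ".metric", "mean", "std_err", "n")]
}

# A metrics-shaped data frame, as collect_metrics() would return
m <- data.frame(.metric = c("rmse", "mae"), .estimator = "standard",
                mean = c(46.0123, 22.2345), n = 10L, std_err = c(1.2, 0.8))
cross_valid_tibble(m, model = "lr")
```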
Decision Tree
dt_fit <- dt_wf |>
  fit_resamples(resamples = v_folds,
                metrics = metric_set(rmse, rsq, mae),
                control = control_resamples(save_pred = TRUE, verbose = FALSE))

collect_metrics(dt_fit) |> cross_valid_tibble(model = "default")
Random Forest
rf_fit <- rf_wf |>
  fit_resamples(resamples = v_folds,
                metrics = metric_set(rmse, rsq, mae),
                control = control_resamples(save_pred = TRUE, verbose = FALSE))

collect_metrics(rf_fit) |> cross_valid_tibble(model = "default")
Evaluating the performance of the three models, we can see that the random forest made the least error when predicting sales per visit, with an RMSE of 46.0 and an MAE of 22.2, far better than the decision tree (RMSE of 58.1, MAE of 34.8). Linear regression performed worst of the three.
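For reference, the two error metrics being compared are straightforward to compute by hand. The toy vectors below are illustrative only, not values from the project data:

```r
# RMSE squares the residuals before averaging, so it penalizes large errors
# more heavily than MAE; the two can therefore rank models differently.
rmse_manual <- function(truth, pred) sqrt(mean((truth - pred)^2))
mae_manual  <- function(truth, pred) mean(abs(truth - pred))

truth <- c(100, 150, 200)
pred  <- c(110, 140, 230)
rmse_manual(truth, pred)  # ~19.15
mae_manual(truth, pred)   # ~16.67
```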
Hyperparameter Tuning
Model performance can be improved further by feeding the best hyperparameter values to the model. A grid search will be used to save time and reduce errors. A new tunable model will be created for both the decision tree and the random forest.
Decision Tree
Experimenting with three decision tree hyperparameters:
cost_complexity: the cost complexity parameter (cp) used by CART models.
tree_depth: the maximum depth of the decision tree.
min_n: the minimum number of data points required for a node to be split further.
dt_t_mdl <- decision_tree(cost_complexity = tune(),
                          tree_depth = tune(),
                          min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")
# Update workflow model
dt_wf <- update_model(dt_wf, dt_t_mdl)
set.seed(1221)
doParallel::registerDoParallel()

dt_tune <- tune_grid(dt_wf,
                     resamples = v_folds,
                     grid = 10,
                     metrics = metric_set(mae))

collect_metrics(dt_tune) |> cross_valid_tibble("dt")
show_best(dt_tune, metric = "mae") |> cross_valid_tibble("dt")
autoplot(dt_tune) + theme_minimal()
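Note that grid = 10 asks tune_grid() to generate a 10-candidate space-filling design automatically. An explicit regular grid could be passed instead; the candidate values below are made up for illustration, not the ones tune_grid() actually drew:

```r
# Hypothetical regular grid over the three tuned decision tree parameters
dt_grid <- expand.grid(cost_complexity = 10^seq(-4, -1, length.out = 3),
                       tree_depth = c(4, 8, 15),
                       min_n = c(5, 20))
nrow(dt_grid)  # 3 * 3 * 2 = 18 candidate combinations
```

A regular grid grows multiplicatively with each parameter, which is why a small space-filling design is usually preferred when tuning several parameters at once.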
Based on the average MAE across all 10 splits, there is an improvement in model performance: the prediction error drops from 33.4 down to 20.1. Let's see how the random forest model performs.
Random Forest
Experimenting with two random forest hyperparameters:
mtry: the number of predictors randomly sampled at each split when building the trees.
min_n: the minimum number of data points required for a node to be split further.
rf_t_mdl <- rand_forest(mtry = tune(),
                        trees = 1000,
                        min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")
# Update workflow model
rf_wf <- update_model(rf_wf, rf_t_mdl)
set.seed(1221)
doParallel::registerDoParallel()

rf_tune <- tune_grid(rf_wf,
                     resamples = v_folds,
                     grid = 5,
                     metrics = metric_set(mae))
i Creating pre-processing data to finalize unknown parameter: mtry
collect_metrics(rf_tune) |> cross_valid_tibble("rf")
show_best(rf_tune, metric = "mae") |> cross_valid_tibble("rf")
autoplot(rf_tune) + theme_minimal()
The tuned random forest model has the best performance so far, with the lowest average error (20.4) across all 10 splits.
Feature Importance
# Finalize the tunable spec with the best hyperparameters, switching to
# permutation importance so vip() can rank the predictors.
imp_spec <- rf_t_mdl |>
  finalize_model(select_best(rf_tune, metric = "mae")) |>
  set_engine("ranger", importance = "permutation")
workflow() |>
add_recipe(m_res) |>
add_model(imp_spec) |>
fit(train) |>
extract_fit_parsnip() |>
vip(aesthetics = list(alpha = 0.9, fill = "#D3AB9E")) +
ggtitle("Feature Importance") +
scale_y_continuous(labels = scales::label_comma()) +
theme_minimal() +
theme(plot.title = element_text(color = "#888888"),
axis.title.x = element_text(color = "#8F8F8F"),
axis.text = element_text(color = "#8F8F8F"))
The number of visits from customers that involved at least one purchase is the most important variable when predicting sales per visit, closely followed by the number of unique items purchased, while the dummy variables for customer segment contribute the least.
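Since importance = "permutation" drives the plot above, here is the idea in miniature on a built-in dataset (mtcars, not the project data): shuffle one predictor, re-predict, and measure how much the error grows.

```r
set.seed(1)
fit <- lm(mpg ~ wt + hp, data = mtcars)
base_mae <- mean(abs(mtcars$mpg - predict(fit, mtcars)))

# Permutation importance: error increase after shuffling one predictor
perm_importance <- function(var) {
  shuffled <- mtcars
  shuffled[[var]] <- sample(shuffled[[var]])
  mean(abs(mtcars$mpg - predict(fit, shuffled))) - base_mae
}

sapply(c("wt", "hp"), perm_importance)
```

Predictors the model leans on heavily show a large error increase when shuffled; irrelevant ones show an increase near zero.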
Finalize workflow
Going forward, the tuned random forest model will be used to predict sales per visit on the test set to see how well it generalizes.
Load Testing data
test <- readr::read_delim("data/clothing_store_PCA_test", delim = ",") |>
  clean_add_variable()

final_rf <- rf_wf |>
  finalize_workflow(select_best(rf_tune, metric = "mae"))
Fit & Predict
fit_workflow <- fit(final_rf, train)

sales_prediction <- predict(fit_workflow, test)

pred_df <- sales_prediction |>
  mutate(actual = test$sales_per_visit)
Evaluation
mae(pred_df, truth = actual, estimate = .pred) |>
reactable(
theme = reactableTheme(color = "#919191",
headerStyle = list(color = "#858585")),
columns = list(.estimate = colDef(format = colFormat(digits = 3))),
bordered = TRUE
)
numeric_summary(pred_df, .pred, FALSE) |>
mutate(variable = "Prediction") |>
bind_rows(numeric_summary(pred_df, actual, FALSE) |> mutate(variable = "Actual")) |>
relocate(variable, .before = 1) |>
mutate(across(where(is.numeric), ~round(.x, 2))) |>
rename_with(axis_label) |>
reactable(
theme = reactableTheme(color = "#919191",
headerStyle = list(color = "#858585")),
defaultColDef = colDef(format = colFormat(separators = TRUE,
prefix = "$")),
columns = list(
Variable = colDef(style = list(color = "#777777",
fontWeight = "bold"),
format = colFormat(prefix = NULL))
) )
pred_df |>
  ggplot(aes(actual, .pred)) +
geom_abline(lty = 2, color = "#D3AB9E") +
geom_point(color = "#D3AB9E") +
scale_x_continuous(labels = scales::label_dollar()) +
scale_y_continuous(labels = scales::label_dollar()) +
theme_minimal() +
labs(x = "Actual", y = "Prediction",
title = "Actual Vs Predicted Sales Per Visit") +
theme(axis.title = element_text(color = "#5E5E5E"),
plot.title = element_text(color = "#5E5E5E"))
Prediction on the test set has a Mean Absolute Error of 20.5, which suggests the final random forest model is not overfitting the data. The scatter plot above also shows a clear pattern in the errors: smaller errors for sales below $300 and larger errors above that amount. Another obvious shortcoming is that the model never predicts sales per visit above $716.9.
Save Final Model
mdl_v <- vetiver::vetiver_model(model = fit_workflow,
                                model_name = "sales_random_forest")

model_board <- pins::board_folder(path = "model_workflow", versioned = TRUE)

vetiver::vetiver_pin_write(model_board, mdl_v)
Function for future prediction of sales per visit.
#' Predict Sales per visit on new set of records.
#'
#' @param board_path path to the saved pin board.
#' @param mdl_version The model version.
#' @param new_data A new data set to predict.
#'
#' @return A tibble with a single variable ".pred" which holds the new predictions of sales per visit.
#' @export
#'
#' @examples predict_sale_per_visit("~/sales_prediction/board", new_web_records)
#'
predict_sale_per_visit <- function(board_path, mdl_version, new_data) {
  model_board <- pins::board_folder(path = board_path)

  vers <- if (!missing(mdl_version)) mdl_version else NULL

  wf_model <- vetiver::vetiver_pin_read(board = model_board,
                                        name = "sales_random_forest",
                                        version = vers)

  # workflows must be attached so predict() dispatches on the workflow object
  library(workflows)

  predict(wf_model, new_data)
}