Machine Learning
in R with tidymodels

Jeffrey M. Girard | University of Kansas
Technology in Psychiatry Summit 2022

Overview

Instructor

Jeffrey Girard, PhD
www.jmgirard.com
jmgirard@ku.edu

Background

  • Assistant Professor, University of Kansas
  • Clinical Psychological Science Program
  • Brain, Behavior, and Quantitative Science Program

Research Areas

  • Psychological/Clinical Assessment
  • Affective/Interpersonal Communication
  • Applied Statistics and Machine Learning
  • Data Science and Software Engineering

Workshop Overview

  1. Conceptual Introductions

  2. Walkthrough #1

  3. Walkthrough #2

  4. Advanced Previews

  5. Question and Answer

  6. Practice Assignments

Workshop Website: jmgirard.github.io/tips2022ml

Workshop Slack: bit.ly/tips2022ml

Workshop Datasets

  • Walkthrough #1

    Drug Consumption \((n = 1876,\ p_{\text{cocaine user}} = 0.36)\)

    How well can we predict who is a cocaine user based on demographics, personality traits, and legal substance use?

  • Walkthrough #2

    Cali Housing \((n=2000, \text{Range}=\$15\text{k}\text{ to }\$500\text{k})\)

    How well can we predict the median house value of a block in 1990 California based on its houses, population, and location?

Conceptual Introductions

What is machine learning?

  • Machine learning (ML) is a branch of computer science

  • ML researchers develop algorithms that learn from data

  • When algorithms learn from data, they create models

  • This workshop is about creating predictive models

  • Our goal will be to use known/available information

  • To predict unknown/unavailable information

  • Ideally in a way that generalizes to novel/unseen data

Roles in Predictive Modeling

Labels / Outcomes

  • Labels are variables we want to predict the unknown values of

  • Labels tend to be expensive or difficult to measure in new data

  • Labels may also describe subjective experiences or future events

  • e.g., diagnoses, treatment response, substance use, suicide attempts

Features / Predictors

  • Features are variables we use to predict the unknown label values

  • Features tend to be relatively cheap and easy to measure in new data

  • e.g., demographics, self-reports, digital behavior, neurobiology

Modes of Predictive Modeling

Classification

  • When labels are categorical, predicting them is called “classification”

  • e.g., Will this patient respond to treatment or not?

  • e.g., Is this patient depressed, euthymic, or manic?

Regression

  • When labels are continuous, predicting them is called “regression”

  • e.g., How many days will the patient be hospitalized for?

  • e.g., How much of the drug will be in the patient’s blood?
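
In tidymodels, many algorithms support both modes; a minimal {parsnip} sketch of switching between them (previewing the syntax used in the walkthroughs below):

library(parsnip)

# The same algorithm can serve either mode
rf_class <- rand_forest() |> 
  set_mode("classification") |>   # categorical label (e.g., responder vs. non-responder)
  set_engine("ranger")

rf_reg <- rand_forest() |> 
  set_mode("regression") |>       # continuous label (e.g., days hospitalized)
  set_engine("ranger")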

A Delicate Balance

  • Any data we collect will contain a mixture of both signal and noise

    • Informative patterns that generalize to new data are the signal
    • Distracting patterns specific to the original data are the noise
  • We want to capture as much signal and as little noise as possible

    • We accomplish this by finding the optimal model complexity
  • More complex models capture more signal but also more noise

    • Overfitting: The model captures unwanted noise (too complex)
    • Underfitting: The model misses important signal (too simple)
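
A toy sketch of this balance, using simulated data and polynomial regression where the polynomial degree stands in for model complexity (all names here are illustrative, not part of the workshop datasets):

library(tidyverse)

set.seed(1)
# Simulate a nonlinear signal plus noise, then split into training and testing halves
dat <- tibble(x = runif(200, -2, 2), y = sin(2 * x) + rnorm(200, sd = 0.4))
train <- dat[1:100, ]
test  <- dat[101:200, ]

# Fit increasingly complex polynomial models and record both errors
map_dfr(1:12, function(degree) {
  fit <- lm(y ~ poly(x, degree), data = train)
  tibble(
    degree     = degree,
    train_rmse = sqrt(mean((train$y - predict(fit, train))^2)),
    test_rmse  = sqrt(mean((test$y - predict(fit, test))^2))
  )
})
# Training error keeps falling as complexity grows (overfitting),
# while testing error falls and then rises again (the sweet spot is in between)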

Model Complexity

Overfitting Memes

A Super Metaphor

  • ML’s superpower is its ability to learn complex patterns
  • However, its kryptonite (i.e., weakness) is its propensity to overfit
  • Thus, we must be careful to detect and counteract overfitting
  • Our primary tool for these goals is data splitting

    • We learn feature-label relationships and tune the model complexity using one part (the training set)

    • We evaluate the model’s ability to generalize to new data in another part that it has never seen before (the testing set)

A Simple Training Set

Two Competing Models

Bias (Error in Training Set)

Training Error A = 31.2 😟

Training Error B = 0.0 🤩

A Simple Testing Set

A Rematch in New Data

Variance (Error in Testing Set)

Testing Error A = 17.4 🙂

Testing Error B = 41.6 😨

Conclusions from Example

  • In ML, model “bias” is prediction error in the original data

  • In ML, model “variance” is prediction error in new/unseen data

  • An ideal model would have low bias and low variance

  • However, there is often an inherent bias-variance tradeoff

    • We may counterintuitively prefer a model with more bias…
    • Because accuracy in new data is the most important thing!
  • We want a model that is as simple as possible but no simpler

  • We will thus “tune” our models to try to achieve this balance

    • We will search for parameter values with the least variance

Tuning Complexity for Variance

Machine Learning Steps

  • Data Preparation: Importing, Tidying, Splitting
  • Model Development: Feature Engineering, Model Specification
  • Model Tuning: Resampling, Optimization, Selection
  • Model Evaluation: Performance, (Comparison), (Utility)
  • Model Use: Explanation, (Deployment), (Updating)

Classification Example

Predict cocaine use from demographics and personality

1. Import and tidy data

We load our data from a local text (i.e., CSV) file

We coerce all categorical variables to explicit factors

We retain only features/predictors and our label/outcome

# Load the tidyverse package for data reading and tidying
library(tidyverse)

drugs <- 
  # Read in data from CSV file
  read_csv("./data/drug_consumption.csv", show_col_types = FALSE) |> 
  # Transform categorical variables into factors
  mutate(
    across(c(Age:Ethnicity, Caff:Nicotine), factor),
    Coke = factor(Coke, levels = c(1, 0))  # Set positive class first
  ) |> 
  # Drop unneeded variables
  select(Age:Nicotine, Coke)

2. Split data for training and testing

We use 80% of the data for “training” and 20% for “testing”

We stratify both sets by our outcome variable (cocaine use)

# Load the tidymodels package for machine learning
library(tidymodels)

# Set random number generation seed for reproducibility
set.seed(2022)

# Create initial split for holdout validation
coke_split <- initial_split(drugs, prop = 0.8, strata = Coke)

# Extract and save the training set (for development and tuning)
coke_train <- training(coke_split)

# Extract and save the testing set (for evaluation only)
coke_test <- testing(coke_split)

3. Prepare feature engineering recipe

We assign a role (“outcome” or “predictor”) to each variable

We engineer (e.g., transform and filter) the features/predictors

coke_recipe <-
  # Initialize the recipe from the training set
  recipe(coke_train) |> 
  # Assign the outcome role to the label variable
  update_role(Coke, new_role = "outcome") |> 
  # Assign the predictor role to the feature variables
  update_role(Age:Nicotine, new_role = "predictor") |> 
  # Dummy code all nominal (i.e., factor) predictors
  step_dummy(all_nominal_predictors()) |> 
  # Remove predictors with near zero variance
  step_nzv(all_predictors()) |> 
  # Remove predictors that are highly inter-correlated
  step_corr(all_predictors()) |> 
  # Remove predictors that are linear combinations
  step_lincomb(all_predictors())

4. Set up model and parameters

We specify a Random Forest classifier using {ranger}

We specify that all three model parameters should be tuned

We set the “importance” metric for later model explanations

# Optional: view documentation for the random forest algorithm
?rand_forest

coke_model <-
  # Select the algorithm and which parameters to tune
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |> 
  # Select the prediction mode (classification or regression)
  set_mode("classification") |> 
  # Select the engine to use and, optionally, the importance metric
  set_engine("ranger", importance = "impurity")

5. Prepare workflow

We combine our recipe and model into a “workflow” object

This helps R to repeat everything correctly during tuning

# Combine recipe and model specification into a workflow
coke_wflow <-
  workflow() |> 
  add_recipe(coke_recipe) |> 
  add_model(coke_model)

6. Tune parameter values

We split the training set into folds for cross-validation

We try lots of different values and record their performance

# Create 10-fold CV within training set for tuning
coke_folds <- vfold_cv(coke_train, v = 10, strata = Coke)

# Pick reasonable boundaries for the tuning parameters
coke_param <-
  extract_parameter_set_dials(coke_model) |> 
  finalize(coke_folds)

# Create and search over a grid of parameter values within boundaries
coke_tune <-
  coke_wflow |> 
  tune_grid(
    resamples = coke_folds,
    param_info = coke_param,
    grid = 20 # how many combinations to try? (more is better but slower)
  )

7. Finalize the workflow

We select our final parameter values from the tuning results

We refit to the entire training set and apply it to the testing set

# Select the best parameter values
coke_param_final <- select_best(coke_tune, metric = "accuracy")

# Finalize the workflow with the best parameter values
coke_wflow_final <-
  coke_wflow |> 
  finalize_workflow(coke_param_final)

# Fit the finalized workflow to training set and apply it to testing set
coke_final <-
  coke_wflow_final |> 
  last_fit(coke_split)

8. Evaluate model performance

We compare the test set predictions to their labels

We make a confusion matrix and ROC curve (for classification)

# View the metrics (from the testing set)
collect_metrics(coke_final)

# Collect predictions (from the testing set)
coke_pred <- collect_predictions(coke_final)
coke_pred

# Calculate and plot confusion matrix
coke_cm <- conf_mat(coke_pred, truth = Coke, estimate = .pred_class)
coke_cm
autoplot(coke_cm, type = "heatmap")
summary(coke_cm)

# Calculate and plot ROC curve
coke_rc <- roc_curve(coke_pred, truth = Coke, .pred_1)
autoplot(coke_rc)

Bonus: Explore variable importance

Which variables were most important to prediction?

  • Not all algorithms support variable importance measures

library(vip)

# Extract the variable importance measures
coke_final |> 
  extract_fit_parsnip() |> 
  vi()

# Plot the variable importance measures
coke_final |> 
  extract_fit_parsnip() |> 
  vip()

Putting it all together

library(tidyverse)
library(tidymodels)
library(vip)

# 1. Import and tidy data
drugs <- 
  read_csv("./data/drug_consumption.csv", show_col_types = FALSE) |> 
  mutate(
    across(c(Age:Ethnicity, Caff:Nicotine), factor),
    Coke = factor(Coke, levels = c(1, 0))
  ) |> 
  select(Age:Nicotine, Coke)

# 2. Split data for training and testing
set.seed(2022)
coke_split <- initial_split(drugs, prop = 0.8, strata = Coke)
coke_train <- training(coke_split)
coke_test <- testing(coke_split)

# 3. Prepare preprocessing recipe
coke_recipe <-
  recipe(coke_train) |> 
  update_role(Coke, new_role = "outcome") |> 
  update_role(Age:Nicotine, new_role = "predictor") |> 
  step_dummy(all_nominal_predictors()) |> 
  step_nzv(all_predictors()) |> 
  step_corr(all_predictors()) |> 
  step_lincomb(all_predictors())

# 4. Set up model and parameters
coke_model <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |> 
  set_mode("classification") |> 
  set_engine("ranger", importance = "impurity")

# 5. Prepare workflow
coke_wflow <-
  workflow() |> 
  add_recipe(coke_recipe) |> 
  add_model(coke_model)

# 6. Tune parameters using grid search
coke_folds <- vfold_cv(coke_train, v = 10, strata = Coke)
coke_param <-
  extract_parameter_set_dials(coke_model) |> 
  finalize(coke_folds)
coke_tune <-
  coke_wflow |> 
  tune_grid(
    resamples = coke_folds,
    param_info = coke_param,
    grid = 20
  )

# 7. Finalize the workflow
coke_param_final <- select_best(coke_tune, metric = "accuracy")
coke_wflow_final <-
  coke_wflow |> 
  finalize_workflow(coke_param_final)
coke_final <-
  coke_wflow_final |> 
  last_fit(coke_split)

# 8. Evaluate model performance
collect_metrics(coke_final)
coke_pred <- collect_predictions(coke_final)
coke_cm <- conf_mat(coke_pred, truth = Coke, estimate = .pred_class)
autoplot(coke_cm, type = "heatmap")
summary(coke_cm)
coke_rc <- roc_curve(coke_pred, truth = Coke, .pred_1)
autoplot(coke_rc)

# Bonus: Explore variable importance
coke_final |> 
  extract_fit_parsnip() |> 
  vi()
coke_final |> 
  extract_fit_parsnip() |> 
  vip()

Regression Example

Predict median housing value from block information

Necessary Changes

To adapt this process to regression, we only need to:

  • Assign a continuous variable to the “outcome” role

  • Select an algorithm that supports regression

    (Note that some support both, e.g., random forest)

  • Set the prediction mode to “regression”

  • Use regression performance metrics (e.g., RMSE or \(R^2\))

  • Skip the confusion matrix and ROC curve

    (We can instead plot predictions against trusted labels)
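
For example, a custom set of regression metrics can be requested instead of the defaults (a brief sketch; the full walkthrough below simply uses the tidymodels defaults of RMSE and \(R^2\)):

# Bundle regression metrics with {yardstick} (loaded with tidymodels)
reg_metrics <- metric_set(rmse, rsq, mae)

# These could then be passed to the tuning or evaluation functions, e.g.,
# tune_grid(..., metrics = reg_metrics) or last_fit(..., metrics = reg_metrics)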

Putting it all together… again

library(tidyverse)
library(tidymodels)
library(vip)

# 1. Import and tidy data
calihouse <- read_csv("./data/calihousing.csv", show_col_types = FALSE)

# 2. Split data for training and testing
set.seed(2022)
ch_split <- initial_split(calihouse, prop = 0.8, strata = house_mdn_value)
ch_train <- training(ch_split)
ch_test <- testing(ch_split)

# 3. Prepare preprocessing recipe
ch_recipe <-
  recipe(ch_train) |> 
  update_role(house_mdn_value, new_role = "outcome") |>
  update_role(house_mdn_age:dist_sj, new_role = "predictor") |> 
  step_nzv(all_predictors()) |> 
  step_corr(all_predictors()) |> 
  step_lincomb(all_predictors()) |> 
  step_normalize(all_numeric_predictors())

# 4. Set up model and parameters
ch_model <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |> 
  set_mode("regression") |>
  set_engine("ranger", importance = "impurity")

# 5. Prepare workflow
ch_wflow <-
  workflow() |> 
  add_recipe(ch_recipe) |> 
  add_model(ch_model)

# 6. Tune parameters using grid search
ch_folds <- vfold_cv(ch_train, v = 10, strata = house_mdn_value)
ch_param <-
  ch_model |> 
  extract_parameter_set_dials() |> 
  finalize(ch_folds)
ch_tune <-
  ch_wflow |> 
  tune_grid(
    resamples = ch_folds,
    param_info = ch_param,
    grid = 20
  )

# 7. Finalize the workflow
ch_param_final <- select_best(ch_tune, metric = "rmse")
ch_wflow_final <-
  ch_wflow |> 
  finalize_workflow(ch_param_final)
ch_final <-
  ch_wflow_final |> 
  last_fit(ch_split)

# 8. Evaluate model performance
collect_metrics(ch_final)
ch_pred <- collect_predictions(ch_final)
ggplot(ch_pred, aes(x = house_mdn_value, y = .pred)) +
  geom_point(alpha = 1/5) +
  geom_abline() +
  coord_obs_pred()

# Bonus: Explore variable importance
ch_final |> 
  extract_fit_parsnip() |> 
  vi()
ch_final |> 
  extract_fit_parsnip() |> 
  vip()

Advanced Previews

Data Splitting and Preprocessing

  • Clustered and longitudinal data

    • rsample::group_vfold_cv()
    • rsample::rolling_origin()
  • Nested cross-validation

    • rsample::nested_cv()
  • Dimensionality Reduction

    • recipes::step_pca()
    • recipes::step_nnmf()
  • Imputing missing predictors

    • recipes::step_impute_linear()
    • recipes::step_impute_knn()
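
A few hedged sketches of these tools (my_data, my_recipe, and subject_id are hypothetical placeholders, not workshop objects):

# Keep all rows from the same subject in the same fold (clustered data)
folds <- group_vfold_cv(my_data, group = subject_id, v = 10)

# Reduce many correlated predictors to a few principal components
my_recipe <- my_recipe |> 
  step_pca(all_numeric_predictors(), num_comp = 5)

# Impute missing predictor values from their nearest neighbors
my_recipe <- my_recipe |> 
  step_impute_knn(all_predictors(), neighbors = 5)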

Development and Tuning

  • Alternative algorithms

    • parsnip::linear_reg(penalty, mixture)
    • parsnip::logistic_reg(penalty, mixture)
    • parsnip::svm_rbf(cost)
  • Alternative tuning procedures

    • finetune::tune_sim_anneal()
    • finetune::tune_race_anova()
  • Combining models into ensembles

    • stacks::stacks()
    • stacks::blend_predictions()
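
For instance, a regularized (elastic net) logistic regression could be swapped into the Walkthrough #1 workflow and tuned with racing, a sketch assuming the {glmnet} and {finetune} packages are installed:

library(finetune)

# Elastic net logistic regression: penalty and mixture are tuned
enet_model <-
  logistic_reg(penalty = tune(), mixture = tune()) |> 
  set_mode("classification") |> 
  set_engine("glmnet")

# Racing drops clearly inferior parameter combinations early to save time
enet_tune <-
  coke_wflow |> 
  update_model(enet_model) |> 
  tune_race_anova(resamples = coke_folds, grid = 20)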

Model Evaluation

  • Workflow sets and parallelization

    • workflowsets::workflow_set()
    • tune::control_grid()
  • Advanced/alternative metrics

    • yardstick::mcc() or yardstick::mn_log_loss()
    • yardstick::ccc() or yardstick::huber_loss()
  • Statistical model comparison

    • tidyposterior::perf_mod()
    • tidyposterior::contrast_models()
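
For example, alternative classification metrics and saved predictions can be requested during tuning (a sketch reusing objects from Walkthrough #1):

# Ask for Matthews correlation and log loss alongside ROC AUC
class_metrics <- metric_set(roc_auc, mcc, mn_log_loss)

coke_tune <-
  coke_wflow |> 
  tune_grid(
    resamples = coke_folds,
    grid = 20,
    metrics = class_metrics,
    control = control_grid(save_pred = TRUE)  # keep out-of-fold predictions
  )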

Explanation and Deployment

  • Advanced model explanations

    • DALEXtra::explain_tidymodels()
    • DALEXtra::model_parts()
    • DALEXtra::predict_parts()
    • DALEXtra::model_profile()
  • Bundling and deploying models

    • bundle::bundle()
    • vetiver::vetiver_model()
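
A brief sketch of model explanation with {DALEXtra}, assuming it is installed and pulling the fitted workflow from the last_fit() result of Walkthrough #1:

library(DALEXtra)

# Build an explainer around the fitted classification workflow
coke_explainer <- explain_tidymodels(
  extract_workflow(coke_final),
  data = select(coke_train, -Coke),
  y = as.integer(coke_train$Coke == "1"),
  label = "random forest"
)

# Permutation-based variable importance across the whole model
plot(model_parts(coke_explainer))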

Question and Answer

Resources

Practice Assignments

Practice Assignments

  1. Refit the model from Walkthrough #1 using a regularized logistic regression classifier and the {glmnet} engine:
logistic_reg(penalty = tune(), mixture = tune()) |> 
set_mode("classification") |> 
set_engine("glmnet")
  2. Adjust the model from Walkthrough #1 to predict Methamphetamine use (Meth) rather than Cocaine use (Coke)
  3. Adjust the model from Walkthrough #2 to search over a larger grid of 125 parameter values (i.e., 5 unique values for each parameter, or \(5^3\))
  4. Explore the additional datasets on the workshop website and adapt your code to build a predictive model for one of these datasets.