Machine Learning
in R with tidymodels

Jeffrey M. Girard | University of Kansas
Technology in Psychiatry Summit 2022

Overview

Instructor

Jeffrey Girard, PhD
www.jmgirard.com
jmgirard@ku.edu

Background

  • Assistant Professor, University of Kansas
  • Clinical Psychological Science Program
  • Brain, Behavior, and Quantitative Science Program

Research Areas

  • Psychological/Clinical Assessment
  • Affective/Interpersonal Communication
  • Applied Statistics and Machine Learning
  • Data Science and Software Engineering

Workshop Overview

  1. Conceptual Introductions

  2. Walkthrough #1

  3. Walkthrough #2

  4. Advanced Previews

  5. Question and Answer

  6. Practice Assignments

Workshop Website: jmgirard.github.io/tips2022ml

Workshop Slack: bit.ly/tips2022ml

Workshop Datasets

  • Walkthrough #1

    Drug Consumption \((n = 1876,\ p_{\text{cocaine user}} = 0.36)\)

    How well can we predict who is a cocaine user based on demographics, personality traits, and legal substance use?

  • Walkthrough #2

    Cali Housing \((n=2000, \text{Range}=\$15\text{k}\text{ to }\$500\text{k})\)

    How well can we predict the median house value of a block in 1990 California based on its houses, population, and location?

Conceptual Introductions

What is machine learning?

  • Machine learning (ML) is a branch of computer science

  • ML researchers develop algorithms that learn from data

  • When algorithms learn from data, they create models

  • This workshop is about creating predictive models

  • Our goal will be to use known/available information

  • To predict unknown/unavailable information

  • Ideally in a way that generalizes to novel/unseen data

Roles in Predictive Modeling

Labels / Outcomes

  • Labels are variables we want to predict the unknown values of

  • Labels tend to be expensive or difficult to measure in new data

  • Labels may also describe subjective experiences or future events

  • e.g., diagnoses, treatment response, substance use, suicide attempts

Features / Predictors

  • Features are variables we use to predict the unknown label values

  • Features tend to be relatively cheap and easy to measure in new data

  • e.g., demographics, self-reports, digital behavior, neurobiology

Modes of Predictive Modeling

Classification

  • When labels are categorical, predicting them is called “classification”

  • e.g., Will this patient respond to treatment or not?

  • e.g., Is this patient depressed, euthymic, or manic?

Regression

  • When labels are continuous, predicting them is called “regression”

  • e.g., How many days will the patient be hospitalized for?

  • e.g., How much of the drug will be in the patient’s blood?
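
In tidymodels, many algorithms support both modes; a minimal {parsnip} sketch of switching between them (previewing the syntax used in the walkthroughs below):

library(parsnip)

# The same algorithm can serve either mode
rf_class <- rand_forest() |> 
  set_mode("classification") |>   # categorical label (e.g., responder vs. non-responder)
  set_engine("ranger")

rf_reg <- rand_forest() |> 
  set_mode("regression") |>       # continuous label (e.g., days hospitalized)
  set_engine("ranger")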

A Delicate Balance

  • Any data we collect will contain a mixture of both signal and noise

    • Informative patterns that generalize to new data are the signal
    • Distracting patterns specific to the original data are the noise
  • We want to capture as much signal and as little noise as possible

    • We accomplish this by finding the optimal model complexity
  • More complex models capture more signal but also more noise

    • Overfitting: The model captures unwanted noise (too complex)
    • Underfitting: The model misses important signal (too simple)
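
A toy sketch of this balance, using simulated data and polynomial regression where the polynomial degree stands in for model complexity (all names here are illustrative, not part of the workshop datasets):

library(tidyverse)

set.seed(1)
# Simulate a nonlinear signal plus noise, then split into training and testing halves
dat <- tibble(x = runif(200, -2, 2), y = sin(2 * x) + rnorm(200, sd = 0.4))
train <- dat[1:100, ]
test  <- dat[101:200, ]

# Fit increasingly complex polynomial models and record both errors
map_dfr(1:12, function(degree) {
  fit <- lm(y ~ poly(x, degree), data = train)
  tibble(
    degree     = degree,
    train_rmse = sqrt(mean((train$y - predict(fit, train))^2)),
    test_rmse  = sqrt(mean((test$y - predict(fit, test))^2))
  )
})
# Training error keeps falling as complexity grows (overfitting),
# while testing error falls and then rises again (the sweet spot is in between)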

Model Complexity

Overfitting Memes

A Super Metaphor

  • ML’s superpower is its ability to learn complex patterns
  • However, its kryptonite (i.e., weakness) is its propensity to overfit
  • Thus, we must be careful to detect and counteract overfitting
  • Our primary tool for these goals is data splitting

    • We learn feature-label relationships and tune the model complexity using one part (the training set)

    • We evaluate the model’s ability to generalize to new data in another part that it has never seen before (the testing set)

A Simple Training Set

Two Competing Models

Bias (Error in Training Set)

Training Error A = 31.2 😟

Training Error B = 0.0 🤩

A Simple Testing Set

A Rematch in New Data

Variance (Error in Testing Set)

Testing Error A = 17.4 🙂

Testing Error B = 41.6 😨

Conclusions from Example

  • In ML, model “bias” is prediction error in the original data

  • In ML, model “variance” is prediction error in new/unseen data

  • An ideal model would have low bias and low variance

  • However, there is often an inherent bias-variance tradeoff

    • We may counterintuitively prefer a model with more bias…
    • Because accuracy in new data is the most important thing!
  • We want a model that is as simple as possible but no simpler

  • We will thus “tune” our models to try to achieve this balance

    • We will search for parameter values with the least variance

Tuning Complexity for Variance

Machine Learning Steps

  • Data Preparation: Importing, Tidying, Splitting
  • Model Development: Feature Engineering, Model Specification
  • Model Tuning: Resampling, Optimization, Selection
  • Model Evaluation: Performance, (Comparison), (Utility)
  • Model Use: Explanation, (Deployment), (Updating)

Classification Example

Predict cocaine use from demographics and personality

1. Import and tidy data

We load our data from a local text (i.e., CSV) file

We coerce all categorical variables to explicit factors

We retain only features/predictors and our label/outcome

# Load the tidyverse package for data reading and tidying
library(tidyverse)

drugs <- 
  # Read in data from CSV file
  read_csv("./data/drug_consumption.csv", show_col_types = FALSE) |> 
  # Transform categorical variables into factors
  mutate(
    across(c(Age:Ethnicity, Caff:Nicotine), factor),
    Coke = factor(Coke, levels = c(1, 0))  # Set positive class first
  ) |> 
  # Drop unneeded variables
  select(Age:Nicotine, Coke)

2. Split data for training and testing

We use 80% of the data for “training” and 20% for “testing”

We stratify both sets by our outcome variable (cocaine use)

# Load the tidymodels package for machine learning
library(tidymodels)

# Set random number generation seed for reproducibility
set.seed(2022)

# Create initial split for holdout validation
coke_split <- initial_split(drugs, prop = 0.8, strata = Coke)

# Extract and save the training set (for development and tuning)
coke_train <- training(coke_split)

# Extract and save the testing set (for evaluation only)
coke_test <- testing(coke_split)

3. Prepare feature engineering recipe

We assign a role (“outcome” or “predictor”) to each variable

We engineer (e.g., transform and filter) the features/predictors

coke_recipe <-
  # Initialize the recipe from the training set
  recipe(coke_train) |> 
  # Assign the outcome role to the label variable
  update_role(Coke, new_role = "outcome") |> 
  # Assign the predictor role to the feature variables
  update_role(Age:Nicotine, new_role = "predictor") |> 
  # Dummy code all nominal (i.e., factor) predictors
  step_dummy(all_nominal_predictors()) |> 
  # Remove predictors with near zero variance
  step_nzv(all_predictors()) |> 
  # Remove predictors that are highly inter-correlated
  step_corr(all_predictors()) |> 
  # Remove predictors that are linear combinations
  step_lincomb(all_predictors())

4. Set up model and parameters

We specify a Random Forest classifier using {ranger}

We specify that all three model parameters should be tuned

We set the “importance” metric for later model explanations

# Optional: view documentation for the random forest algorithm
?rand_forest

coke_model <-
  # Select the algorithm and which parameters to tune
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |> 
  # Select the prediction mode (classification or regression)
  set_mode("classification") |> 
  # Select the engine to use and, optionally, the importance metric
  set_engine("ranger", importance = "impurity")

5. Prepare workflow

We combine our recipe and model into a “workflow” object

This helps R to repeat everything correctly during tuning

# Combine recipe and model specification into a workflow
coke_wflow <-
  workflow() |> 
  add_recipe(coke_recipe) |> 
  add_model(coke_model)

6. Tune parameter values

We split the training set into folds for cross-validation

We try lots of different values and record their performance

# Create 10-fold CV within training set for tuning
coke_folds <- vfold_cv(coke_train, v = 10, strata = Coke)

# Pick reasonable boundaries for the tuning parameters
coke_param <-
  extract_parameter_set_dials(coke_model) |> 
  finalize(coke_folds)

# Create and search over a grid of parameter values within boundaries
coke_tune <-
  coke_wflow |> 
  tune_grid(
    resamples = coke_folds,
    param_info = coke_param,
    grid = 20 # how many combinations to try? (more is better but slower)
  )

7. Finalize the workflow

We select our final parameter values from the tuning results

We refit to the entire training set and apply it to the testing set

# Select the best parameter values
coke_param_final <- select_best(coke_tune, metric = "accuracy")

# Finalize the workflow with the best parameter values
coke_wflow_final <-
  coke_wflow |> 
  finalize_workflow(coke_param_final)

# Fit the finalized workflow to training set and apply it to testing set
coke_final <-
  coke_wflow_final |> 
  last_fit(coke_split)

8. Evaluate model performance

We compare the test set predictions to their labels

We make a confusion matrix and ROC curve (for classification)

# View the metrics (from the testing set)
collect_metrics(coke_final)

# Collect predictions (from the testing set)
coke_pred <- collect_predictions(coke_final)
coke_pred

# Calculate and plot confusion matrix
coke_cm <- conf_mat(coke_pred, truth = Coke, estimate = .pred_class)
coke_cm
autoplot(coke_cm, type = "heatmap")
summary(coke_cm)

# Calculate and plot ROC curve
coke_rc <- roc_curve(coke_pred, truth = Coke, .pred_1)
autoplot(coke_rc)

Bonus: Explore variable importance

Which variables were most important to prediction?

  • Not all algorithms support variable importance measures

library(vip)

# Extract the variable importance measures
coke_final |> 
  extract_fit_parsnip() |> 
  vi()

# Plot the variable importance measures
coke_final |> 
  extract_fit_parsnip() |> 
  vip()

Putting it all together

library(tidyverse)
library(tidymodels)
library(vip)

# 1. Import and tidy data
drugs <- 
  read_csv("./data/drug_consumption.csv", show_col_types = FALSE) |> 
  mutate(
    across(c(Age:Ethnicity, Caff:Nicotine), factor),
    Coke = factor(Coke, levels = c(1, 0))
  ) |> 
  select(Age:Nicotine, Coke)

# 2. Split data for training and testing
set.seed(2022)
coke_split <- initial_split(drugs, prop = 0.8, strata = Coke)
coke_train <- training(coke_split)
coke_test <- testing(coke_split)

# 3. Prepare preprocessing recipe
coke_recipe <-
  recipe(coke_train) |> 
  update_role(Coke, new_role = "outcome") |> 
  update_role(Age:Nicotine, new_role = "predictor") |> 
  step_dummy(all_nominal_predictors()) |> 
  step_nzv(all_predictors()) |> 
  step_corr(all_predictors()) |> 
  step_lincomb(all_predictors())

# 4. Set up model and parameters
coke_model <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |> 
  set_mode("classification") |> 
  set_engine("ranger", importance = "impurity")

# 5. Prepare workflow
coke_wflow <-
  workflow() |> 
  add_recipe(coke_recipe) |> 
  add_model(coke_model)

# 6. Tune parameters using grid search
coke_folds <- vfold_cv(coke_train, v = 10, strata = Coke)
coke_param <-
  extract_parameter_set_dials(coke_model) |> 
  finalize(coke_folds)
coke_tune <-
  coke_wflow |> 
  tune_grid(
    resamples = coke_folds,
    param_info = coke_param,
    grid = 20
  )

# 7. Finalize the workflow
coke_param_final <- select_best(coke_tune, metric = "accuracy")
coke_wflow_final <-
  coke_wflow |> 
  finalize_workflow(coke_param_final)
coke_final <-
  coke_wflow_final |> 
  last_fit(coke_split)

# 8. Evaluate model performance
collect_metrics(coke_final)
coke_pred <- collect_predictions(coke_final)
coke_cm <- conf_mat(coke_pred, truth = Coke, estimate = .pred_class)
autoplot(coke_cm, type = "heatmap")
summary(coke_cm)
coke_rc <- roc_curve(coke_pred, truth = Coke, .pred_1)
autoplot(coke_rc)

# Bonus: Explore variable importance
coke_final |> 
  extract_fit_parsnip() |> 
  vi()
coke_final |> 
  extract_fit_parsnip() |> 
  vip()

Regression Example

Predict median housing value from block information

Necessary Changes

To adapt this process to regression, we only need to:

  • Assign a continuous variable to the “outcome” role

  • Select an algorithm that supports regression

    (Note that some support both, e.g., random forest)

  • Set the prediction mode to “regression”

  • Use regression performance metrics (e.g., RMSE or \(R^2\))

  • Skip the confusion matrix and ROC curve

    (We can instead plot predictions against trusted labels)
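
For example, a custom set of regression metrics can be requested instead of the defaults (a brief sketch; the full walkthrough below simply uses the tidymodels defaults of RMSE and \(R^2\)):

# Bundle regression metrics with {yardstick} (loaded with tidymodels)
reg_metrics <- metric_set(rmse, rsq, mae)

# These could then be passed to the tuning or evaluation functions, e.g.,
# tune_grid(..., metrics = reg_metrics) or last_fit(..., metrics = reg_metrics)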

Putting it all together… again

library(tidyverse)
library(tidymodels)
library(vip)

# 1. Import and tidy data
calihouse <- read_csv("./data/calihousing.csv", show_col_types = FALSE)

# 2. Split data for training and testing
set.seed(2022)
ch_split <- initial_split(calihouse, prop = 0.8, strata = house_mdn_value)
ch_train <- training(ch_split)
ch_test <- testing(ch_split)

# 3. Prepare preprocessing recipe
ch_recipe <-
  recipe(ch_train) |> 
  update_role(house_mdn_value, new_role = "outcome") |>
  update_role(house_mdn_age:dist_sj, new_role = "predictor") |> 
  step_nzv(all_predictors()) |> 
  step_corr(all_predictors()) |> 
  step_lincomb(all_predictors()) |> 
  step_normalize(all_numeric_predictors())

# 4. Set up model and parameters
ch_model <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |> 
  set_mode("regression") |>
  set_engine("ranger", importance = "impurity")

# 5. Prepare workflow
ch_wflow <-
  workflow() |> 
  add_recipe(ch_recipe) |> 
  add_model(ch_model)

# 6. Tune parameters using grid search
ch_folds <- vfold_cv(ch_train, v = 10, strata = house_mdn_value)
ch_param <-
  ch_model |> 
  extract_parameter_set_dials() |> 
  finalize(ch_folds)
ch_tune <-
  ch_wflow |> 
  tune_grid(
    resamples = ch_folds,
    param_info = ch_param,
    grid = 20
  )

# 7. Finalize the workflow
ch_param_final <- select_best(ch_tune, metric = "rmse")
ch_wflow_final <-
  ch_wflow |> 
  finalize_workflow(ch_param_final)
ch_final <-
  ch_wflow_final |> 
  last_fit(ch_split)

# 8. Evaluate model performance
collect_metrics(ch_final)
ch_pred <- collect_predictions(ch_final)
ggplot(ch_pred, aes(x = house_mdn_value, y = .pred)) +
  geom_point(alpha = 1/5) +
  geom_abline() +
  coord_obs_pred()

# Bonus: Explore variable importance
ch_final |> 
  extract_fit_parsnip() |> 
  vi()
ch_final |> 
  extract_fit_parsnip() |> 
  vip()

Advanced Previews

Data Splitting and Preprocessing

  • Clustered and longitudinal data

    • rsample::group_vfold_cv()
    • rsample::rolling_origin()
  • Nested cross-validation

    • rsample::nested_cv()
  • Dimensionality Reduction

    • recipes::step_pca()
    • recipes::step_nnmf()
  • Imputing missing predictors

    • recipes::step_impute_linear()
    • recipes::step_impute_knn()
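
A few hedged sketches of these tools (my_data, my_recipe, and subject_id are hypothetical placeholders, not workshop objects):

# Keep all rows from the same subject in the same fold (clustered data)
folds <- group_vfold_cv(my_data, group = subject_id, v = 10)

# Reduce many correlated predictors to a few principal components
my_recipe <- my_recipe |> 
  step_pca(all_numeric_predictors(), num_comp = 5)

# Impute missing predictor values from their nearest neighbors
my_recipe <- my_recipe |> 
  step_impute_knn(all_predictors(), neighbors = 5)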

Development and Tuning

  • Alternative algorithms

    • parsnip::linear_reg(penalty, mixture)
    • parsnip::logistic_reg(penalty, mixture)
    • parsnip::svm_rbf(cost)
  • Alternative tuning procedures

    • finetune::tune_sim_anneal()
    • finetune::tune_race_anova()
  • Combining models into ensembles

    • stacks::stacks()
    • stacks::blend_predictions()
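
For instance, a regularized (elastic net) logistic regression could be swapped into the Walkthrough #1 workflow and tuned with racing, a sketch assuming the {glmnet} and {finetune} packages are installed:

library(finetune)

# Elastic net logistic regression: penalty and mixture are tuned
enet_model <-
  logistic_reg(penalty = tune(), mixture = tune()) |> 
  set_mode("classification") |> 
  set_engine("glmnet")

# Racing drops clearly inferior parameter combinations early to save time
enet_tune <-
  coke_wflow |> 
  update_model(enet_model) |> 
  tune_race_anova(resamples = coke_folds, grid = 20)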

Model Evaluation

  • Workflow sets and parallelization

    • workflowsets::workflow_set()
    • tune::control_grid()
  • Advanced/alternative metrics

    • yardstick::mcc() or yardstick::mn_log_loss()
    • yardstick::ccc() or yardstick::huber_loss()
  • Statistical model comparison

    • tidyposterior::perf_mod()
    • tidyposterior::contrast_models()
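
For example, alternative classification metrics and saved predictions can be requested during tuning (a sketch reusing objects from Walkthrough #1):

# Ask for Matthews correlation and log loss alongside ROC AUC
class_metrics <- metric_set(roc_auc, mcc, mn_log_loss)

coke_tune <-
  coke_wflow |> 
  tune_grid(
    resamples = coke_folds,
    grid = 20,
    metrics = class_metrics,
    control = control_grid(save_pred = TRUE)  # keep out-of-fold predictions
  )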

Explanation and Deployment

  • Advanced model explanations

    • DALEXtra::explain_tidymodels()
    • DALEXtra::model_parts()
    • DALEXtra::predict_parts()
    • DALEXtra::model_profile()
  • Bundling and deploying models

    • bundle::bundle()
    • vetiver::vetiver_model()
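
A brief sketch of model explanation with {DALEXtra}, assuming it is installed and pulling the fitted workflow from the last_fit() result of Walkthrough #1:

library(DALEXtra)

# Build an explainer around the fitted classification workflow
coke_explainer <- explain_tidymodels(
  extract_workflow(coke_final),
  data = select(coke_train, -Coke),
  y = as.integer(coke_train$Coke == "1"),
  label = "random forest"
)

# Permutation-based variable importance across the whole model
plot(model_parts(coke_explainer))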

Question and Answer

Resources

Practice Assignments

Practice Assignments

  1. Refit the model from Walkthrough #1 using a regularized logistic regression classifier and the {glmnet} engine:
logistic_reg(penalty = tune(), mixture = tune()) |> 
set_mode("classification") |> 
set_engine("glmnet")
  2. Adjust the model from Walkthrough #1 to predict Methamphetamine use (Meth) rather than Cocaine use (Coke)
  3. Adjust the model from Walkthrough #2 to search over a larger grid of 125 parameter values (i.e., 5 unique values for each parameter, or \(5^3\))
  4. Explore the additional datasets on the workshop website and adapt your code to build a predictive model for one of these datasets.