Jeffrey M. Girard | University of Kansas
Technology in Psychiatry Summit 2022
Jeffrey Girard, PhD
www.jmgirard.com
jmgirard@ku.edu
Background
Research Areas
Conceptual Introductions
Walkthrough #1
Walkthrough #2
Advanced Previews
Question and Answer
Practice Assignments
Workshop Website: jmgirard.github.io/tips2022ml
Workshop Slack: bit.ly/tips2022ml
Walkthrough #1
Drug Consumption \((n = 1876, p_{\text{cocaine user}} = 0.36)\)
How well can we predict who is a cocaine user based on demographics, personality traits, and legal substance use?
Walkthrough #2
Cali Housing \((n=2000, \text{Range}=\$15\text{k}\text{ to }\$500\text{k})\)
How well can we predict the median house value of a block in 1990 California based on its houses, population, and location?
Machine learning (ML) is a branch of computer science
ML researchers develop algorithms that learn from data
When algorithms learn from data, they create models
This workshop is about creating predictive models
Our goal will be to use known/available information…
To predict unknown/unavailable information…
Ideally in a way that generalizes to novel/unseen data
Note. ML researchers also create models for other purposes (e.g., data generation).
Labels / Outcomes
Labels are variables we want to predict the unknown values of
Labels tend to be expensive or difficult to measure in new data
Labels may also describe subjective experiences or future events
e.g., diagnoses, treatment response, substance use, suicide attempts
Features / Predictors
Features are variables we use to predict the unknown label values
Features tend to be relatively cheap and easy to measure in new data
e.g., demographics, self-reports, digital behavior, neurobiology
Classification
When labels are categorical, predicting them is called “classification”
e.g., Will this patient respond to treatment or not?
e.g., Is this patient depressed, euthymic, or manic?
Regression
When labels are continuous, predicting them is called “regression”
e.g., How many days will the patient be hospitalized for?
e.g., How much of the drug will be in the patient’s blood?
Note. Unsupervised ML (or “data mining”) has no explicit labels.
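In {parsnip}, the same algorithm can often handle either task by changing its mode; a minimal preview of the pattern used in the walkthroughs below:
library(tidymodels)
# Classification: predict a categorical label (e.g., cocaine user or not)
rf_class <- rand_forest() |> set_mode("classification") |> set_engine("ranger")
# Regression: predict a continuous label (e.g., median house value)
rf_reg <- rand_forest() |> set_mode("regression") |> set_engine("ranger")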
Any data we collect will contain a mixture of both signal and noise
We want to capture as much signal and as little noise as possible
More complex models capture more signal but also more noise
Note. Complexity here is controlled by the highest polynomial degree \((d)\).
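For intuition, here is a toy sketch (hypothetical data, not from the workshop materials) fitting polynomials of increasing degree \(d\) to the same noisy data:
# Toy illustration: higher-degree polynomials fit the observed data more
# closely, but much of what they capture is noise rather than signal
set.seed(2022)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.5) # true signal plus noise
fit_d1 <- lm(y ~ poly(x, degree = 1)) # simple model
fit_d9 <- lm(y ~ poly(x, degree = 9)) # complex model
# Error on the observed data always shrinks as the degree d increases...
sqrt(mean(residuals(fit_d1)^2))
sqrt(mean(residuals(fit_d9)^2))
# ...but error on new/unseen data may not (see the data-splitting slides)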
Our primary tool for these goals is data splitting
We learn feature-label relationships and tune the model complexity using one part (the training set)
We evaluate the model’s ability to generalize to new data in another part that it has never seen before (the testing set)
Note. Cross-validation is a sophisticated form of data splitting.
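A minimal sketch of data splitting with {rsample}, using the built-in mtcars data purely for illustration (the walkthroughs apply the same functions to the workshop data):
library(tidymodels)
# Hold out 20% of the rows as a testing set
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test <- testing(car_split)
# Cross-validation further splits the training set into multiple "folds"
car_folds <- vfold_cv(car_train, v = 5)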
Training Error: Model A = 31.2 😟 vs. Model B = 0.0 🤩
Testing Error: Model A = 17.4 🙂 vs. Model B = 41.6 😨
In ML, a model’s “bias” shows up as prediction error in the original (training) data
In ML, a model’s “variance” shows up as prediction error in new/unseen data
An ideal model would have low bias and low variance
However, there is often an inherent bias-variance tradeoff
We want a model that is as simple as possible, but no simpler
We will thus “tune” our models to try to achieve this balance
Predict cocaine use from demographics and personality
We load our data from a local text (i.e., CSV) file
We coerce all categorical variables to explicit factors
We retain only features/predictors and our label/outcome
# Load the tidyverse package for data reading and tidying
library(tidyverse)
drugs <-
# Read in data from CSV file
read_csv("./data/drug_consumption.csv", show_col_types = FALSE) |>
# Transform categorical variables into factors
mutate(
across(c(Age:Ethnicity, Caff:Nicotine), factor),
Coke = factor(Coke, levels = c(1, 0)) # Set positive class first
) |>
# Drop unneeded variables
select(Age:Nicotine, Coke)
Note. Read “R for Data Science” to learn about efficient data tidying in R.
We use 80% of the data for “training” and 20% for “testing”
We stratify both sets by our outcome variable (cocaine use)
# Load the tidymodels package for machine learning
library(tidymodels)
# Set random number generation seed for reproducibility
set.seed(2022)
# Create initial split for holdout validation
coke_split <- initial_split(drugs, prop = 0.8, strata = Coke)
# Extract and save the training set (for development and tuning)
coke_train <- training(coke_split)
# Extract and save the testing set (for evaluation only)
coke_test <- testing(coke_split)
Note. Stratification ensures a similar distribution across all splits.
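An optional check (not in the original slides): compare the outcome distribution across the two sets, which should be nearly identical because of stratification.
# Proportion of cocaine users in the training and testing sets
coke_train |> count(Coke) |> mutate(prop = n / sum(n))
coke_test |> count(Coke) |> mutate(prop = n / sum(n))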
We assign a role (“outcome” or “predictor”) to each variable
We engineer (e.g., transform and filter) the features/predictors
coke_recipe <-
# Initialize the recipe from the training set
recipe(coke_train) |>
# Assign the outcome role to the label variable
update_role(Coke, new_role = "outcome") |>
# Assign the predictor role to the feature variables
update_role(Age:Nicotine, new_role = "predictor") |>
# Dummy code all nominal (i.e., factor) predictors
step_dummy(all_nominal_predictors()) |>
# Remove predictors with near zero variance
step_nzv(all_predictors()) |>
# Remove predictors that are highly inter-correlated
step_corr(all_predictors()) |>
# Remove predictors that are linear combinations
step_lincomb(all_predictors())
Note. Many other feature engineering steps are available from {recipes}.
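An optional way to inspect what the engineered predictors look like (not shown in the slides) is to prep the recipe on the training set and bake it:
# Estimate the recipe on the training set and view the processed predictors
coke_recipe |>
  prep(training = coke_train) |>
  bake(new_data = NULL) |>
  glimpse()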
We specify a Random Forest classifier using {ranger}
We specify that all three model parameters should be tuned
We set the “importance” metric for later model explanations
# Optional: view documentation for the random forest algorithm
?rand_forest
coke_model <-
# Select the algorithm and which parameters to tune
rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
# Select the prediction mode (classification or regression)
set_mode("classification") |>
# Select the engine to use and, optionally, the importance metric
set_engine("ranger", importance = "impurity")
Note. Many other algorithms and engines are available from {parsnip}.
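For example, a penalized logistic regression could be swapped in by changing only the model specification; a hedged sketch (the {glmnet} engine must be installed):
# Hypothetical alternative specification: penalized logistic regression
coke_model_glmnet <-
  logistic_reg(penalty = tune(), mixture = tune()) |>
  set_mode("classification") |>
  set_engine("glmnet")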
We combine our recipe and model into a “workflow” object
This helps {tidymodels} repeat the preprocessing and modeling steps correctly during tuning
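# Combine the recipe and model specification into a workflow
# (the same code appears as step 5 of the full script later in the slides)
coke_wflow <-
  workflow() |>
  add_recipe(coke_recipe) |>
  add_model(coke_model)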
Note. We could also combine multiple workflows into {workflowsets}.
We split the training set into folds for cross-validation
We try many different parameter values and record their performance
# Create 10-fold CV within training set for tuning
coke_folds <- vfold_cv(coke_train, v = 10, strata = Coke)
# Pick reasonable boundaries for the tuning parameters
coke_param <-
extract_parameter_set_dials(coke_model) |>
finalize(coke_folds)
# Create and search over a grid of parameter values within boundaries
coke_tune <-
coke_wflow |>
tune_grid(
resamples = coke_folds,
param_info = coke_param,
grid = 20 # how many combinations to try? (more is better but slower)
)
Note. Other tuning procedures are available from {tune} and {finetune}.
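Optionally, the tuning results can be inspected before finalizing (a small sketch, not in the original slides):
# View the top parameter combinations and performance across the grid
show_best(coke_tune, metric = "accuracy")
autoplot(coke_tune)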
We select our final parameter values from the tuning results
We refit to the entire training set and apply it to the testing set
# Select the best parameter values
coke_param_final <- select_best(coke_tune, metric = "accuracy")
# Finalize the workflow with the best parameter values
coke_wflow_final <-
coke_wflow |>
finalize_workflow(coke_param_final)
# Fit the finalized workflow to training set and apply it to testing set
coke_final <-
coke_wflow_final |>
last_fit(coke_split)
Note. More complicated selection procedures are available from {tune}.
We compare the test set predictions to their labels
We make a confusion matrix and ROC curve (for classification)
# View the metrics (from the testing set)
collect_metrics(coke_final)
# Collect predictions (from the testing set)
coke_pred <- collect_predictions(coke_final)
coke_pred
# Calculate and plot confusion matrix
coke_cm <- conf_mat(coke_pred, truth = Coke, estimate = .pred_class)
coke_cm
autoplot(coke_cm, type = "heatmap")
summary(coke_cm)
# Calculate and plot ROC curve
coke_rc <- roc_curve(coke_pred, truth = Coke, .pred_1)
autoplot(coke_rc)
Note. Many other performance tools are available from {yardstick}.
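For example, additional metrics can be computed directly from the collected predictions (a small sketch):
# Additional metrics from the testing-set predictions
coke_pred |> roc_auc(truth = Coke, .pred_1)             # area under the ROC curve
coke_pred |> sens(truth = Coke, estimate = .pred_class) # sensitivity
coke_pred |> spec(truth = Coke, estimate = .pred_class) # specificity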
Which variables were most important to prediction?
library(tidyverse)
library(tidymodels)
library(vip)
# 1. Import and tidy data
drugs <-
read_csv("./data/drug_consumption.csv", show_col_types = FALSE) |>
mutate(
across(c(Age:Ethnicity, Caff:Nicotine), factor),
Coke = factor(Coke, levels = c(1, 0))
) |>
select(Age:Nicotine, Coke)
# 2. Split data for training and testing
set.seed(2022)
coke_split <- initial_split(drugs, prop = 0.8, strata = Coke)
coke_train <- training(coke_split)
coke_test <- testing(coke_split)
# 3. Prepare preprocessing recipe
coke_recipe <-
recipe(coke_train) |>
update_role(Coke, new_role = "outcome") |>
update_role(Age:Nicotine, new_role = "predictor") |>
step_dummy(all_nominal_predictors()) |>
step_nzv(all_predictors()) |>
step_corr(all_predictors()) |>
step_lincomb(all_predictors())
# 4. Set up model and parameters
coke_model <-
rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
set_mode("classification") |>
set_engine("ranger", importance = "impurity")
# 5. Prepare workflow
coke_wflow <-
workflow() |>
add_recipe(coke_recipe) |>
add_model(coke_model)
# 6. Tune parameters using grid search
coke_folds <- vfold_cv(coke_train, v = 10, strata = Coke)
coke_param <-
extract_parameter_set_dials(coke_model) |>
finalize(coke_folds)
coke_tune <-
coke_wflow |>
tune_grid(
resamples = coke_folds,
param_info = coke_param,
grid = 20
)
# 7. Finalize the workflow
coke_param_final <- select_best(coke_tune, metric = "accuracy")
coke_wflow_final <-
coke_wflow |>
finalize_workflow(coke_param_final)
coke_final <-
coke_wflow_final |>
last_fit(coke_split)
# 8. Evaluate model performance
collect_metrics(coke_final)
coke_pred <- collect_predictions(coke_final)
coke_cm <- conf_mat(coke_pred, truth = Coke, estimate = .pred_class)
autoplot(coke_cm, type = "heatmap")
summary(coke_cm)
coke_rc <- roc_curve(coke_pred, truth = Coke, .pred_1)
autoplot(coke_rc)
# Bonus: Explore variable importance
coke_final |>
extract_fit_parsnip() |>
vi()
coke_final |>
extract_fit_parsnip() |>
vip()
Predict median housing value from block information
To adapt this process to regression, we only need to:
Assign a continuous variable to the “outcome” role
Select an algorithm that supports regression
(Note that some support both, e.g., random forest)
Set the prediction mode to “regression”
Use regression performance metrics (e.g., RMSE or \(R^2\))
Skip the confusion matrix and ROC curve
(We can instead plot predictions against trusted labels)
library(tidyverse)
library(tidymodels)
library(vip)
# 1. Import and tidy data
calihouse <- read_csv("./data/calihousing.csv", show_col_types = FALSE)
# 2. Split data for training and testing
set.seed(2022)
ch_split <- initial_split(calihouse, prop = 0.8, strata = house_mdn_value)
ch_train <- training(ch_split)
ch_test <- testing(ch_split)
# 3. Prepare preprocessing recipe
ch_recipe <-
recipe(ch_train) |>
update_role(house_mdn_value, new_role = "outcome") |>
update_role(house_mdn_age:dist_sj, new_role = "predictor") |>
step_nzv(all_predictors()) |>
step_corr(all_predictors()) |>
step_lincomb(all_predictors()) |>
step_normalize(all_numeric_predictors())
# 4. Set up model and parameters
ch_model <-
rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
set_mode("regression") |>
set_engine("ranger", importance = "impurity")
# 5. Prepare workflow
ch_wflow <-
workflow() |>
add_recipe(ch_recipe) |>
add_model(ch_model)
# 6. Tune parameters using grid search
ch_folds <- vfold_cv(ch_train, v = 10, strata = house_mdn_value)
ch_param <-
ch_model |>
extract_parameter_set_dials() |>
finalize(ch_folds)
ch_tune <-
ch_wflow |>
tune_grid(
resamples = ch_folds,
param_info = ch_param,
grid = 20
)
# 7. Finalize the workflow
ch_param_final <- select_best(ch_tune, metric = "rmse")
ch_wflow_final <-
ch_wflow |>
finalize_workflow(ch_param_final)
ch_final <-
ch_wflow_final |>
last_fit(ch_split)
# 8. Evaluate model performance
collect_metrics(ch_final)
ch_pred <- collect_predictions(ch_final)
ggplot(ch_pred, aes(x = house_mdn_value, y = .pred)) +
geom_point(alpha = 1/5) +
geom_abline() +
coord_obs_pred()
# Bonus: Explore variable importance
ch_final |>
extract_fit_parsnip() |>
vi()
ch_final |>
extract_fit_parsnip() |>
vip()
Clustered and longitudinal data
rsample::group_vfold_cv()
rsample::rolling_origin()
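A hedged sketch of grouped cross-validation (my_data and subject_id are hypothetical placeholders, not part of the workshop data):
# Keep all rows from the same subject together in the same fold
my_folds <- group_vfold_cv(my_data, group = subject_id, v = 10)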
Nested cross-validation
rsample::nested_cv()
Dimensionality Reduction
recipes::step_pca()
recipes::step_nnmf()
Imputing missing predictors
recipes::step_impute_linear()
recipes::step_impute_knn()
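A hedged sketch of how dimensionality reduction and imputation steps could drop into the walkthrough recipe (step choices and ordering are illustrative only):
# Illustrative alternative recipe with imputation and PCA steps
coke_recipe_alt <-
  recipe(coke_train) |>
  update_role(Coke, new_role = "outcome") |>
  update_role(Age:Nicotine, new_role = "predictor") |>
  step_impute_knn(all_predictors()) |>             # fill in missing predictors
  step_dummy(all_nominal_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 5) # reduce to 5 components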
Alternative algorithms
parsnip::linear_reg(penalty, mixture)
parsnip::logistic_reg(penalty, mixture)
parsnip::svm_rbf(cost)
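A hedged sketch of one such alternative specification (requires the {kernlab} engine):
# RBF support vector machine with a tuned cost parameter
svm_model <-
  svm_rbf(cost = tune()) |>
  set_mode("classification") |>
  set_engine("kernlab")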
Alternative tuning procedures
finetune::tune_sim_anneal()
finetune::tune_race_anova()
Combining models into ensembles
stacks::stacks()
stacks::blend_predictions()
Workflow sets and parallelization
workflowsets::workflow_set()
tune::control_grid()
Advanced/alternative metrics
yardstick::mcc() or yardstick::mn_log_loss()
yardstick::ccc() or yardstick::huber_loss()
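A hedged sketch of requesting alternative metrics during tuning via a metric set (reusing the objects from Walkthrough #1):
# Combine a class metric and a probability metric into one metric set
class_metrics <- metric_set(mcc, mn_log_loss)
coke_tune_alt <-
  coke_wflow |>
  tune_grid(
    resamples = coke_folds,
    param_info = coke_param,
    grid = 20,
    metrics = class_metrics
  )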
Statistical model comparison
tidyposterior::perf_mod()
tidyposterior::contrast_models()
Note. The {yardstick} team is currently adding calibration metrics.
Advanced model explanations
DALEXtra::explain_tidymodels()
DALEXtra::model_parts()
DALEXtra::predict_parts()
DALEXtra::model_profile()
Bundling and deploying models
bundle::bundle()
vetiver::vetiver_model()
Online Textbook:
Tidy Modeling with R (Kuhn & Silge)
Package Reference Website:
The tidymodels framework (RStudio/Posit)
Four-day summer course:
Applied Machine Learning in R (Girard & Wang)
Practice Assignment: Predict methamphetamine use (Meth) rather than cocaine use (Coke)