Multilevel Modeling

Model Comparison

Spring 2026 | CLAS | PSYC 894
Jeffrey M. Girard | Lecture 10a

Roadmap

  1. Model Comparison
    • Deviance and Nesting
    • LRT and Information Criteria
  2. Modeling Approach
    • Simple to Complex
    • Sequential Comparisons

Model Comparison

Deviance

To compare models, we need to quantify their fit to the data

The deviance \((D)\) metric has some nice properties for this

It is a simple transformation of the model’s likelihood \((\mathcal{L})\)

\[D = -2\ell=-2 \log(\mathcal{L})\]

Lower deviance = Better fit
(e.g., \(1000\) is better than \(1552\))
(e.g., \(-52\) is better than \(-30\))
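To see the transformation in action, we can compute the deviance by hand from any fitted model's log-likelihood (a quick sketch using lm() and the built-in mtcars data, purely for illustration):

```r
# Fit a simple regression on built-in data
m <- lm(mpg ~ wt, data = mtcars)

# Deviance is -2 times the log-likelihood
ll <- as.numeric(logLik(m))
dev <- -2 * ll
dev
```

Note that deviance(m) for an lm object returns the residual sum of squares, which differs from \(-2\ell\) by a constant; for multilevel models we will use get_deviance(), which follows the \(-2\ell\) definition on this slide.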

Nesting

  • We can statistically compare nested models using their deviances
  • If two models are nested, that means the more specific model can be derived from the more general model by removing terms (i.e., fixing some of its parameters to zero)
    • ✅ Adding fixed slopes (nested)
    • ✅ Adding random slopes (nested)
    • ✅ Adding random intercepts (nested)
    • ❌ Swapping predictors (not nested)
    • ❌ Swapping clustering variables (not nested)
    • ❌❌ Swapping outcome variables (not comparable)

The Nesting Logic

  • Definition: Model A is nested in Model B if A is just B with some parts “turned off” (fixed at zero).
  • Visualizing Nesting: Think of Russian Dolls. A smaller model must fit entirely inside the larger one.
  • The Rule: You can add or remove terms, but you cannot swap them.
    • y ~ 1 + x is nested in y ~ 1 + x + z
    • y ~ 1 + x is not nested in y ~ 1 + z

Nesting Examples

  1. ✅ Adding a fixed slope
    • y ~ 1 + (1 | c) to y ~ 1 + x + (1 | c)
  2. ✅ Adding a random slope
    • y ~ 1 + x + (1 | c) to y ~ 1 + x + (1 + x | c)
  3. ✅ Adding random intercepts
    • y ~ 1 + x to y ~ 1 + x + (1 | c)

Non-Nesting Examples

  1. ❌ Swapping predictors
    • y ~ 1 + x + (1 | c) to y ~ 1 + z + (1 | c)
  2. ❌ Swapping clustering variables
    • y ~ 1 + (1 | c) to y ~ 1 + (1 | g)
  3. ❌❌ Swapping outcome variables
    • y ~ 1 + (1 | c) to q ~ 1 + (1 | c)

Likelihood Ratio Test

  • The LRT compares two nested models’ deviances

\[\Delta D = D_{\text{general}}-D_{\text{specific}}\]

  • It is a chi-squared test with degrees of freedom equal to the difference in the two models’ number of parameters \((q)\)

\[\Delta D \sim \chi^2(\text{df}=q_{\text{general}}-q_{\text{specific}})\]
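The same arithmetic is easy to do manually with pchisq(); a sketch using two nested lm() models on built-in data (the logic is identical for multilevel models):

```r
# Two nested models: the specific model drops the hp slope
specific <- lm(mpg ~ wt, data = mtcars)
general  <- lm(mpg ~ wt + hp, data = mtcars)

# Deviance difference and difference in number of parameters
delta_D <- -2 * as.numeric(logLik(specific)) + 2 * as.numeric(logLik(general))
q_diff  <- attr(logLik(general), "df") - attr(logLik(specific), "df")

# p-value from the chi-squared distribution (upper tail)
p <- pchisq(delta_D, df = q_diff, lower.tail = FALSE)
```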

Likelihood Ratio Test

  • Several important notes about the LRT
    • It can only compare nested models
    • The compared models should be fit with ML
    • We can use test_lrt(model1, model2, ...)
  • Interpreting the results
    • A significant result means \(\Delta D \neq 0\)
    • The model with lower deviance is better
    • We can examine them with get_deviance(model1)

Data Preparation

library(tidyverse)
library(glmmTMB)
library(easystats)

heck2011 <- read_csv("heck2011.csv", show_col_types = FALSE)
glimpse(heck2011)
Rows: 6,871
Columns: 7
$ school  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ student <dbl> 6701, 6702, 6703, 6704, 6705, 6706, 6707, 6708, 6709, 6710, 67…
$ female  <dbl> 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ ses     <dbl> 0.586, 0.304, -0.544, -0.848, 0.001, -0.106, -0.330, -0.891, 0…
$ math    <dbl> 47.1400, 63.6100, 57.7100, 53.9000, 58.0100, 59.8700, 62.5556,…
$ puniv   <dbl> 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.…
$ public  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Why glmmTMB?

  • Familiar Syntax: It uses the same formula, data, and REML arguments as lmer(), but defaults to REML = FALSE (whereas lmer() defaults to REML = TRUE)
  • Flexibility: Unlike lmer(), it can fit models with no random effects, which lets us compare models with and without them
  • Better Convergence: It is often more robust when models become complex (will be used a lot for troubleshooting)

Note: GLMM stands for Generalized Linear Mixed Modeling, which is multilevel modeling for non-normal outcomes, and TMB stands for Template Model Builder, which is the underlying software that powers the package

Do we need a random intercept?

m0 <- glmmTMB(math ~ 1, heck2011)
m1 <- glmmTMB(math ~ 1 + (1 | school), heck2011)
c(m0 = get_deviance(m0), m1 = get_deviance(m1))
      m0       m1 
49358.47 48875.75 
test_lrt(m0, m1)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name |   Model | df | df_diff |   Chi2 |      p
-----------------------------------------------
m0   | glmmTMB |  2 |         |        |       
m1   | glmmTMB |  3 |       1 | 482.72 | < .001
  • m1 has significantly better fit than m0, so we need the random intercept

Do we need a fixed slope?

m1 <- glmmTMB(math ~ 1 + (1 | school), heck2011)
m2 <- glmmTMB(math ~ 1 + female + (1 | school), heck2011)
c(m1 = get_deviance(m1), m2 = get_deviance(m2))
      m1       m2 
48875.75 48840.55 
test_lrt(m1, m2)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name |   Model | df | df_diff |  Chi2 |      p
----------------------------------------------
m1   | glmmTMB |  3 |         |       |       
m2   | glmmTMB |  4 |       1 | 35.20 | < .001
  • m2 has significantly better fit than m1, so we need the fixed slope

Do we need a random slope?

m2 <- glmmTMB(math ~ 1 + female + (1 | school), heck2011)
m3 <- glmmTMB(math ~ 1 + female + (1 + female | school), heck2011)
c(m2 = get_deviance(m2), m3 = get_deviance(m3))
      m2       m3 
48840.55 48835.32 
test_lrt(m2, m3)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name |   Model | df | df_diff | Chi2 |     p
--------------------------------------------
m2   | glmmTMB |  4 |         |      |      
m3   | glmmTMB |  6 |       2 | 5.23 | 0.073
  • m3 was not sig. better than m2, so we don’t need the random slope

Trying Non-nested Models

m4 <- glmmTMB(math ~ 1 + public + (1 | school), heck2011)
m5 <- glmmTMB(math ~ 1 + puniv + (1 | school), heck2011)
test_lrt(m4, m5)
Error:
! The models are not nested, which is a prerequisite for
  `test_likelihoodratio()`.
  See the 'Details' section.
  You may try `test_vuong()` instead.

Information Criteria

  • We use “information criteria” to compare non-nested models
    • They are only meaningful when the outcome is the same
    • Lower is better but the comparisons are only relative
    • \(\Delta\text{IC}\): 0–2 = small, 2–10 = medium, 10+ = large difference

\[ \text{AIC} = D + 2q \quad\quad\quad \text{BIC} = D + q \log(n) \]

  • Two popular ICs are the Akaike (AIC) and Bayesian (BIC)
    • Both metrics penalize the model for adding parameters \((q)\)
    • BIC is more “conservative” than AIC (i.e., adds a larger penalty)
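Both formulas are easy to verify against R's built-in AIC() and BIC() functions (a sketch with lm(); note that \(q\) here includes the residual variance parameter):

```r
m <- lm(mpg ~ wt, data = mtcars)

D <- -2 * as.numeric(logLik(m))  # deviance
q <- attr(logLik(m), "df")       # number of parameters (incl. residual variance)
n <- nobs(m)                     # sample size

aic <- D + 2 * q
bic <- D + q * log(n)

# These match R's built-in functions
all.equal(aic, AIC(m))  # TRUE
all.equal(bic, BIC(m))  # TRUE
```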

IC for Non-nested Models

compare_performance(
  m4, # math ~ 1 + public + (1 | school)
  m5, # math ~ 1 + puniv + (1 | school)
  metrics = c("AIC", "BIC")
)
# Comparison of Model Performance Indices

Name |   Model |   AIC (weights) |   BIC (weights)
--------------------------------------------------
m4   | glmmTMB | 48883.3 (<.001) | 48910.7 (<.001)
m5   | glmmTMB | 48840.1 (>.999) | 48867.5 (>.999)
  • Thus, puniv is a better predictor of math than public is

  • The AIC weights suggest the probability that m5 is better than m4 exceeds 99.9%
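The weights printed by compare_performance() can be reproduced from raw AIC values (a sketch using the AIC values from the table above):

```r
# AIC values for the two models (any named vector of AICs works)
aics <- c(m4 = 48883.3, m5 = 48840.1)

# Akaike weights: rescale differences from the best model, then normalize
delta <- aics - min(aics)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
round(weights, 3)
```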

IC for Nested Models

compare_performance(
  m0, # math ~ 1
  m1, # math ~ 1 + (1 | school)
  m2, # math ~ 1 + female + (1 | school)
  m3, # math ~ 1 + female + (1 + female | school)
  metrics = c("AIC", "BIC")
)
# Comparison of Model Performance Indices

Name |   Model |   AIC (weights) |   BIC (weights)
--------------------------------------------------
m0   | glmmTMB | 49362.5 (<.001) | 49376.1 (<.001)
m1   | glmmTMB | 48881.8 (<.001) | 48902.3 (<.001)
m2   | glmmTMB | 48848.6 (0.351) | 48875.9 (0.998)
m3   | glmmTMB | 48847.3 (0.649) | 48888.3 (0.002)
  • AIC prefers the more complex m3 (64.9%), BIC prefers the simpler m2 (99.8%)

Modeling Approach

Bottom-up Strategy

  • Start with the simplest model and build up, one step at a time

  • Test whether each step is a significant improvement

  1. Fit a null model (intercepts only)

  2. Add all L1 predictors (fixed slopes only)

  3. Add all L2 predictors (fixed slopes)

  4. Add L1 random slopes one at a time 🔄

  5. Add cross-level interactions one at a time 🔄

Example

dat <-
  heck2011 |> 
  degroup(select = "ses", by = "school") |>
  mutate(female = factor(female), public = factor(public)) |> 
  standardize()
glimpse(dat)
Rows: 6,871
Columns: 9
$ school      <dbl> -1.716289, -1.716289, -1.716289, -1.716289, -1.716289, -1.…
$ student     <dbl> 1.6459713, 1.6464754, 1.6469796, 1.6474837, 1.6479878, 1.6…
$ female      <fct> 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ ses         <dbl> 0.70967156, 0.34849365, -0.73760160, -1.12695651, -0.03958…
$ math        <dbl> -1.206047390, 0.668954645, -0.002721858, -0.436465498, 0.0…
$ puniv       <dbl> -2.8981142, -2.8981142, -2.8981142, -2.8981142, -2.8981142…
$ public      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ses_between <dbl> -0.605202, -0.605202, -0.605202, -0.605202, -0.605202, -0.…
$ ses_within  <dbl> 1.409368976, 0.943297969, -0.458220520, -0.960651677, 0.44…
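For reference, degroup() is roughly equivalent to group-mean centering by hand; a sketch with dplyr on a toy data frame (toy is hypothetical, not the course data):

```r
library(dplyr)

# Toy data (hypothetical, not heck2011): two schools, two students each
toy <- data.frame(
  school = c(1, 1, 2, 2),
  ses    = c(0.2, 0.6, -0.4, 0.0)
)

toy <- toy |>
  mutate(
    ses_between = mean(ses),        # cluster mean (the L2 component)
    ses_within  = ses - mean(ses),  # deviation from cluster mean (the L1 component)
    .by = school                    # per-school grouping (requires dplyr >= 1.1)
  )
```

By construction, ses_within averages to zero within each school, and ses_between is constant within each school.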

Step 1

Fit null model and calculate unconditional ICC

step1 <- glmmTMB(
  math ~ 1 + (1 | school),
  data = dat,
  REML = FALSE
)
icc(step1)
# Intraclass Correlation Coefficient

    Adjusted ICC: 0.137
  Unadjusted ICC: 0.137
  • 14% of math score variance is between schools
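Recall that the ICC is just the random-intercept variance divided by the total variance; a hand computation with illustrative numbers (hypothetical values, not extracted from step1):

```r
# Illustrative variance components (hypothetical, chosen to sum to 1)
tau2   <- 0.137  # between-school (random intercept) variance
sigma2 <- 0.863  # within-school (residual) variance

icc <- tau2 / (tau2 + sigma2)
icc  # 0.137
```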

Step 2

Add fixed slopes for all L1 predictors

step2 <- glmmTMB(
  math ~ 1 + female + ses_within + (1 | school),
  data = dat,
  REML = FALSE
)
test_lrt(step1, step2)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name  |   Model | df | df_diff |   Chi2 |      p
------------------------------------------------
step1 | glmmTMB |  3 |         |        |       
step2 | glmmTMB |  5 |       2 | 426.98 | < .001
  • Including the L1 predictors is helpful

Step 3

Add fixed slopes for all L2 predictors

step3 <- glmmTMB(
  math ~ 1 + female + ses_within + 
    ses_between + public + puniv + (1 | school),
  data = dat,
  REML = FALSE
)
test_lrt(step2, step3)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name  |   Model | df | df_diff |   Chi2 |      p
------------------------------------------------
step2 | glmmTMB |  5 |         |        |       
step3 | glmmTMB |  8 |       3 | 354.67 | < .001
  • Including the L2 predictors is helpful

Step 4a

Add first random slope (for female)

step4a <- glmmTMB(
  math ~ 1 + female + ses_within + 
    ses_between + public + puniv + (1 + female | school),
  data = dat,
  REML = FALSE
)
test_lrt(step3, step4a)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name   |   Model | df | df_diff | Chi2 |     p
----------------------------------------------
step3  | glmmTMB |  8 |         |      |      
step4a | glmmTMB | 10 |       2 | 4.48 | 0.106
  • Making the female slope random is not helpful

Step 4b

Add second random slope (for ses_within)

step4b <- glmmTMB(
  math ~ 1 + female + ses_within + 
    ses_between + public + puniv + (1 + ses_within | school),
  data = dat,
  REML = FALSE
)
test_lrt(step3, step4b)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name   |   Model | df | df_diff |  Chi2 |     p
-----------------------------------------------
step3  | glmmTMB |  8 |         |       |      
step4b | glmmTMB | 10 |       2 | 10.79 | 0.005
  • Making the ses_within slope random is helpful

Step 5a

Add the cross-level interaction of ses_within and public

step5a <- glmmTMB(
  math ~ 1 + female + ses_within * public +
    ses_between + public + puniv + 
    (1 + ses_within | school),
  data = dat,
  REML = FALSE
)
test_lrt(step4b, step5a)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name   |   Model | df | df_diff | Chi2 |     p
----------------------------------------------
step4b | glmmTMB | 10 |         |      |      
step5a | glmmTMB | 11 |       1 | 2.78 | 0.095
  • Including this interaction may not be necessary

Step 5b

Add the cross-level interaction of ses_within and puniv

step5b <- glmmTMB(
  math ~ 1 + female + ses_within * puniv +
    ses_between + public + puniv + 
    (1 + ses_within | school),
  data = dat,
  REML = FALSE
)
test_lrt(step4b, step5b)
# Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)

Name   |   Model | df | df_diff | Chi2 |     p
----------------------------------------------
step4b | glmmTMB | 10 |         |      |      
step5b | glmmTMB | 11 |       1 | 0.56 | 0.456
  • Including this interaction is not helpful

Summary of LRT Results

  • LRT favors the model from Step 4b, so I would usually choose that
  • Or, if testing an SES interaction was my goal, I would use 5a or 5b

Final Model (According to LRT)

model_parameters(step4b, effects = "fixed")
# Fixed Effects

Parameter   | Coefficient |   SE |         95% CI |     z |      p
------------------------------------------------------------------
(Intercept) |        0.08 | 0.03 | [ 0.02,  0.14] |  2.81 | 0.005 
female [1]  |       -0.12 | 0.02 | [-0.16, -0.08] | -5.39 | < .001
ses within  |        0.22 | 0.01 | [ 0.20,  0.24] | 18.89 | < .001
ses between |        0.31 | 0.01 | [ 0.29,  0.34] | 21.54 | < .001
public [1]  |       -0.03 | 0.03 | [-0.09,  0.03] | -0.92 | 0.357 
puniv       |        0.05 | 0.01 | [ 0.02,  0.07] |  3.19 | 0.001 

IC Model Comparison

compare_performance(
  step1, step2, step3, step4a, 
  step4b, step5a, step5b,
  metrics = c("AIC", "BIC")
)
# Comparison of Model Performance Indices

Name   |   Model |   AIC (weights) |   BIC (weights)
----------------------------------------------------
step1  | glmmTMB | 19021.3 (<.001) | 19041.8 (<.001)
step2  | glmmTMB | 18598.4 (<.001) | 18632.5 (<.001)
step3  | glmmTMB | 18249.7 (0.011) | 18304.4 (0.966)
step4a | glmmTMB | 18249.2 (0.014) | 18317.6 (0.001)
step4b | glmmTMB | 18242.9 (0.329) | 18311.3 (0.031)
step5a | glmmTMB | 18242.1 (0.486) | 18317.3 (0.001)
step5b | glmmTMB | 18244.3 (0.160) | 18319.5 (<.001)
  • So AIC prefers the more complex step5a (48.6%) and BIC prefers the simpler step3 (96.6%)

Which Metric Should I Trust?

When AIC, BIC, and LRT disagree, consider your research goal:

  1. LRT: Best for testing a specific “yes/no” hypothesis about a parameter
  2. AIC: Best if you want the most accurate prediction (favors complexity)
  3. BIC: Best for building simple, clean theory (favors parsimony)

Tip

If BIC prefers the simpler model and AIC prefers the complex one, the “extra” effects are likely very small