Multilevel Modeling

Processing Multilevel Data

Spring 2026 | CLAS | PSYC 894
Jeffrey M. Girard | Lecture 04b

Roadmap

Data Tidying Recap
- Select, Filter, Mutate
Reshaping Data
- Wide vs. Long Formats
- Pivoting with tidyr
Handling Hierarchy
- Grouping & Summarizing
- Joining Level 1 & 2 Files

Data Tidying Recap

The Tidyverse Toolkit

dplyr is our primary tool for data manipulation.
It uses a set of “verbs” to describe actions on data:
- select(): Pick specific columns
- filter(): Pick specific rows
- mutate(): Create or change columns

The Pipe (|> or %>%)
- Passes the result of one function to the next.
- Read it as “and then…”
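
As a minimal sketch of what the pipe does, the nested call and the piped call below compute the same result. The vector x is made up for illustration, and the native pipe |> assumes R 4.1 or later.

```r
# Hypothetical toy vector (not from the lecture data)
x <- c(4, 9, 16)

# Nested version: must be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped version: take x AND THEN sqrt AND THEN mean AND THEN round
piped <- x |> sqrt() |> mean() |> round(1)

# Both approaches give the same answer
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is why it tends to be easier to read.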

Example Dataset

library(tidyverse)
heck <- read_csv("heck2011.csv")

# Take heck AND THEN glimpse it
heck |> glimpse()

Rows: 6,871
Columns: 7
$ school  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ student <dbl> 6701, 6702, 6703, 6704, 6705, 6706, 6707, 6708, 6709, 6710, 67…
$ female  <dbl> 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ ses     <dbl> 0.586, 0.304, -0.544, -0.848, 0.001, -0.106, -0.330, -0.891, 0…
$ math    <dbl> 47.1400, 63.6100, 57.7100, 53.9000, 58.0100, 59.8700, 62.5556,…
$ puniv   <dbl> 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.…
$ public  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Select & Filter

# Take heck AND THEN select columns AND THEN filter rows AND THEN print it
heck_subset <- 
  heck |> 
  select(student, school, math, starts_with("p")) |> # retain these columns
  filter(math >= 50) |> # retain students with math greater than or equal to 50
  print()

# A tibble: 5,514 × 5
  student school  math  puniv public
    <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1    6702      1  63.6 0.0833      0
2    6703      1  57.7 0.0833      0
3    6704      1  53.9 0.0833      0
4    6705      1  58.0 0.0833      0
5    6706      1  59.9 0.0833      0
6    6707      1  62.6 0.0833      0
# ℹ 5,508 more rows

Mutate

# Take heck AND THEN transform school and create the logmath variable
heck_transformed <- 
  heck |> 
  mutate(
    school = factor(school), # Convert to categorical (in place)
    logmath = log(math)      # Natural log transform (new column)
  ) |> 
  print()

# A tibble: 6,871 × 8
  school student female    ses  math  puniv public logmath
  <fct>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>   <dbl>
1 1         6701      1  0.586  47.1 0.0833      0    3.85
2 1         6702      1  0.304  63.6 0.0833      0    4.15
3 1         6703      1 -0.544  57.7 0.0833      0    4.06
4 1         6704      0 -0.848  53.9 0.0833      0    3.99
5 1         6705      0  0.001  58.0 0.0833      0    4.06
6 1         6706      0 -0.106  59.9 0.0833      0    4.09
# ℹ 6,865 more rows

Reshaping Data

The Shape of MLM Data

Wide Format
- One Row = One Cluster (e.g., a subject in a longitudinal study)
- One Column = One Time-Specific Score
  (e.g., Stress_T1, Stress_T2, …, Mood_T1, Mood_T2, …)
- ❌ Most MLM software cannot handle this

Long Format
- One Row = One Observation
  (e.g., Time 1 of Person 1, Time 2 of Person 1, …)
- One Column = One Variable (e.g., Time, Person, Stress, Mood)
- ✅ Most MLM software expects this

Scenario 1: Simple

# Example longitudinal dataset in wide format
wide_simple

# A tibble: 2 × 4
  subject    T1    T2    T3
    <int> <dbl> <dbl> <dbl>
1       1    10    14     9
2       2    12    11    13

This dataset is currently in wide format
- We have 2 clusters (subjects), each with its own row
- We have 3 time points, each with its own column

But MLM wants it in long format
- We want 6 rows, one per observation (time-specific score)
- We want 3 columns: subject, time, score

Scenario 1: Simple

# Pivot the data longer (from wide-format to long-format)
long_simple <- 
  wide_simple |> 
  pivot_longer(
    cols = c(T1, T2, T3), # which old columns contain our observations?
    names_to = "time",    # what new column should the old column names go to?
    values_to = "score"   # what new column should the old values go to?
  ) |> 
  print()

# A tibble: 6 × 3
  subject time  score
    <int> <chr> <dbl>
1       1 T1       10
2       1 T2       14
3       1 T3        9
4       2 T1       12
5       2 T2       11
6       2 T3       13

We move the old column names to a new column (“time”)
We move the old 2x3 block of values into a new column (“score”)
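
To check the reshape, it can help to pivot back again. The sketch below rebuilds wide_simple by hand (the slides do not show how it was created, so the tibble() call is an assumption), pivots it longer, and then uses pivot_wider() to confirm that nothing was lost along the way.

```r
library(tidyr)
library(tibble)

# Rebuild the wide_simple example by hand (values copied from the slide)
wide_simple <- tibble(
  subject = 1:2,
  T1 = c(10, 12),
  T2 = c(14, 11),
  T3 = c(9, 13)
)

# Same pivot as on the slide: 2 rows x 3 time columns become 6 rows
long_simple <- wide_simple |>
  pivot_longer(cols = c(T1, T2, T3), names_to = "time", values_to = "score")

# pivot_wider() reverses the move: names come from "time", values from "score"
wide_again <- long_simple |>
  pivot_wider(names_from = time, values_from = score)

# The round trip should recover the original wide table
all.equal(wide_simple, wide_again)
```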

Scenario 2: Realistic

# Example longitudinal dataset in very-wide-format
wide_real

# A tibble: 2 × 5
  id    stress_T1 stress_T2 mood_T1 mood_T2
  <chr>     <dbl>     <dbl>   <dbl>   <dbl>
1 S001         10        14       5       7
2 S002         12        11       6       5

Real-world data is often a bit more complicated…
Here, we have two scores per time point: stress and mood
- MLM needs these to be separate columns…
- Luckily, they are named consistently as “variable_time”
- We can leverage this using names_sep and names_to

Scenario 2: Realistic

# Pivot the data longer (from very-wide-format to long-format)
long_real <- 
  wide_real |> 
  pivot_longer(
    # Which old columns contain our observations?
    cols = c(starts_with("stress"), starts_with("mood")), 
    # What separates the parts of our old column names (e.g., "stress" _ "T1")?
    names_sep = "_",
    # Map the parts: ".value" keeps the name as a header, "time" takes the suffix
    names_to = c(".value", "time")
  ) |> 
  print()

# A tibble: 4 × 4
  id    time  stress  mood
  <chr> <chr>  <dbl> <dbl>
1 S001  T1        10     5
2 S001  T2        14     7
3 S002  T1        12     6
4 S002  T2        11     5

Why .value is Magic

Notice what happened:

pivot_longer() saw stress_T1.
It split it at _.
Because we said c(".value", "time"):
- “stress” became a Column Header.
- “T1” went into the time column.

This creates a perfectly formatted MLM dataset with one row per observation, and separate columns for your predictors (\(X\)) and outcomes (\(Y\)).
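
To see what ".value" buys us, compare it with an ordinary name in the same position. This is a sketch that rebuilds wide_real by hand from the values on the slide. The names too_long, "variable", and "score" are made up for illustration, and cols = -id is used as shorthand for "every column except id".

```r
library(tidyr)
library(tibble)

# Rebuild the wide_real example by hand (values copied from the slide)
wide_real <- tibble(
  id = c("S001", "S002"),
  stress_T1 = c(10, 12),
  stress_T2 = c(14, 11),
  mood_T1 = c(5, 6),
  mood_T2 = c(7, 5)
)

# Ordinary name: "stress" and "mood" become values in a "variable" column,
# so stress and mood scores end up stacked in a single "score" column
too_long <- wide_real |>
  pivot_longer(
    cols = -id,
    names_sep = "_",
    names_to = c("variable", "time"),
    values_to = "score"
  )

# ".value": "stress" and "mood" become column headers instead
long_real <- wide_real |>
  pivot_longer(
    cols = -id,
    names_sep = "_",
    names_to = c(".value", "time")
  )

dim(too_long)  # 8 rows, 4 columns (id, variable, time, score)
dim(long_real) # 4 rows, 4 columns (id, time, stress, mood)
```

The first result has one row per person, variable, and time point, which is too long for MLM. The second has one row per person and time point, which is the layout we want.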

Parsing Numbers

# What if we want to extract the numeric part of time ("T2" -> 2)?
long_real

# A tibble: 4 × 4
  id    time  stress  mood
  <chr> <chr>  <dbl> <dbl>
1 S001  T1        10     5
2 S001  T2        14     7
3 S002  T1        12     6
4 S002  T2        11     5

# We can use parse_number() inside of mutate()
long_real <- 
  long_real |> 
  mutate(time = parse_number(time)) |> 
  print()

# A tibble: 4 × 4
  id     time stress  mood
  <chr> <dbl>  <dbl> <dbl>
1 S001      1     10     5
2 S001      2     14     7
3 S002      1     12     6
4 S002      2     11     5
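
As a small sketch of how parse_number() behaves on its own, the example below uses a made-up vector of time labels (including a two-digit wave) and contrasts it with as.numeric(), which cannot handle the letter prefix.

```r
library(readr)

# Hypothetical time labels, including a two-digit wave
labels <- c("T1", "T2", "T10")

# parse_number() drops the non-numeric characters around the first number
parsed <- parse_number(labels)
parsed

# as.numeric() cannot do this: it returns NA (with a warning) for "T1"
coerced <- suppressWarnings(as.numeric(labels))
coerced
```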

The Payoff: Visualizing Change

# Now that our data is in long-format, we can easily plot it by participant!
ggplot(long_real, aes(x = time, y = stress, color = id)) +
  geom_line() + 
  geom_point() +
  labs(title = "Individual Stress Trajectories") +
  theme_bw(base_size = 20)

Handling Hierarchy

Split–Apply–Combine

In MLM, we often need to calculate statistics per cluster
(e.g., school average math score). We do this using .by.

Split the data into groups (Schools).
Apply a calculation to each group (Mean).
Combine the results back into the dataframe.

summarize(.by) returns one row per cluster, which is useful for data exploration
mutate(.by) keeps every row and adds the cluster statistic, which prepares the data for MLM

Aggregating (Summarize)

# To create an L2 dataset (e.g., summaries per school), we use summarize(.by)
school_means <- 
  heck |> 
  summarize(
    .by = school,
    school_math_mean = mean(math, na.rm = TRUE),
    n_students = n()
  ) |> 
  print()

# A tibble: 419 × 3
  school school_math_mean n_students
   <dbl>            <dbl>      <int>
1      1             59.0         12
2      2             63.6         13
3      3             47.6         18
4      4             65.7         17
5      5             48.1         17
6      6             61.0         16
# ℹ 413 more rows

Annotating (Mutate)

# To simply add L2 summaries to the L1 dataset, we use mutate(.by)
heck_with_means <- 
  heck |> 
  mutate(
    .by = school,
    school_math_mean = mean(math, na.rm = TRUE)
  ) |> 
  print()

# A tibble: 6,871 × 8
  school student female    ses  math  puniv public school_math_mean
   <dbl>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>            <dbl>
1      1    6701      1  0.586  47.1 0.0833      0             59.0
2      1    6702      1  0.304  63.6 0.0833      0             59.0
3      1    6703      1 -0.544  57.7 0.0833      0             59.0
4      1    6704      0 -0.848  53.9 0.0833      0             59.0
5      1    6705      0  0.001  58.0 0.0833      0             59.0
6      1    6706      0 -0.106  59.9 0.0833      0             59.0
# ℹ 6,865 more rows

Tip

Note that school_math_mean is repeated for every student in the same school. This is what MLM wants!
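
One common reason to keep the cluster mean on every row is that it lets us split a predictor into a between-cluster part and a within-cluster part. The sketch below is an illustration under assumptions: the toy data and the names school_ses_mean and ses_within are hypothetical (not from the heck dataset), and the .by argument requires dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 2 schools with 2 students each
toy <- tibble(
  school = c(1, 1, 2, 2),
  ses = c(-1, 1, 2, 4)
)

toy_centered <- toy |>
  mutate(
    .by = school,
    school_ses_mean = mean(ses, na.rm = TRUE), # between part (same within a school)
    ses_within = ses - school_ses_mean         # within part (deviation from own school)
  )

toy_centered
```

Because school_ses_mean sits on every student's row, the subtraction needs no join or lookup.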

Merging: The Setup

Sometimes your data comes in separate files per level…

1. Level 1 Data (People)

Contains individual variables and the grouping ID (country).

L1_people

# A tibble: 4 × 4
    pid sex   income country
  <int> <chr>  <dbl> <chr>  
1     1 M         50 Japan  
2     2 F         55 Japan  
3     3 F         42 France 
4     4 M         48 France

2. Level 2 Data (Countries)

Contains cluster-level variables and the same grouping ID (country).

L2_countries

# A tibble: 4 × 3
  country continent   gdp
  <chr>   <chr>     <dbl>
1 China   Asia      38.2 
2 Japan   Asia       6.45
3 France  Europe     4.29
4 Germany Europe     6.14

Merging: The Join

merged_data <- 
  left_join(
    x = L1_people,    # The Level 1 file (Start with the detailed data)
    y = L2_countries, # The Level 2 file (Bring in the context)
    by = "country"    # The linking variable (cluster ID)
  ) |> 
  print()

# A tibble: 4 × 6
    pid sex   income country continent   gdp
  <int> <chr>  <dbl> <chr>   <chr>     <dbl>
1     1 M         50 Japan   Asia       6.45
2     2 F         55 Japan   Asia       6.45
3     3 F         42 France  Europe     4.29
4     4 M         48 France  Europe     4.29

Notice that Asia and 6.45 were copied for both participants in Japan
This creates the rectangular, long-format dataset required for MLM
Notice that China and Germany are excluded: left_join() keeps every row of x and only the matching rows of y (no participants from there!)
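
Before trusting a join, it can be worth checking which keys fail to match in each direction. The sketch below rebuilds both example tables by hand from the values on the slides and uses anti_join(), which returns the rows of its first table that have no match in its second. The names unmatched_people and unused_countries are made up for illustration.

```r
library(dplyr)
library(tibble)

# Rebuild the Level 1 and Level 2 examples by hand (values copied from the slides)
L1_people <- tibble(
  pid = 1:4,
  sex = c("M", "F", "F", "M"),
  income = c(50, 55, 42, 48),
  country = c("Japan", "Japan", "France", "France")
)

L2_countries <- tibble(
  country = c("China", "Japan", "France", "Germany"),
  continent = c("Asia", "Asia", "Europe", "Europe"),
  gdp = c(38.2, 6.45, 4.29, 6.14)
)

# Level 1 rows with no Level 2 match (these would get NA after left_join)
unmatched_people <- anti_join(L1_people, L2_countries, by = "country")

# Level 2 rows with no Level 1 match (these are dropped by left_join)
unused_countries <- anti_join(L2_countries, L1_people, by = "country")

nrow(unmatched_people)   # 0: every person has a matching country
unused_countries$country # "China" "Germany"
```

An empty unmatched_people table is the important check: any rows there would end up with missing Level 2 values in the merged data.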