Multilevel Missing Data
Lecture 9: April 30, 2025
Section 8.1: Chapter Overview
This chapter focuses on handling missing data in multilevel data sets.
What is Multilevel Data?
- Hierarchically Structured: Observations are nested within higher-level organizational units
- Ubiquitous: Found in many disciplines
- Nesting: Can involve two or more levels (e.g., Level 1 nested in Level 2, or Level 1 in Level 2 in Level 3)
Examples of Multilevel Data
- Repeated measurements nested within persons
- Students grouped in classrooms or schools
- Romantic partners paired within dyads
- Employees nested within organizations
- Survey respondents grouped in geographical regions
- Clients clustered within therapists
- Three-level example: Repeated measurements nested within students, students nested in schools
Data Analysis Focus
- Multilevel Regression Models: Specifically models with random effects
- These are very common tools for analyzing hierarchical data
- Recommended Resources:
- Gelman & Hill (2007)
- Hoffman (2015)
- Hox, Moerbeek, & Van de Schoot (2017)
- Raudenbush & Bryk (2002)
- Snijders & Bosker (2012)
- Verbeke & Molenberghs (2000)
Missing Data Handling in Multilevel Models
- Recent Development: Methods specifically for multilevel missing data are relatively new and important.
- Chapter Focus:
- Bayesian Estimation
- Model-Based Imputation (often handles these problems better than ML)
- Joint Model Imputation
- Fully Conditional Specification (FCS / MICE)
- Maximum Likelihood (ML)
- Core Idea: MCMC estimates factored regressions (a focal model plus supporting models) to build distributions for the missing values; imputations are predictions plus noise from a multilevel model (sketched below)
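To make the prediction-plus-noise idea concrete, here is a minimal sketch for one missing outcome under a random intercept model. All values are illustrative placeholders, not estimates from the chapter's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Current MCMC draws (illustrative placeholders, not real estimates)
beta0, beta1 = 50.0, 2.0   # fixed intercept and slope
b0_j = 1.5                 # random intercept for this case's cluster j
sigma_eps = 4.0            # within-cluster residual SD
x1_ij = 0.8                # observed level-1 predictor value

# Imputation = prediction from cluster j's regression line + normal noise
y_imp = (beta0 + b0_j) + beta1 * x1_ij + rng.normal(0.0, sigma_eps)
```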
Section 8.2: Random Intercept Regression Models
Example: Math Problem-Solving Study
- Data: Educational experiment data (from companion website)
- Structure: 2-Level Hierarchy
- Level 1: Students (\(n_j \approx 34\) per school)
- Level 2: Schools (\(J=29\))
- Design: Schools randomly assigned to experimental (new curriculum) or comparison (standard curriculum) condition
- Outcome (Y): End-of-year math problem-solving assessment (IRT scores, 37-65)
Key Feature: Multilevel Variation
- Variation and covariation exist at both levels
- Student scores vary within schools
- School average scores vary between schools
- Intraclass Correlation (ICC): Proportion of total variance at Level 2
- Example: ICC \(\approx 0.26\) for problem-solving scores means ~26% of variance is between schools (see the snippet after this list)
- Predictors: Level-1 predictors (like student math scores) can also have within- and between-group variation. Missing data methods need to preserve this.
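The ICC is just a ratio of variance components; a minimal sketch with made-up values that reproduce the \(\approx 0.26\) figure:

```python
# Hypothetical variance components, e.g., from an "empty" multilevel model
sigma2_between = 9.0    # level-2 (school) variance
sigma2_within = 26.0    # level-1 (student) variance

icc = sigma2_between / (sigma2_between + sigma2_within)
print(round(icc, 2))    # 0.26: ~26% of total variance is between schools
```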
Random Intercept Model: Concept
- Definition: A regression model where the intercept coefficient (\(\beta_{0j}\)) is allowed to vary randomly across Level-2 units (groups/clusters)
- Suitability: Amenable to various missing data handling methods (agnostic imputation, ML)
- Example Predictors:
- Level 1: Standardized math scores (STANMATH)
- Level 2: Teacher experience (TEACHEXP)
Level-1 Model (Within-Cluster)
Describes variation among students within the same school
\[
Y_{ij} = \beta_{0j} + \beta_{1}X_{1ij} + \epsilon_{ij}
\]
- \(Y_{ij}\): Outcome for student \(i\) in school \(j\)
- \(X_{1ij}\): Level-1 predictor value for student \(i\) in school \(j\)
- \(\beta_{0j}\): Random intercept for school \(j\) (Varies across schools)
- \(\beta_{1}\): Common (fixed) slope for \(X_1\) (Same across schools)
- \(\epsilon_{ij}\): Level-1 residual (within-school error) for student \(i\)
- Assumed \(\epsilon_{ij} \sim N(0, \sigma_{\epsilon}^2)\)
Level-2 Model (Between-Cluster)
Models the variation in the school-specific intercepts (\(\beta_{0j}\)).
\[
\beta_{0j} = \beta_{0} + \beta_{2}X_{2j} + b_{0j}
\]
- \(\beta_{0j}\): The random intercept (outcome in this model)
- \(\beta_{0}\): Grand mean intercept (average intercept across all schools)
- \(X_{2j}\): Level-2 predictor for school \(j\)
- \(\beta_{2}\): Effect of Level-2 predictor \(X_2\) on the intercept
- \(b_{0j}\): Level-2 residual (random effect) for school \(j\). Captures unexplained variation in intercepts.
- Assumed \(b_{0j} \sim N(0, \sigma_{b_0}^2)\)
Combined Model (Single Equation)
Substitute the Level-2 equation into the Level-1 equation:
\[
Y_{ij} = (\beta_{0} + b_{0j}) + \beta_{1}X_{1ij} + \beta_{2}X_{2j} + \epsilon_{ij}
\]
Alternatively:
\[
Y_{ij} = E(Y_{ij}|X_{1ij}, X_{2j}) + \epsilon_{ij}
\]
- Assumes \(Y_{ij} \sim N(E(Y_{ij}|X_{1ij}, X_{2j}), \sigma_{\epsilon}^2)\)
- Interpretation: Outcome scores \(Y_{ij}\) are normally distributed around the predicted values derived from their specific school’s regression line
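To make the combined equation concrete, the sketch below simulates data from it using made-up parameter values and the example's structure (29 schools, 34 students each):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate from the combined random intercept model (illustrative values)
J, n_j = 29, 34                          # schools, students per school
beta0, beta1, beta2 = 52.0, 3.0, 0.5     # fixed effects
sigma_b0, sigma_eps = 2.5, 4.0           # random intercept SD, residual SD

x2 = rng.normal(10.0, 3.0, size=J)       # level-2 predictor (one per school)
b0 = rng.normal(0.0, sigma_b0, size=J)   # school random effects b_0j
school = np.repeat(np.arange(J), n_j)    # cluster index for each student
x1 = rng.normal(0.0, 1.0, size=J * n_j)  # level-1 predictor

# Y_ij = (beta0 + b_0j) + beta1*X1_ij + beta2*X2_j + eps_ij
y = (beta0 + b0[school] + beta1 * x1 + beta2 * x2[school]
     + rng.normal(0.0, sigma_eps, size=J * n_j))
```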
Factored Regression Specification
- Concept: Expresses the joint distribution of all variables (\(Y, X_1, X_2\)) as a product of simpler conditional distributions using the probability chain rule. Essential for handling missing predictors
- Two Main Approaches:
- Sequential Specification
- Partially Factored Specification
1. Sequential Specification
Factorizes the joint distribution into a product of univariate conditional distributions:
\[
f(Y, X_1, X_2) = f(Y | X_1, X_2) \times f(X_1 | X_2) \times f(X_2)
\]
- \(f(Y | X_1, X_2)\): The focal analysis model
- \(f(X_1 | X_2)\): A model for the Level-1 predictor, conditional on Level-2 predictor(s)
- \(f(X_2)\): A model for the Level-2 predictor
- Important Ordering: Lower-level variables condition on higher-level variables
Sequential: Predictor Models
- Level-1 Predictor Model: A random intercept model for \(X_1\) \[ X_{1ij} = (\gamma_{01} + g_{01j}) + \gamma_{11}X_{2j} + r_{1ij} \]
- \(\gamma\): Regression coefficients
- \(g_{01j}\): Random intercept residual for \(X_1 \sim N(0, \sigma_{g_{01}}^2)\). Captures between-school variation in average \(X_1\)
- \(r_{1ij}\): Within-cluster residual for \(X_1 \sim N(0, \sigma_{r_1}^2)\). Captures within-school variation in \(X_1\)
- Level-2 Predictor Model: An “empty” (intercept-only) single-level model for \(X_2\); if \(X_2\) were the aggregate of a level-1 variable, the corresponding level-2 latent mean would take its place \[ X_{2j} = \gamma_{02} + r_{2j} \]
- \(\gamma_{02}\): Grand mean of \(X_2\)
- \(r_{2j}\): Between-cluster residual for \(X_2 \sim N(0, \sigma_{r_2}^2)\)
2. Partially Factored Specification
Factorizes into the focal model and a joint multivariate distribution for predictors:
\[
f(Y, X_1, X_2) = f(Y | X_1, X_2) \times f(X_1, X_2)
\]
- \(f(X_1, X_2)\): A two-part (Level-1 and Level-2) multivariate normal distribution
- Decomposition: Splits L1 predictor \(X_1\) into components: \[ X_{1ij} = \underbrace{\mu_{1}}_{\text{Grand Mean}} + \underbrace{(\mu_{1j} - \mu_{1})}_{\text{Between Cluster Dev.}} + \underbrace{(X_{1ij} - \mu_{1j})}_{\text{Within Cluster Dev.}} \]
- \(\mu_{1j}\) is the latent group mean for cluster \(j\)
Partially Factored: Predictor Models
- Within-Cluster Model for \(X_1\): \(X_1\) scores as deviations around “latent” (not directly observed) group means \[ X_{1ij} = \mu_{1j} + r_{1ij(W)} \quad \text{where} \quad X_{1ij} \sim N(\mu_{1j}, \sigma_{r_{1(W)}}^2) \]
- \(r_{1ij(W)}\): Within-cluster residual
- Between-Cluster Model for \((\mu_{1j}, X_{2j})\): Latent means (\(\mu_{1j}\)) and L2 scores (\(X_{2j}\)) are bivariate normal. \[ \begin{pmatrix} \mu_{1j} \\ X_{2j} \end{pmatrix} = \begin{pmatrix} \mu_{1} \\ \mu_{2} \end{pmatrix} + \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \end{pmatrix} \quad \text{where} \quad \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \end{pmatrix} \sim N_2 \left( \mathbf{0}, \Sigma_{(B)} \right) \]
- \(\Sigma_{(B)}\) is the \(2 \times 2\) between-cluster covariance matrix
- Note: Can also be parameterized via round-robin regressions. Partially factored is ideal for models with centered predictors
Distribution of Missing Values: Outcome \(Y\)
- Defined solely by the focal analysis model (Combined Model)
- The posterior predictive distribution for \(Y_{ij(mis)}\) is: \[ Y_{ij(mis)} \sim N(E(Y_{ij}|X_{1ij}, X_{2j}), \sigma_{\epsilon}^2) \]
- Imputation: Draw a random value from this normal distribution
- The mean \(E(Y_{ij}|...)\) depends on the specific cluster \(j\)’s intercept (\(\beta_{0j}\))
- The variance \(\sigma_{\epsilon}^2\) is the within-cluster residual variance
Distribution of Missing Values: Predictor \(X_1\)
- Drawn from the conditional distribution \(f(X_1 | Y, X_2)\)
- This distribution is proportional to the product of the focal model and the predictor model for \(X_1\): \[ f(X_1 | Y, X_2) \propto f(Y | X_1, X_2) \times f(X_1 | X_2) \]
- Combining the kernels yields a normal distribution: \[ f(X_{1ij(mis)} | Y_{ij}, X_{2j}) = N(E(X_{1ij}|...), Var(X_{1ij}|...)) \]
- Mean and Variance: Depend on parameters from both the focal model (\(\beta\)’s, \(\sigma_{\epsilon}^2\)) and the predictor model (\(\mu_{1j}\), \(\sigma_{r_{1(W)}}^2\)). Includes random effects (\(\beta_{0j}, \mu_{1j}\))
Distribution of Missing Values: Predictor \(X_2\)
- Drawn from the conditional distribution \(f(X_2 | Y, X_1)\)
- Using the partially factored specification: \[ f(X_2 | Y, X_1) \propto f(Y | X_1, X_2) \times f(X_2 | X_1) \]
- \(f(Y | X_1, X_2)\) is from the focal model; \(f(X_2 | X_1)\) is from the between-cluster predictor model
- Important: \(X_2\) is constant within cluster \(j\). The focal model’s contribution is repeated \(n_j\) times: \[ f(X_{2j(mis)}|...) \propto \left[ \prod_{i=1}^{n_j} N(E(Y_{ij}|...), \sigma_{\epsilon}^2) \right] \times N(E(X_{2j}|\mu_{1j}), \sigma_{r_{2(B)}}^2) \]
- Imputation: Often requires sampling methods like Metropolis-Hastings due to complexity
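A minimal sketch of one random-walk Metropolis-Hastings step for a single missing \(X_{2j}\) under the factorization above. The data, parameter values, and tuning SD are all illustrative; real samplers tune the proposal and iterate many times.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(x2, y, x1, b0_j, beta, sigma_eps, x2_mean, sigma_x2):
    """Log of the (unnormalized) full conditional for a missing X2_j:
    the focal model's likelihood repeated over the n_j cluster members,
    times the between-cluster predictor model's normal density."""
    resid_y = y - ((beta[0] + b0_j) + beta[1] * x1 + beta[2] * x2)
    log_focal = -0.5 * np.sum(resid_y**2) / sigma_eps**2
    log_prior = -0.5 * (x2 - x2_mean)**2 / sigma_x2**2
    return log_focal + log_prior

# Illustrative data for one cluster with n_j = 5 members
y = np.array([55.0, 58.0, 52.0, 60.0, 57.0])
x1 = np.array([0.2, 0.9, -0.4, 1.1, 0.5])
beta, b0_j, sigma_eps = (52.0, 3.0, 0.5), 1.0, 4.0
x2_mean, sigma_x2 = 10.0, 3.0

# One Metropolis-Hastings step with a random-walk proposal
x2_cur = 10.0
x2_prop = x2_cur + rng.normal(0.0, 1.0)   # tuning SD = 1.0 (arbitrary)
log_ratio = (log_target(x2_prop, y, x1, b0_j, beta, sigma_eps, x2_mean, sigma_x2)
             - log_target(x2_cur, y, x1, b0_j, beta, sigma_eps, x2_mean, sigma_x2))
if np.log(rng.uniform()) < log_ratio:
    x2_cur = x2_prop   # accept the proposal; otherwise keep the current value
```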
MCMC Algorithm: Overview
- Posterior Distribution: Complex function describing joint probability of parameters, random effects, latent means, and missing values given observed data
- Core Logic: Extends standard Bayesian MCMC
- Estimate one unknown quantity at a time (parameter, latent variable, missing value)
- Condition on the current values of all other quantities
- Full conditional distributions are available in the literature
MCMC Algorithm: Generic Recipe
- Assign starting values (parameters, random effects, missing values)
- Do for \(t=1\) to T iterations:
- Estimate focal model’s parameters, given everything else
- Estimate focal model’s random effects, given everything else
- Estimate each predictor model’s parameters, given everything else
- Estimate each predictor model’s random effects, given everything else
- Impute dependent variable (\(Y_{mis}\)) given focal model parameters
- Impute each predictor (\(X_{mis}\)) given focal and supporting models
- Repeat (a runnable toy version of this loop is sketched below)
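The sketch below is a toy version of the recipe for an *empty* random intercept model with missing outcomes: made-up data, Jeffreys-type priors, and no supporting predictor models, which real implementations would add as extra steps in the loop.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy data: J clusters of size n_j with ~20% of outcomes missing at random
J, n_j = 20, 15
true_b = rng.normal(0.0, 2.0, size=J)
school = np.repeat(np.arange(J), n_j)
y = 50.0 + true_b[school] + rng.normal(0.0, 4.0, size=J * n_j)
mis = rng.uniform(size=y.size) < 0.2
y[mis] = np.nan

# Starting values
y[mis] = np.nanmean(y)
beta0, b0, sig2_e, sig2_b = y.mean(), np.zeros(J), 1.0, 1.0

for t in range(2000):
    # 1) Focal model parameters, given everything else
    beta0 = rng.normal((y - b0[school]).mean(), np.sqrt(sig2_e / y.size))
    e = y - beta0 - b0[school]
    sig2_e = 1.0 / rng.gamma(y.size / 2.0, 2.0 / np.sum(e**2))
    sig2_b = 1.0 / rng.gamma(J / 2.0, 2.0 / np.sum(b0**2))
    # 2) Random effects, given everything else
    prec = n_j / sig2_e + 1.0 / sig2_b
    mean_b = (np.bincount(school, weights=y - beta0) / sig2_e) / prec
    b0 = rng.normal(mean_b, np.sqrt(1.0 / prec))
    # 3) Impute missing outcomes: prediction + noise
    y[mis] = (beta0 + b0[school[mis]]
              + rng.normal(0.0, np.sqrt(sig2_e), size=mis.sum()))
```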
Analysis Example: Setup (Math Study)
- Model: Random intercept regression \[ PROBSOLVE_{ij} = (\beta_{0} + b_{0j}) + \beta_{1}PRETEST_{ij} + \beta_{2}STANMATH_{ij} \] \[ + \beta_{3}FRLUNCH_{ij} + \beta_{4}TEACHEXP_{j} + \beta_{5}CONDITION_{j} + \epsilon_{ij} \]
- Predictors:
- L1: PRETEST (complete), STANMATH (7.3% missing), FRLUNCH (binary, 4.7% missing)
- L2: TEACHEXP (10.3% missing), CONDITION (complete)
- Outcome: PROBSOLVE (20.5% missing)
- Goal: Estimate treatment effect (\(\beta_5\)) controlling for covariates
NOTE: The example analysis is likely misspecified: the level-1 predictors enter without their corresponding level-2 (group mean) components, so within- and between-cluster effects are conflated.
Analysis Example: Factored Regression Options
- Sequential: Product of univariate distributions. Order: Incomplete L1, Complete L1, Incomplete L2, Complete L2. Use latent response for binary predictors (FRLUNCH).
- Partially Factored: Multivariate normal for predictors (PRETEST, STANMATH, FRLUNCH, TEACHEXP, CONDITION)
- Level 1 predictors decomposed
- Within-cluster model: Correlated deviations around latent group means (\(\mu_j\))
- Between-cluster model: Latent group means + L2 vars are multivariate normal
- Preferred here due to centering
Section 8.3: Random Coefficient Models
Example: Daily Diary Study (Health Psych)
- Data: \(J=132\) participants with chronic pain
- Structure: 2-Level Hierarchy (Repeated Measures)
- Level 1: Daily assessments (up to \(n_j=21\) days: mood, sleep, pain)
- Level 2: Persons
- Variables: L1 (daily), L2 (person-level demographics, psych variables like pain acceptance, catastrophizing)
- ICC Example: Positive Affect ICC \(\approx 0.63\). High between-person variation typical for repeated measures
Random Coefficient (Slope) Model: Concept
- Definition: A multilevel regression where the influence (slope) of one or more Level-1 predictors varies across Level-2 units
- Example:
- L1: Daily PAIN (\(X_1\)) predicting daily POSAFFECT (\(Y\))
- L2: Person-average PAIN (\(\mu_{1j}\)) and PAINACCEPT (\(X_2\)) predicting average POSAFFECT
- Key: The effect of daily PAIN on POSAFFECT can differ from person to person
Level-1 Model (Within-Person)
Describes daily variation within the same person.
\[
Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \mu_{1j}) + \epsilon_{ij}
\]
- \(Y_{ij}\): Positive affect for person \(j\) on day \(i\)
- \(X_{1ij}\): Pain rating for person \(j\) on day \(i\)
- \(\beta_{0j}\): Random intercept for person \(j\) (person \(j\)’s average affect)
- \(\beta_{1j}\): Random slope for person \(j\) (person \(j\)’s daily pain-affect association)
- \(\mu_{1j}\): Latent person-mean for \(X_1\) (pain). \(X_1\) is group-mean centered
- \(\epsilon_{ij}\): Level-1 residual (\(\sim N(0, \sigma_{\epsilon}^2)\))
Interpretation: Regression lines vary in intercept and slope across persons
Level-2 Model (Between-Person)
Models variation in person-specific intercepts (\(\beta_{0j}\)) and slopes (\(\beta_{1j}\))
\[
\beta_{0j} = \beta_{0} + \beta_{2}(\mu_{1j} - \mu_{1}) + \beta_{3}(X_{2j} - \mu_{2}) + b_{0j}
\] \[
\beta_{1j} = \beta_{1} + b_{1j}
\]
- \(\beta_0, \beta_1\): Grand mean intercept & slope
- \(\mu_{1j}\): Latent person-mean pain (predictor for \(\beta_{0j}\))
- \(X_{2j}\): Pain acceptance (predictor for \(\beta_{0j}\))
- \(\mu_1, \mu_2\): Grand means for predictors (used for centering)
- \(\beta_2, \beta_3\): Fixed effects of L2 predictors on intercept
- \(b_{0j}, b_{1j}\): L2 random effects (residuals for intercept & slope). Assumed \(\begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim N_2(\mathbf{0}, \Sigma_b)\)
Missing Data Challenge: Nonlinearity
- Random coefficient models feature a product term: \(X_{1ij} \times \beta_{1j}\)
- Level-1 predictor (\(X_1\)) multiplied by a Level-2 latent variable/random effect (\(\beta_{1j}\))
- Problem: Handling missingness when the L1 predictor (\(X_1\)) involved in the random slope is incomplete is challenging
- Some methods (e.g., current ML estimators) are prone to substantial bias
- Solution: Bayesian estimation / Model-based MI using factored regression is effective
Factored Regression Specification
- Partially Factored: Ideal due to centering \[ f(Y | X_1, X_2) \times f(X_1, X_2) \]
- The predictor model \(f(X_1, X_2)\) is specified as before
- The fact that the focal model \(f(Y|...)\) has a random slope doesn’t change the form of the predictor model specification
Distribution of Missing Values: Outcome \(Y\)
- Defined solely by the focal analysis model
- Posterior predictive distribution for \(Y_{ij(mis)}\): \[ Y_{ij(mis)} \sim N(E(Y_{ij}|X_{1ij}, X_{2j}), \sigma_{\epsilon}^2) \]
- Imputation: Draw from this normal distribution
- Mean \(E(Y_{ij}|...)\) incorporates person \(j\)’s specific intercept (\(\beta_{0j}\)) AND slope (\(\beta_{1j}\))
Distribution of Missing Values: Predictor \(X_1\)
- Drawn from conditional distribution \(f(X_1 | Y, X_2) \propto f(Y | X_1, X_2) \times f(X_1 | X_2)\)
- Combining kernels yields a normal distribution: \[ f(X_{1ij(mis)} | Y_{ij}, X_{2j}) = N(E(X_{1ij}|...), Var(X_{1ij}|...)) \]
- \(E(X_{1ij}|...) = Var(X_{1ij}|...) \times \left( \frac{\mu_{1j}}{\sigma_{r_{1(W)}}^2} + \frac{\beta_{1j}(Y_{ij} - (\beta_{0j} + \beta_{2}(\mu_{1j}-\mu_1) + ...))}{\sigma_{\epsilon}^2} \right)\)
- \(Var(X_{1ij}|...) = \left( \frac{1}{\sigma_{r_{1(W)}}^2} + \frac{\beta_{1j}^2}{\sigma_{\epsilon}^2} \right)^{-1}\)
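A direct implementation of the two formulas with illustrative values. The names `x1_conditional` and `pred_rest` are hypothetical; `pred_rest` collects every term in \(E(Y_{ij})\) except the \(\beta_{1j}X_{1ij}\) product. Note how the variance changes with the slope, which previews the next point.

```python
# Conditional distribution of a missing X1 under a random coefficient
# model, implementing the mean and variance formulas above.
def x1_conditional(y, mu_1j, beta1_j, pred_rest, sig2_r_w, sig2_eps):
    var = 1.0 / (1.0 / sig2_r_w + beta1_j**2 / sig2_eps)
    mean = var * (mu_1j / sig2_r_w + beta1_j * (y - pred_rest) / sig2_eps)
    return mean, var

# Two persons with different random slopes get different imputation
# variances (heteroscedasticity). All values are illustrative.
mu_1j, beta0_j = 4.0, 3.5
for beta1_j in (-0.2, -1.5):
    pred_rest = beta0_j - beta1_j * mu_1j   # centered model, no covariates
    m, v = x1_conditional(y=3.0, mu_1j=mu_1j, beta1_j=beta1_j,
                          pred_rest=pred_rest, sig2_r_w=1.0, sig2_eps=0.5)
    print(f"slope {beta1_j:5.2f}: mean {m:.2f}, variance {v:.2f}")
```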
Key Issue: Heteroscedasticity in \(X_1\) Imputations
- Look at the variance term for \(X_{1ij(mis)}\): \[ Var(X_{1ij}|Y_{ij}, X_{2j}) = \left( \frac{1}{\sigma_{r_{1(W)}}^2} + \frac{\boldsymbol{\beta_{1j}^2}}{\sigma_{\epsilon}^2} \right)^{-1} \]
- The variance (spread) of the imputation distribution depends on the person-specific random slope squared (\(\beta_{1j}^2\))
- Implication: Imputations for \(X_1\) should have different variances for different people. This is heteroscedastic
- Problem: Methods assuming multivariate normality for imputation (e.g., standard FCS, current ML in SEM) cannot capture this heteroscedasticity easily and are prone to bias (especially underestimating slope variance)
Analysis Example: Setup (Diary Study)
- Model: Random coefficient model for POSAFFECT \[ POSAFFECT_{ij} = \beta_{0j} + \beta_{1j}(PAIN_{ij} - \mu_{1j}) + \beta_{2}(SLEEP_{ij} - \mu_{2}) \] \[ + \beta_{3}(\mu_{1j} - \mu_{1}) + \beta_{4}(PAINACCEPT_{j} - \mu_{4}) + \beta_{5}FEMALE_{j} + \epsilon_{ij} \]
- Predictors:
- L1: PAIN (random slope, latent group-mean centered), SLEEP (fixed slope, grand-mean centered)
- L2: Mean PAIN (\(\mu_{1j}\)), PAINACCEPT, FEMALE (binary)
- Factored Regression: Partially factored; the predictor model \(f(\ldots)\) includes SLEEP, PAINACCEPT, and FEMALE\(^*\) (latent response for the binary FEMALE)
Section 8.4: Multilevel Interaction Effects
Example: Employee Empowerment Study
- Data: \(N=630\) employees from \(J=105\) workgroups/teams (\(n_j=6\))
- Structure: 2-Level Hierarchy
- Level 1: Employees
- Level 2: Workgroups/Teams
- Outcome: Employee Empowerment (\(Y\)). ICC \(\approx 0.11\)
- Variables: LMX (Leader-Member Exchange), Gender (MALE), Team Cohesion, Team Leadership Climate
Cross-Level Interaction: Concept
- Definition: The influence (slope) of a Level-1 predictor (\(X_1\)) on \(Y\) is moderated by a Level-2 predictor (\(X_2\))
- Often involves random slopes for \(X_1\), as the interaction term helps explain why slopes vary across groups
- Example:
- L1 Predictor: LMX (within-team relationship quality)
- L2 Moderator: CLIMATE (team leadership climate)
- Interaction: Does the LMX-Empowerment relationship depend on team Climate?
Model Specification
\[
Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \mu_{1j}) + \beta_{2}(X_{2ij}^* - \mu_{2j}) + \beta_{3}(X_{3j} - \mu_{3})
\] \[
+ \beta_{4}(X_{4j} - \mu_{4}) + \boldsymbol{\beta_{5}(X_{1ij} - \mu_{1j})(X_{4j} - \mu_{4})} + \epsilon_{ij}
\]
- \(Y_{ij}\): Empowerment
- \(X_{1ij}\): LMX (L1 predictor, random slope \(\beta_{1j}\), latent group-mean centered)
- \(X_{2ij}^*\): MALE (L1 covariate, latent group-mean centered)
- \(X_{3j}\): COHESION (L2 covariate, grand-mean centered)
- \(X_{4j}\): CLIMATE (L2 moderator, grand-mean centered)
- \(\beta_5\): Cross-level interaction coefficient
Factored Regression Specification
- Partially Factored: Preferred due to latent mean centering \[ f(Y | Xs, Interaction) \times f(X_1, X_2^*, X_3, X_4) \]
- Predictor Model \(f(...)\):
- L1 vars (\(X_1, X_2^*\)) decomposed into within/between components
- Within-Cluster: Models \((X_{1ij}, X_{2ij}^*)\) around latent group means \((\mu_{1j}, \mu_{2j})\) using \(\Sigma_{(W)}\). \(X_2^*\) is latent response for MALE
- Between-Cluster: Models \((\mu_{1j}, \mu_{2j}, X_{3j}, X_{4j})\) around grand means using \(\Sigma_{(B)}\)
Distribution of Missing Values
- Similar to Random Coefficient Model:
- Missing \(Y\): Drawn from focal model, involves \(\beta_{0j}, \beta_{1j}\)
- Missing Predictors (e.g., \(X_1\) = LMX):
- Conditional distribution depends on focal and predictor models
- Heteroscedasticity: Variance depends on random slope (\(\beta_{1j}^2\)) and interaction term involving \(\beta_5\)
- Requires methods that handle this (Bayesian/Model-Based MI). Metropolis-Hastings useful
Example: Educational Study Revisited (3 Levels)
- Data: Cluster-randomized trial from Sec 8.2, but now using longitudinal data
- Structure: 3-Level Hierarchy
- Level 1: Measurement Occasions (\(t=1...7\), monthly)
- Level 2: Students (\(i\))
- Level 3: Schools (\(j\))
- Outcome: Problem Solving score at each occasion (\(PROBSOLVE_{tij}\))
- Missing Data: Increased over time (~20% by final wave). Planned missingness in control group
Longitudinal Growth Curve Model
- Concept: Type of MLM where repeated measures are modeled as a function of time
- Time Predictor (MONTH): Codes passage of time (L1 predictor)
- Example coding: Relative to final assessment (\(MONTH = -6, -5, ..., 0\)). Intercept is end-of-year score
- Time Variation: MONTH only varies within students (L1). It is constant across students and schools (no L2 or L3 variance)
Level-1 Model (Within-Student)
Models individual change trajectories.
\[
PROBSOLVE_{tij} = \beta_{0ij} + \beta_{1ij}(MONTH_{tij}) + \epsilon_{tij}
\]
- \(\beta_{0ij}\): Student \(i\)’s expected score at MONTH=0 (end-of-year)
- \(\beta_{1ij}\): Student \(i\)’s linear rate of change per month
- Both \(\beta_{0ij}\) and \(\beta_{1ij}\) are random effects varying across students (L2) and schools (L3)
- \(\epsilon_{tij}\): Time-specific residual (\(\sim N(0, \sigma_{\epsilon}^2)\))
Level-2 Model (Between-Student)
Models student-level variation in intercepts (\(\beta_{0ij}\)) and slopes (\(\beta_{1ij}\))
\[
\beta_{0ij} = \beta_{0j} + \beta_{2}(STANMATH_{ij} - \mu_{2}) + \beta_{3}(FRLUNCH_{ij}^* - \mu_{3}) + b_{0ij}
\] \[
\beta_{1ij} = \beta_{1j} + b_{1ij}
\]
- \(\beta_{0j}, \beta_{1j}\): School \(j\)’s average intercept & slope
- \(STANMATH_{ij}, FRLUNCH_{ij}^*\): Student-level (L2) covariates predicting intercept (grand-mean centered)
- \(b_{0ij}, b_{1ij}\): Student-level random effects (deviations from school means). \(\sim N_2(\mathbf{0}, \Sigma_{b(L2)})\)
Level-3 Model (Between-School)
Models school-level variation in intercepts (\(\beta_{0j}\)) and slopes (\(\beta_{1j}\))
\[
\beta_{0j} = \beta_{0} + \beta_{4}(TEACHEXP_{j} - \mu_{4}) + \beta_{5}(CONDITION_{j}) + b_{0j}
\] \[
\beta_{1j} = \beta_{1} + \beta_{6}(CONDITION_{j}) + b_{1j}
\]
- \(\beta_0, \beta_1\): Grand mean intercept & slope (for control group, CONDITION=0)
- \(TEACHEXP_j, CONDITION_j\): School-level (L3) covariates
- \(\beta_5\): Intercept difference for intervention group (at MONTH=0)
- \(\beta_6\): Slope difference for intervention group (Treatment \(\times\) Time interaction)
- \(b_{0j}, b_{1j}\): School-level random effects. \(\sim N_2(\mathbf{0}, \Sigma_{b(L3)})\)
Factored Regression Specification
- Partially Factored: \(f(Y | Xs) \times f(Xs)\)
- Decomposition: L1 predictors decomposed into L1(within-L2), L2(within-L3), and L3 components \[ X_{1tij} = \mu_1 + (\mu_{1j}-\mu_1) + (\mu_{1ij}-\mu_{1j}) + (X_{1tij}-\mu_{1ij}) \]
- (Note: This formula is conceptual; specific models depend on variance components)
Factored Regression: Predictor Models
- MONTH Predictor: Only L1 variation. Modeled as deviation from grand mean. No L2/L3 random effects needed if complete \[ MONTH_{tij} = \mu_{1} + r_{1tij(W)} \]
- Level-2 Predictor Model: Models L2 vars (STANMATH, FRLUNCH*) around L3 latent means (\(\mu_{2j}, \mu_{3j}\)) using \(\Sigma_{(L2)}\) \[ X_{ij(L2)} \sim N_2(\boldsymbol{\mu}_j, \Sigma_{(L2)}) \]
- Level-3 Predictor Model: Models L3 latent means (\(\mu_{2j}, \mu_{3j}\)) + L3 vars (TEACHEXP, CONDITION*) around grand means using \(\Sigma_{(L3)}\) \[ X_{j(L3)} \sim N_4(\boldsymbol{\mu}, \Sigma_{(L3)}) \]
Multiple Imputation Strategies
Recap: Agnostic vs. Model-Based MI
- Agnostic Imputation: Imputation model differs from analysis model (e.g., Joint Modeling, FCS)
- Multilevel extensions exist
- Generally suitable for random intercept models
- Model-Based Imputation: Imputation model is tailored to (or is the same as) the analysis model (e.g., from Bayesian estimation)
- Essential for models with random coefficients, interactions, or other nonlinearities to avoid bias
Caution: Single-Level MI on Multilevel Data
- Applying standard (single-level) JM or FCS to multilevel data is problematic
- Why? These methods ignore the data hierarchy (nesting)
- They produce imputed values with no between-cluster variation
- Leads to biased estimates (e.g., attenuated L2 effects, incorrect SEs)
- Use multilevel versions of JM/FCS or model-based approaches instead
Fixed Effect Imputation: Concept
- An alternative strategy in some situations
- Method:
- Create dummy variables (\(D_k\)) for each Level-2 group (\(k=1...J\))
- Include these dummy variables as predictors in a single-level imputation model
- Effectively treats group membership as a fixed effect during imputation
Fixed Effect Imputation: When to Consider?
- May be useful when:
- The number of clusters (\(J\)) is very small (makes MLM estimation hard)
- Level-2 groups are not considered a random sample from a larger population
- Between-cluster differences are viewed as nuisance variation to be controlled, not phenomena of interest
Fixed Effect Imputation: Model
Example for imputing outcome \(Y\) in a random intercept context:
\[
Y_{ij} = \sum_{k=1}^{J} \gamma_{k} D_{kj} + \gamma_{J+1} X_{1ij} + \epsilon_{ij}
\]
- \(D_{kj}=1\) if unit \(i\) is in group \(k\), 0 otherwise
- \(\gamma_k\) is the estimated intercept for group \(k\)
- Uses absolute coding (all \(J\) dummies, no overall intercept \(\beta_0\))
- Note: Excludes L2 predictors (e.g., \(X_{2j}\)) because the dummy codes account for all between-group variance in \(Y\)
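A minimal sketch of the dummy-code strategy on simulated data. OLS stands in for a proper Bayesian draw of the \(\gamma\) coefficients, so this only illustrates the structure of the imputation model, not a full MI implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated two-level data with ~15% of outcomes missing
J, n_j = 10, 8
school = np.repeat(np.arange(J), n_j)
x1 = rng.normal(size=J * n_j)
y = 50.0 + rng.normal(0, 2, J)[school] + 3.0 * x1 + rng.normal(0, 4, J * n_j)
mis = rng.uniform(size=y.size) < 0.15

D = np.eye(J)[school]              # J dummy columns, one per cluster
X = np.column_stack([D, x1])       # absolute coding: no overall intercept
gamma, *_ = np.linalg.lstsq(X[~mis], y[~mis], rcond=None)
sigma = np.std(y[~mis] - X[~mis] @ gamma)

# Impute missing Y as prediction + noise from the fixed effect model
y_imp = y.copy()
y_imp[mis] = X[mis] @ gamma + rng.normal(0.0, sigma, size=mis.sum())
```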
Fixed Effect Imputation: Limitations
- Computationally simple
- Potential Biases:
- Can overcompensate for group differences, exaggerating between-group variation (Lüdtke et al., 2017). Bias worse with low ICC or small \(n_j\).
- May produce positively biased standard errors and inaccurate confidence intervals (Andridge, 2011; van Buuren, 2011)
- Practical Limitation: Difficult to extend beyond random intercept models. Preserving random slopes would require many product terms (\(D_{kj} \times X_{1ij}\))
Joint Model (JM) Imputation
JM Imputation: Overview
- Framework: Uses a multivariate normal distribution for continuous & latent response variables, decomposed into within- & between-cluster parts
- Approach: Often uses an “empty” multivariate model (all variables as outcomes) for imputation
- Handles: Missing data at L1/L2, categorical variables (via latent response)
- Standard JM Limitation: Assumes common within-cluster covariance (\(\Sigma_{(W)}\)) across groups. Suitable for random intercept models, but biased for random slope models
JM Model: Within-Cluster
Models L1 scores as correlated deviations around latent group means \(\boldsymbol{\mu}_j\)
\[
\mathbf{Y}_{ij(W)} = \begin{pmatrix} PROBSOLVE_{ij} \\ PRETEST_{ij} \\ STANMATH_{ij} \\ FRLUNCH_{ij}^* \end{pmatrix} = \boldsymbol{\mu}_j + \mathbf{r}_{ij(W)}
\]
- Assumes \(\mathbf{Y}_{ij(W)} \sim N_4(\boldsymbol{\mu}_j, \boldsymbol{\Sigma_{(W)}})\)
- \(\boldsymbol{\Sigma_{(W)}}\) is the common within-cluster covariance matrix
- \(FRLUNCH^*\) is latent response variable (variance fixed to 1)
- Defines posterior predictive distribution for L1 missing values
JM Model: Between-Cluster
Models latent group means (\(\boldsymbol{\mu}_j\)) and L2 variables
\[
\mathbf{Y}_{j(B)} = \begin{pmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \mu_{4j} \\ TEACHEXP_j \\ CONDITION_j^* \end{pmatrix} = \boldsymbol{\mu} + \mathbf{r}_{j(B)}
\]
- Assumes \(\mathbf{Y}_{j(B)} \sim N_6(\boldsymbol{\mu}, \boldsymbol{\Sigma_{(B)}})\)
- \(\boldsymbol{\Sigma_{(B)}}\) is the between-cluster covariance matrix
- \(CONDITION^*\) is latent response variable (variance fixed to 1)
- Defines posterior predictive distribution for L2 missing values & latent means
JM MCMC Algorithm
Generates imputations using Gibbs sampling:
- Initialize: Parameters (\(\boldsymbol{\mu}, \Sigma_{(W)}, \Sigma_{(B)}\)), latent means (\(\boldsymbol{\mu}_j\)), missing values
- Iterate (t=1…T):
- Estimate grand means \(\boldsymbol{\mu}\)
- Estimate latent group means \(\boldsymbol{\mu}_j\)
- Estimate between-cluster covariance \(\boldsymbol{\Sigma_{(B)}}\)
- Estimate within-cluster covariance \(\boldsymbol{\Sigma_{(W)}}\)
- Impute missing values (using conditional MVN distributions derived from current parameters)
- Repeat, saving M imputed data sets (via M parallel chains or burn-in/thinning within a single chain)
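The final step draws the missing elements of each row from the conditional multivariate normal given that row's observed elements. A sketch using the standard MVN conditioning formulas, with an illustrative \(\boldsymbol{\mu}_j\) and \(\Sigma_{(W)}\):

```python
import numpy as np

rng = np.random.default_rng(9)

def draw_missing_given_observed(y, mis, mu, Sigma, rng):
    """Draw missing elements of one row from the conditional MVN
    N(mu, Sigma) given its observed elements (standard conditioning
    formulas). Illustrative sketch of the JM imputation step."""
    o, m = ~mis, mis
    S_oo = Sigma[np.ix_(o, o)]
    S_mo = Sigma[np.ix_(m, o)]
    S_mm = Sigma[np.ix_(m, m)]
    w = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu[m] + w @ (y[o] - mu[o])
    cond_cov = S_mm - w @ S_mo.T
    return rng.multivariate_normal(cond_mean, cond_cov)

# One student's within-cluster vector (PROBSOLVE, PRETEST, STANMATH),
# with STANMATH missing; mu_j and Sigma_W are current MCMC values.
mu_j = np.array([55.0, 50.0, 0.0])
Sigma_W = np.array([[16.0, 8.0, 2.0],
                    [8.0, 25.0, 3.0],
                    [2.0, 3.0, 1.0]])
y_row = np.array([58.0, 54.0, np.nan])
mis = np.isnan(y_row)
y_row[mis] = draw_missing_given_observed(y_row, mis, mu_j, Sigma_W, rng)
```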
JM Extension: Random Within-Cluster Covariances
- Motivation: Standard JM assumes common \(\Sigma_{(W)}\), problematic for random slopes
- Solution (Yucel, 2011): Allow \(\Sigma_{(W)}\) to vary across L2 units (\(j\)) \[ \mathbf{Y}_{ij(W)} \sim N_k(\boldsymbol{\mu}_j, \boldsymbol{\Sigma_{j(W)}}) \]
- Between-Cluster Model: Remains the same
- Modeling \(\Sigma_{j(W)}\): Treat each \(\Sigma_{j(W)}\) as drawn from a common Wishart distribution (multivariate generalization of \(\chi^2\)).
- Wishart defined by pooled degrees of freedom & scale matrix
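A sketch of the idea using SciPy's Wishart distribution with illustrative values; dividing the scale matrix by the degrees of freedom centers the cluster-specific draws on a common matrix, since a Wishart with parameters \(\nu\) and \(S\) has mean \(\nu S\).

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(13)

# Common (pooled) scale matrix and degrees of freedom, illustrative values
scale = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
nu = 10.0

# Cluster-specific within-cluster covariance matrices Sigma_j(W):
# E[Wishart(nu, S)] = nu * S, so divide the scale by nu to center on `scale`
Sigma_j = [wishart.rvs(df=nu, scale=scale / nu, random_state=rng)
           for _ in range(3)]   # three clusters' draws
```

Larger pooled degrees of freedom pull the cluster-specific matrices closer to the common scale, which is how the model "borrows strength" across clusters.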
Random Covariance MCMC
- Similar to standard JM MCMC, but adds steps within the loop:
- Estimate pooled scale matrix (average \(\Sigma_{j(W)}\) structure)
- Estimate pooled degrees of freedom (related to average \(n_j\))
- Estimate cluster-specific \(\boldsymbol{\Sigma_{j(W)}}\) based on its data and borrowing strength via the Wishart prior
- Allows imputation even if variables are fully missing within some clusters
Fully Conditional Specification (FCS / MICE) Imputation
FCS Imputation: Overview
- Concept: Extends single-level FCS (MICE) to multilevel data (van Buuren, 2011)
- Method: Imputes variables one at a time using univariate regression models, conditional on all other variables. Cycles through variables iteratively
- Handles: L1, L2, L3 variables. Different model types (linear, logistic, etc.) per variable
- Standard FCS Limitation: Like JM, generally limited to random intercept models due to how random slopes are handled (or not handled)
Standard FCS: L1 Imputation Models
Uses random intercept regressions for L1 variables. Example:
- Continuous \(Y_{ij}\) (e.g., PROBSOLVE): \[ Y_{ij}^{(t)} = \gamma_{01j} + \gamma_{11}X_{1ij}^{(t-1)} + ... + r_{1ij} \]
- Binary \(Y_{ij}\) (e.g., FRLUNCH): \[ ln\left(\frac{Pr(Y_{ij}^{(t)}=1)}{1-Pr(Y_{ij}^{(t)}=1)}\right) = \gamma_{03j} + \gamma_{13}X_{1ij}^{(t)} + ... \]
- Impute \(Y_{ij(mis)}^{(t)}\) by drawing from posterior predictive distribution (Normal or Binomial). \((t)\) indicates iteration
Standard FCS: L2 Imputation Model
Uses single-level regression with cluster means (\(\bar{X}\)) of L1 vars as predictors
\[
X_{2j}^{(t)} = \gamma_{04} + \gamma_{14}\bar{Y}_{1j}^{(t)} + \gamma_{24}\bar{X}_{1j}^{(t)} + ... + \gamma_{54}X_{L2,j} + r_{04j}
\]
- \(\bar{Y}_{1j}^{(t)}, \bar{X}_{1j}^{(t)}\) are arithmetic averages of (imputed) L1 variables within cluster \(j\) at iteration \(t\)
- Impute \(X_{2j(mis)}^{(t)}\) by drawing from its posterior predictive distribution
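A minimal sketch of this level-2 step on simulated data: form arithmetic cluster means of a (currently imputed) L1 variable, then impute the L2 variable as prediction plus noise. OLS again stands in for a Bayesian parameter draw.

```python
import numpy as np

rng = np.random.default_rng(17)

# Simulated data: one imputed L1 variable and one incomplete L2 variable
J, n_j = 25, 10
school = np.repeat(np.arange(J), n_j)
y1 = rng.normal(50, 6, J * n_j)                   # imputed L1 variable
ybar = np.bincount(school, weights=y1) / np.bincount(school)

x2 = 5.0 + 0.1 * ybar + rng.normal(0, 1, J)       # L2 variable
mis = rng.uniform(size=J) < 0.2                   # ~20% missing at L2

# Single-level regression of X2 on the arithmetic cluster means
X = np.column_stack([np.ones(J), ybar])
g, *_ = np.linalg.lstsq(X[~mis], x2[~mis], rcond=None)
sd = np.std(x2[~mis] - X[~mis] @ g)
x2[mis] = X[mis] @ g + rng.normal(0.0, sd, size=mis.sum())
```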
Standard FCS: Limitations
- Common Slope Assumption: Standard specification doesn’t allow within- vs. between-cluster associations to differ (unlike JM)
- Workaround: Add cluster means as predictors in L1 models (mimics JM)
- Assumes Equal Cluster Sizes (\(n_j\)): Using arithmetic means (\(\bar{X}\)) in L2 models ignores differential reliability due to varying \(n_j\). Biases tend to be small unless the ICC or \(n_j\) is very small (Grund et al., 2017)
FCS with Latent Variables: Overview
- Alternative Formulation: Addresses limitations of standard FCS (Enders et al., 2018; Keller & Enders, 2021)
- Advantages:
- Equivalent to Joint Model (JM) specification
- Naturally handles unequal cluster sizes (\(n_j\))
- Uses latent response variables for categorical data
- Uses latent group means (\(\mu_j\)) instead of arithmetic means (\(\bar{X}\))
FCS Latent: Within-Cluster Regressions
Reparameterizes within-cluster MVN distribution (\(\Sigma_{(W)}\)) as round-robin regressions
\[
Y_{1ij}^{(t)} = \mu_{1j} + \gamma_{11(W)}(Y_{2ij}^{(t-1)}-\mu_{2j}) + ... + r_{1ij(W)}
\] \[
Y_{2ij}^{(t)} = \mu_{2j} + \gamma_{12(W)}(Y_{3ij}^{(t-1)}-\mu_{3j}) + ... + r_{2ij(W)}
\] … (one equation per L1 variable)
- Regressors are centered at latent group means (\(\mu_j\))
- Intercept is the latent group mean of the variable being imputed
- Uses latent response variables (\(Y^*\)) for binary/ordinal data
FCS Latent: Between-Cluster Regressions
Reparameterizes between-cluster MVN distribution (\(\Sigma_{(B)}\)) as round-robin regressions
\[
\mu_{1j}^{(t)} = \mu_{1} + \gamma_{11(B)}(\mu_{2j}^{(t-1)}-\mu_{2}) + ... + \gamma_{51(B)}(Y_{6j}^{*(t-1)}-\mu_{6}) + r_{1j(B)}
\] … (one equation per latent mean \(\mu_j\) and L2 variable \(Y_k\))
- Models latent means (\(\mu_j\)) and L2 variables (\(Y_k\), possibly latent \(Y_k^*\))
- Regressors are centered at grand means (\(\mu\))
FCS Latent: Imputing Latent Means (\(\mu_j\))
- Latent means (\(\mu_j\)) are treated like missing data and updated each iteration
- Drawn from complex conditional posterior distribution: \[ f(\mu_{1j}|...) \propto \left[ \prod_{i=1}^{n_j} N(E(Y_{1ij}|...), \sigma_{Y_{1(W)}}^2) \right] \times N(E(\mu_{1j}|...), \sigma_{r_{1(B)}}^2) \]
- Distribution depends on L1 data within cluster \(j\) (weighted by \(n_j\)) AND L2 model
- Explicitly accounts for unequal cluster sizes \(n_j\)
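For a continuous variable with an empty between-cluster model, this reduces to a normal-normal update whose data side is weighted by \(n_j\); a sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(19)

# One latent group mean draw (normal-normal update). The L1 data enter
# through their cluster sum, weighted by n_j; the L2 model supplies the
# prior mean. Illustrative values, empty between-cluster model.
y_cluster = np.array([51.0, 55.0, 48.0, 53.0])   # cluster j's L1 scores
sig2_w, sig2_b, grand_mean = 16.0, 9.0, 50.0

n_j = y_cluster.size
prec = n_j / sig2_w + 1.0 / sig2_b
mean = (y_cluster.sum() / sig2_w + grand_mean / sig2_b) / prec
mu_j = rng.normal(mean, np.sqrt(1.0 / prec))
# Small n_j -> posterior shrinks toward the grand mean of 50
```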
FCS Limitation: Random Coefficients
- Standard FCS approaches are generally not suitable for models with random coefficients (slopes)
- Why? The way FCS imputes the L1 predictor involved in the random slope is often incompatible with the analysis model
Problem: Reverse Random Coefficient Imputation
- Consider analysis model: \(Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + \epsilon_{ij}\)
- A seemingly plausible FCS imputation model for \(X_{ij}\) might be: \[ X_{ij} = \gamma_{0j} + \gamma_{1j}Y_{ij} + r_{ij} \] (i.e., a random coefficient model predicting \(X\) from \(Y\))
- Incompatibility: These two models are logically inconsistent unless the random slope variance (\(\sigma^2_{\beta_1}\)) is zero. They cannot arise from the same underlying joint distribution
Issue: Heteroscedasticity Ignored
- Recall the correct conditional distribution for \(X_{1ij(mis)}\) from Bayesian perspective: \[ Var(X_{1ij}|Y_{ij}, ...) = \left( \frac{1}{\sigma_{r_{1(W)}}^2} + \frac{\boldsymbol{\beta_{1j}^2}}{\sigma_{\epsilon}^2} \right)^{-1} \]
- The variance depends on the cluster-specific \(\beta_{1j}^2\) (heteroscedastic)
- The reverse regression model incorrectly assumes a constant residual variance (\(\sigma_r^2\)) for \(X_{ij}\) across all clusters \(j\). It fails to capture the necessary heteroscedasticity
Consequences & Recommendation
- Bias: Using standard FCS (incl. reverse random coefficient models) when the analysis model has random slopes leads to bias, particularly underestimation of the random slope variance.
- Connection: Similar issue to “just-another-variable” imputation for interaction models (Sec 5.4). Random slope models are a type of interaction (\(X_{1ij} \times \beta_{1j}\))
- Recommendation: For random coefficient models, use Bayesian estimation or Model-Based MI derived from the correctly specified factored regression model. Avoid standard FCS
Maximum Likelihood (ML) Estimation
ML Overview
- Currently arguably less capable than Bayesian/MI for complex multilevel missing data problems (e.g., random slopes with missing predictors)
- Handles a more limited set of scenarios effectively
ML: Incomplete Outcomes Only
- Handled by standard mixed model software (e.g., lme4, nlme, SAS PROC MIXED, SPSS MIXED)
- If missingness depends only on observed predictors (MAR), analyzing the observed \(Y\) values gives valid ML estimates
- No imputation needed; equivalent to complete-data estimation with unbalanced cluster sizes (\(n_j\))
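A sketch using statsmodels' MixedLM (a Python analogue of the packages above) on simulated data: rows with missing \(Y\) are dropped and the mixed model is fit to the observed outcomes, which yields valid ML estimates under MAR given the model's predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(23)

# Simulated two-level data with ~20% of outcomes missing
J, n_j = 30, 20
school = np.repeat(np.arange(J), n_j)
x1 = rng.normal(size=J * n_j)
y = 50 + rng.normal(0, 2, J)[school] + 3 * x1 + rng.normal(0, 4, J * n_j)
y[rng.uniform(size=y.size) < 0.2] = np.nan

# Drop incomplete rows; clusters become unbalanced but estimation is valid
df = pd.DataFrame({"y": y, "x1": x1, "school": school}).dropna()
fit = sm.MixedLM.from_formula("y ~ x1", groups="school", data=df).fit()
print(fit.summary())
```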
ML: Incomplete Predictors - Challenges
- Many standard MLM packages default to listwise deletion if predictors are missing
- Assumes MCAR (often unrealistic)
- Reduces sample size, potentially discarding entire clusters
- Need specialized ML approaches
ML Approach 1: HLM Software
- Method: Shin & Raudenbush (2007, 2013) approach implemented in HLM software
- Assumptions: Incomplete predictors are multivariate normal (MVN)
- Limitations:
- No capacity for incomplete categorical predictors
- No capacity for random slopes between incomplete L1 variables
- Mechanism: Reparameterizes MLM into within/between MVN components (like JM). Uses EM algorithm to estimate means/covariances, transforms back to regression parameters
ML Approach 2: Multilevel SEM
- Method: Use Full Information Maximum Likelihood (FIML) within a Structural Equation Modeling framework
- Assumption: Typically assumes incomplete variables are MVN
- Capability: Can specify models with incomplete random slope predictors
- Limitation: Prone to substantial bias when predictors in random slopes are missing (Enders et al., 2018, 2020). Fails to account for heteroscedasticity
- Recommendation: Best suited for random intercept models when predictors are missing (assuming MVN)
Multilevel SEM Framework
- Recap SEM (Ch 3): Models individual data vector \(\mathbf{Y}_i \sim N(\boldsymbol{\mu}(\theta), \boldsymbol{\Sigma}(\theta))\). FIML uses available data for each case
- Multilevel SEM: Unit of analysis is the cluster (\(j\)). \(\mathbf{Y}_j\) is the vector of all L1 observations for cluster \(j\)
- \(\mathbf{Y}_j = (Y_{1j}, Y_{2j}, ..., Y_{n_j, j})'\)
- Assumes \(\mathbf{Y}_j\) follows a multivariate normal distribution with a structured mean vector \(\boldsymbol{\mu}_j(\theta)\) and covariance matrix \(\boldsymbol{\Sigma}_j(\theta)\) derived from the MLM parameters \(\theta\)
Summary
- Multilevel Data: Characterized by hierarchical nesting (e.g., measurements within persons, students within schools), leading to variance and covariance at multiple levels
- Models Discussed: The chapter covers random intercept, random coefficient (random slope), cross-level interaction, and three-level (e.g., growth curve) models. Centering strategies (latent group-mean, grand-mean) are crucial
- Core Missing Data Challenge: Handling missing predictors, especially Level-1 predictors involved in random slopes or interactions, is complex due to induced heteroscedasticity in their conditional distributions