Multilevel Missing Data
Lecture 9: April 30, 2025
Section 8.1: Chapter Overview
This chapter focuses on handling missing data in multilevel data sets.
What is Multilevel Data?
- Hierarchically Structured: Observations are nested within higher-level organizational units
- Ubiquitous: Found in many disciplines
- Nesting: Can involve two or more levels (e.g., Level 1 nested in Level 2, or Level 1 in Level 2 in Level 3)
Examples of Multilevel Data
- Repeated measurements nested within persons
- Students grouped in classrooms or schools
- Romantic partners paired within dyads
- Employees nested within organizations
- Survey respondents grouped in geographical regions
- Clients clustered within therapists
- Three-level example: Repeated measurements nested within students, students nested in schools
Data Analysis Focus
- Multilevel Regression Models: Specifically models with random effects
- These are very common tools for analyzing hierarchical data
- Recommended Resources:
- Gelman & Hill (2007)
- Hoffman (2015)
- Hox, Moerbeek, & Van de Schoot (2017)
- Raudenbush & Bryk (2002)
- Snijders & Bosker (2012)
- Verbeke & Molenberghs (2000)
Missing Data Handling in Multilevel Models
- Recent Development: Methods specifically for multilevel missing data are relatively new and important.
- Chapter Focus:
- Bayesian Estimation
- Model-Based Imputation (often handles these problems better than ML)
- Joint Model Imputation
- Fully Conditional Specification (FCS / MICE)
- Maximum Likelihood (ML)
- Core Idea: MCMC estimates factored regressions (a focal model plus supporting models) to build distributions for the missing values; imputations are predictions plus noise from a multilevel model (sketched below)
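To make the prediction-plus-noise idea concrete, here is a minimal sketch for one missing outcome under a random intercept model. All values are illustrative placeholders, not estimates from the chapter's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Current MCMC draws (illustrative placeholders, not real estimates)
beta0, beta1 = 50.0, 2.0   # fixed intercept and slope
b0_j = 1.5                 # random intercept for this case's cluster j
sigma_eps = 4.0            # within-cluster residual SD
x1_ij = 0.8                # observed level-1 predictor value

# Imputation = prediction from cluster j's regression line + normal noise
y_imp = (beta0 + b0_j) + beta1 * x1_ij + rng.normal(0.0, sigma_eps)
```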
Section 8.2: Random Intercept Regression Models
Example: Math Problem-Solving Study
- Data: Educational experiment data (from companion website)
- Structure: 2-Level Hierarchy
- Level 1: Students (\(n_j \approx 34\) per school)
- Level 2: Schools (\(J=29\))
- Design: Schools randomly assigned to experimental (new curriculum) or comparison (standard curriculum) condition
- Outcome (Y): End-of-year math problem-solving assessment (IRT scores, 37-65)
Key Feature: Multilevel Variation
- Variation and covariation exist at both levels
- Student scores vary within schools
- School average scores vary between schools
- Intraclass Correlation (ICC): Proportion of total variance at Level 2
- Example: ICC \(\approx 0.26\) for problem-solving scores means ~26% of variance is between schools (see the snippet after this list)
- Predictors: Level-1 predictors (like student math scores) can also have within- and between-group variation. Missing data methods need to preserve this.
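The ICC is just a ratio of variance components; a minimal sketch with made-up values that reproduce the \(\approx 0.26\) figure:

```python
# Hypothetical variance components, e.g., from an "empty" multilevel model
sigma2_between = 9.0    # level-2 (school) variance
sigma2_within = 26.0    # level-1 (student) variance

icc = sigma2_between / (sigma2_between + sigma2_within)
print(round(icc, 2))    # 0.26: ~26% of total variance is between schools
```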
Random Intercept Model: Concept
- Definition: A regression model where the intercept coefficient (\(\beta_{0j}\)) is allowed to vary randomly across Level-2 units (groups/clusters)
- Suitability: Amenable to various missing data handling methods (agnostic imputation, ML)
- Example Predictors:
- Level 1: Standardized math scores (STANMATH)
- Level 2: Teacher experience (TEACHEXP)
Level-1 Model (Within-Cluster)
Describes variation among students within the same school
\[
Y_{ij} = \beta_{0j} + \beta_{1}X_{1ij} + \epsilon_{ij}
\]
- \(Y_{ij}\): Outcome for student \(i\) in school \(j\)
- \(X_{1ij}\): Level-1 predictor value for student \(i\) in school \(j\)
- \(\beta_{0j}\): Random intercept for school \(j\) (Varies across schools)
- \(\beta_{1}\): Common (fixed) slope for \(X_1\) (Same across schools)
- \(\epsilon_{ij}\): Level-1 residual (within-school error) for student \(i\)
- Assumed \(\epsilon_{ij} \sim N(0, \sigma_{\epsilon}^2)\)
Level-2 Model (Between-Cluster)
Models the variation in the school-specific intercepts (\(\beta_{0j}\)).
\[
\beta_{0j} = \beta_{0} + \beta_{2}X_{2j} + b_{0j}
\]
- \(\beta_{0j}\): The random intercept (outcome in this model)
- \(\beta_{0}\): Grand mean intercept (average intercept across all schools)
- \(X_{2j}\): Level-2 predictor for school \(j\)
- \(\beta_{2}\): Effect of Level-2 predictor \(X_2\) on the intercept
- \(b_{0j}\): Level-2 residual (random effect) for school \(j\). Captures unexplained variation in intercepts.
- Assumed \(b_{0j} \sim N(0, \sigma_{b_0}^2)\)
Combined Model (Single Equation)
Substitute the Level-2 equation into the Level-1 equation:
\[
Y_{ij} = (\beta_{0} + b_{0j}) + \beta_{1}X_{1ij} + \beta_{2}X_{2j} + \epsilon_{ij}
\]
Alternatively:
\[
Y_{ij} = E(Y_{ij}|X_{1ij}, X_{2j}) + \epsilon_{ij}
\]
- Assumes \(Y_{ij} \sim N(E(Y_{ij}|X_{1ij}, X_{2j}), \sigma_{\epsilon}^2)\)
- Interpretation: Outcome scores \(Y_{ij}\) are normally distributed around the predicted values derived from their specific school’s regression line
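To make the combined equation concrete, the sketch below simulates data from it using made-up parameter values and the example's structure (29 schools, 34 students each):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate from the combined random intercept model (illustrative values)
J, n_j = 29, 34                          # schools, students per school
beta0, beta1, beta2 = 52.0, 3.0, 0.5     # fixed effects
sigma_b0, sigma_eps = 2.5, 4.0           # random intercept SD, residual SD

x2 = rng.normal(10.0, 3.0, size=J)       # level-2 predictor (one per school)
b0 = rng.normal(0.0, sigma_b0, size=J)   # school random effects b_0j
school = np.repeat(np.arange(J), n_j)    # cluster index for each student
x1 = rng.normal(0.0, 1.0, size=J * n_j)  # level-1 predictor

# Y_ij = (beta0 + b_0j) + beta1*X1_ij + beta2*X2_j + eps_ij
y = (beta0 + b0[school] + beta1 * x1 + beta2 * x2[school]
     + rng.normal(0.0, sigma_eps, size=J * n_j))
```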
Factored Regression Specification
- Concept: Expresses the joint distribution of all variables (\(Y, X_1, X_2\)) as a product of simpler conditional distributions using the probability chain rule. Essential for handling missing predictors
- Two Main Approaches:
- Sequential Specification
- Partially Factored Specification
1. Sequential Specification
Factorizes the joint distribution into a product of univariate conditional distributions:
\[
f(Y, X_1, X_2) = f(Y | X_1, X_2) \times f(X_1 | X_2) \times f(X_2)
\]
- \(f(Y | X_1, X_2)\): The focal analysis model
- \(f(X_1 | X_2)\): A model for the Level-1 predictor, conditional on Level-2 predictor(s)
- \(f(X_2)\): A model for the Level-2 predictor
- Important Ordering: Lower-level variables condition on higher-level variables
Sequential: Predictor Models
- Level-1 Predictor Model: A random intercept model for \(X_1\) \[ X_{1ij} = (\gamma_{01} + g_{01j}) + \gamma_{11}X_{2j} + r_{1ij} \]
- \(\gamma\): Regression coefficients
- \(g_{01j}\): Random intercept residual for \(X_1 \sim N(0, \sigma_{g_{01}}^2)\). Captures between-school variation in average \(X_1\)
- \(r_{1ij}\): Within-cluster residual for \(X_1 \sim N(0, \sigma_{r_1}^2)\). Captures within-school variation in \(X_1\)
- Level-2 Predictor Model: An “empty” (intercept-only) single-level model for \(X_2\); if \(X_2\) were the aggregate of a level-1 variable, the corresponding level-2 latent mean would take its place \[ X_{2j} = \gamma_{02} + r_{2j} \]
- \(\gamma_{02}\): Grand mean of \(X_2\)
- \(r_{2j}\): Between-cluster residual for \(X_2 \sim N(0, \sigma_{r_2}^2)\)
2. Partially Factored Specification
Factorizes into the focal model and a joint multivariate distribution for predictors:
\[
f(Y, X_1, X_2) = f(Y | X_1, X_2) \times f(X_1, X_2)
\]
- \(f(X_1, X_2)\): A two-part (Level-1 and Level-2) multivariate normal distribution
- Decomposition: Splits L1 predictor \(X_1\) into components: \[ X_{1ij} = \underbrace{\mu_{1}}_{\text{Grand Mean}} + \underbrace{(\mu_{1j} - \mu_{1})}_{\text{Between Cluster Dev.}} + \underbrace{(X_{1ij} - \mu_{1j})}_{\text{Within Cluster Dev.}} \]
- \(\mu_{1j}\) is the latent group mean for cluster \(j\)
Partially Factored: Predictor Models
- Within-Cluster Model for \(X_1\): \(X_1\) scores as deviations around “latent” (not directly observed) group means \[ X_{1ij} = \mu_{1j} + r_{1ij(W)} \quad \text{where} \quad X_{1ij} \sim N(\mu_{1j}, \sigma_{r_{1(W)}}^2) \]
- \(r_{1ij(W)}\): Within-cluster residual
- Between-Cluster Model for \((\mu_{1j}, X_{2j})\): Latent means (\(\mu_{1j}\)) and L2 scores (\(X_{2j}\)) are bivariate normal. \[ \begin{pmatrix} \mu_{1j} \\ X_{2j} \end{pmatrix} = \begin{pmatrix} \mu_{1} \\ \mu_{2} \end{pmatrix} + \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \end{pmatrix} \quad \text{where} \quad \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \end{pmatrix} \sim N_2 \left( \mathbf{0}, \Sigma_{(B)} \right) \]
- \(\Sigma_{(B)}\) is the \(2 \times 2\) between-cluster covariance matrix
- Note: Can also be parameterized via round-robin regressions. Partially factored is ideal for models with centered predictors
Distribution of Missing Values: Outcome \(Y\)
- Defined solely by the focal analysis model (Combined Model)
- The posterior predictive distribution for \(Y_{ij(mis)}\) is: \[ Y_{ij(mis)} \sim N(E(Y_{ij}|X_{1ij}, X_{2j}), \sigma_{\epsilon}^2) \]
- Imputation: Draw a random value from this normal distribution
- The mean \(E(Y_{ij}|...)\) depends on the specific cluster \(j\)’s intercept (\(\beta_{0j}\))
- The variance \(\sigma_{\epsilon}^2\) is the within-cluster residual variance
Distribution of Missing Values: Predictor \(X_1\)
- Drawn from the conditional distribution \(f(X_1 | Y, X_2)\)
- This distribution is proportional to the product of the focal model and the predictor model for \(X_1\): \[ f(X_1 | Y, X_2) \propto f(Y | X_1, X_2) \times f(X_1 | X_2) \]
- Combining the kernels yields a normal distribution: \[ f(X_{1ij(mis)} | Y_{ij}, X_{2j}) = N(E(X_{1ij}|...), Var(X_{1ij}|...)) \]
- Mean and Variance: Depend on parameters from both the focal model (\(\beta\)’s, \(\sigma_{\epsilon}^2\)) and the predictor model (\(\mu_{1j}\), \(\sigma_{r_{1(W)}}^2\)). Includes random effects (\(\beta_{0j}, \mu_{1j}\))
Distribution of Missing Values: Predictor \(X_2\)
- Drawn from the conditional distribution \(f(X_2 | Y, X_1)\)
- Using the partially factored specification: \[ f(X_2 | Y, X_1) \propto f(Y | X_1, X_2) \times f(X_2 | X_1) \]
- \(f(Y | X_1, X_2)\) is from the focal model; \(f(X_2 | X_1)\) is from the between-cluster predictor model
- Important: \(X_2\) is constant within cluster \(j\). The focal model’s contribution is repeated \(n_j\) times: \[ f(X_{2j(mis)}|...) \propto \left[ \prod_{i=1}^{n_j} N(E(Y_{ij}|...), \sigma_{\epsilon}^2) \right] \times N(E(X_{2j}|\mu_{1j}), \sigma_{r_{2(B)}}^2) \]
- Imputation: Often requires sampling methods like Metropolis-Hastings due to complexity
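A minimal sketch of one random-walk Metropolis-Hastings step for a single missing \(X_{2j}\) under the factorization above. The data, parameter values, and tuning SD are all illustrative; real samplers tune the proposal and iterate many times.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(x2, y, x1, b0_j, beta, sigma_eps, x2_mean, sigma_x2):
    """Log of the (unnormalized) full conditional for a missing X2_j:
    the focal model's likelihood repeated over the n_j cluster members,
    times the between-cluster predictor model's normal density."""
    resid_y = y - ((beta[0] + b0_j) + beta[1] * x1 + beta[2] * x2)
    log_focal = -0.5 * np.sum(resid_y**2) / sigma_eps**2
    log_prior = -0.5 * (x2 - x2_mean)**2 / sigma_x2**2
    return log_focal + log_prior

# Illustrative data for one cluster with n_j = 5 members
y = np.array([55.0, 58.0, 52.0, 60.0, 57.0])
x1 = np.array([0.2, 0.9, -0.4, 1.1, 0.5])
beta, b0_j, sigma_eps = (52.0, 3.0, 0.5), 1.0, 4.0
x2_mean, sigma_x2 = 10.0, 3.0

# One Metropolis-Hastings step with a random-walk proposal
x2_cur = 10.0
x2_prop = x2_cur + rng.normal(0.0, 1.0)   # tuning SD = 1.0 (arbitrary)
log_ratio = (log_target(x2_prop, y, x1, b0_j, beta, sigma_eps, x2_mean, sigma_x2)
             - log_target(x2_cur, y, x1, b0_j, beta, sigma_eps, x2_mean, sigma_x2))
if np.log(rng.uniform()) < log_ratio:
    x2_cur = x2_prop   # accept the proposal; otherwise keep the current value
```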
MCMC Algorithm: Overview
- Posterior Distribution: Complex function describing joint probability of parameters, random effects, latent means, and missing values given observed data
- Core Logic: Extends standard Bayesian MCMC
- Estimate one unknown quantity at a time (parameter, latent variable, missing value)
- Condition on the current values of all other quantities
- Full conditional distributions are available in the literature
MCMC Algorithm: Generic Recipe
- Assign starting values (parameters, random effects, missing values)
- Do for \(t=1\) to T iterations:
- Estimate focal model’s parameters, given everything else
- Estimate focal model’s random effects, given everything else
- Estimate each predictor model’s parameters, given everything else
- Estimate each predictor model’s random effects, given everything else
- Impute dependent variable (\(Y_{mis}\)) given focal model parameters
- Impute each predictor (\(X_{mis}\)) given focal and supporting models
- Repeat (a runnable toy version of this loop is sketched below)
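The sketch below is a toy version of the recipe for an *empty* random intercept model with missing outcomes: made-up data, Jeffreys-type priors, and no supporting predictor models, which real implementations would add as extra steps in the loop.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy data: J clusters of size n_j with ~20% of outcomes missing at random
J, n_j = 20, 15
true_b = rng.normal(0.0, 2.0, size=J)
school = np.repeat(np.arange(J), n_j)
y = 50.0 + true_b[school] + rng.normal(0.0, 4.0, size=J * n_j)
mis = rng.uniform(size=y.size) < 0.2
y[mis] = np.nan

# Starting values
y[mis] = np.nanmean(y)
beta0, b0, sig2_e, sig2_b = y.mean(), np.zeros(J), 1.0, 1.0

for t in range(2000):
    # 1) Focal model parameters, given everything else
    beta0 = rng.normal((y - b0[school]).mean(), np.sqrt(sig2_e / y.size))
    e = y - beta0 - b0[school]
    sig2_e = 1.0 / rng.gamma(y.size / 2.0, 2.0 / np.sum(e**2))
    sig2_b = 1.0 / rng.gamma(J / 2.0, 2.0 / np.sum(b0**2))
    # 2) Random effects, given everything else
    prec = n_j / sig2_e + 1.0 / sig2_b
    mean_b = (np.bincount(school, weights=y - beta0) / sig2_e) / prec
    b0 = rng.normal(mean_b, np.sqrt(1.0 / prec))
    # 3) Impute missing outcomes: prediction + noise
    y[mis] = (beta0 + b0[school[mis]]
              + rng.normal(0.0, np.sqrt(sig2_e), size=mis.sum()))
```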
Analysis Example: Setup (Math Study)
- Model: Random intercept regression \[ PROBSOLVE_{ij} = (\beta_{0} + b_{0j}) + \beta_{1}PRETEST_{ij} + \beta_{2}STANMATH_{ij} \] \[ + \beta_{3}FRLUNCH_{ij} + \beta_{4}TEACHEXP_{j} + \beta_{5}CONDITION_{j} + \epsilon_{ij} \]
- Predictors:
- L1: PRETEST (complete), STANMATH (7.3% missing), FRLUNCH (binary, 4.7% missing)
- L2: TEACHEXP (10.3% missing), CONDITION (complete)
- Outcome: PROBSOLVE (20.5% missing)
- Goal: Estimate treatment effect (\(\beta_5\)) controlling for covariates
NOTE: The example analysis is likely misspecified: the level-1 predictors enter without their corresponding level-2 (group mean) components, so within- and between-cluster effects are conflated.
Analysis Example: Factored Regression Options
- Sequential: Product of univariate distributions. Order: Incomplete L1, Complete L1, Incomplete L2, Complete L2. Use latent response for binary predictors (FRLUNCH).
- Partially Factored: Multivariate normal for predictors (PRETEST, STANMATH, FRLUNCH, TEACHEXP, CONDITION)
- Level 1 predictors decomposed
- Within-cluster model: Correlated deviations around latent group means (\(\mu_j\))
- Between-cluster model: Latent group means + L2 vars are multivariate normal
- Preferred here due to centering
Section 8.3: Random Coefficient Models
Example: Daily Diary Study (Health Psych)
- Data: \(J=132\) participants with chronic pain
- Structure: 2-Level Hierarchy (Repeated Measures)
- Level 1: Daily assessments (up to \(n_j=21\) days: mood, sleep, pain)
- Level 2: Persons
- Variables: L1 (daily), L2 (person-level demographics, psych variables like pain acceptance, catastrophizing)
- ICC Example: Positive Affect ICC \(\approx 0.63\). High between-person variation typical for repeated measures
Random Coefficient (Slope) Model: Concept
- Definition: A multilevel regression where the influence (slope) of one or more Level-1 predictors varies across Level-2 units
- Example:
- L1: Daily PAIN (\(X_1\)) predicting daily POSAFFECT (\(Y\))
- L2: Person-average PAIN (\(\mu_{1j}\)) and PAINACCEPT (\(X_2\)) predicting average POSAFFECT
- Key: The effect of daily PAIN on POSAFFECT can differ from person to person
Level-1 Model (Within-Person)
Describes daily variation within the same person.
\[
Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \mu_{1j}) + \epsilon_{ij}
\]
- \(Y_{ij}\): Positive affect for person \(j\) on day \(i\)
- \(X_{1ij}\): Pain rating for person \(j\) on day \(i\)
- \(\beta_{0j}\): Random intercept for person \(j\) (person \(j\)’s average affect)
- \(\beta_{1j}\): Random slope for person \(j\) (person \(j\)’s daily pain-affect association)
- \(\mu_{1j}\): Latent person-mean for \(X_1\) (pain). \(X_1\) is group-mean centered
- \(\epsilon_{ij}\): Level-1 residual (\(\sim N(0, \sigma_{\epsilon}^2)\))
Interpretation: Regression lines vary in intercept and slope across persons
Level-2 Model (Between-Person)
Models variation in person-specific intercepts (\(\beta_{0j}\)) and slopes (\(\beta_{1j}\))
\[
\beta_{0j} = \beta_{0} + \beta_{2}(\mu_{1j} - \mu_{1}) + \beta_{3}(X_{2j} - \mu_{2}) + b_{0j}
\] \[
\beta_{1j} = \beta_{1} + b_{1j}
\]
- \(\beta_0, \beta_1\): Grand mean intercept & slope
- \(\mu_{1j}\): Latent person-mean pain (predictor for \(\beta_{0j}\))
- \(X_{2j}\): Pain acceptance (predictor for \(\beta_{0j}\))
- \(\mu_1, \mu_2\): Grand means for predictors (used for centering)
- \(\beta_2, \beta_3\): Fixed effects of L2 predictors on intercept
- \(b_{0j}, b_{1j}\): L2 random effects (residuals for intercept & slope). Assumed \(\begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim N_2(\mathbf{0}, \Sigma_b)\)
Missing Data Challenge: Nonlinearity
- Random coefficient models feature a product term: \(X_{1ij} \times \beta_{1j}\)
- Level-1 predictor (\(X_1\)) multiplied by a Level-2 latent variable/random effect (\(\beta_{1j}\))
- Problem: Handling missingness when the L1 predictor (\(X_1\)) involved in the random slope is incomplete is challenging
- Some methods (e.g., current ML estimators) are prone to substantial bias
- Solution: Bayesian estimation / Model-based MI using factored regression is effective
Factored Regression Specification
- Partially Factored: Ideal due to centering \[ f(Y | X_1, X_2) \times f(X_1, X_2) \]
- The predictor model \(f(X_1, X_2)\) is specified as before
- The fact that the focal model \(f(Y|...)\) has a random slope doesn’t change the form of the predictor model specification
Distribution of Missing Values: Outcome \(Y\)
- Defined solely by the focal analysis model
- Posterior predictive distribution for \(Y_{ij(mis)}\): \[ Y_{ij(mis)} \sim N(E(Y_{ij}|X_{1ij}, X_{2j}), \sigma_{\epsilon}^2) \]
- Imputation: Draw from this normal distribution
- Mean \(E(Y_{ij}|...)\) incorporates person \(j\)’s specific intercept (\(\beta_{0j}\)) AND slope (\(\beta_{1j}\))
Distribution of Missing Values: Predictor \(X_1\)
- Drawn from conditional distribution \(f(X_1 | Y, X_2) \propto f(Y | X_1, X_2) \times f(X_1 | X_2)\)
- Combining kernels yields a normal distribution: \[ f(X_{1ij(mis)} | Y_{ij}, X_{2j}) = N(E(X_{1ij}|...), Var(X_{1ij}|...)) \]
- \(E(X_{1ij}|...) = Var(X_{1ij}|...) \times \left( \frac{\mu_{1j}}{\sigma_{r_{1(W)}}^2} + \frac{\beta_{1j}(Y_{ij} - (\beta_{0j} + \beta_{2}(\mu_{1j}-\mu_1) + ...))}{\sigma_{\epsilon}^2} \right)\)
- \(Var(X_{1ij}|...) = \left( \frac{1}{\sigma_{r_{1(W)}}^2} + \frac{\beta_{1j}^2}{\sigma_{\epsilon}^2} \right)^{-1}\)
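A direct implementation of the two formulas with illustrative values. The names `x1_conditional` and `pred_rest` are hypothetical; `pred_rest` collects every term in \(E(Y_{ij})\) except the \(\beta_{1j}X_{1ij}\) product. Note how the variance changes with the slope, which previews the next point.

```python
# Conditional distribution of a missing X1 under a random coefficient
# model, implementing the mean and variance formulas above.
def x1_conditional(y, mu_1j, beta1_j, pred_rest, sig2_r_w, sig2_eps):
    var = 1.0 / (1.0 / sig2_r_w + beta1_j**2 / sig2_eps)
    mean = var * (mu_1j / sig2_r_w + beta1_j * (y - pred_rest) / sig2_eps)
    return mean, var

# Two persons with different random slopes get different imputation
# variances (heteroscedasticity). All values are illustrative.
mu_1j, beta0_j = 4.0, 3.5
for beta1_j in (-0.2, -1.5):
    pred_rest = beta0_j - beta1_j * mu_1j   # centered model, no covariates
    m, v = x1_conditional(y=3.0, mu_1j=mu_1j, beta1_j=beta1_j,
                          pred_rest=pred_rest, sig2_r_w=1.0, sig2_eps=0.5)
    print(f"slope {beta1_j:5.2f}: mean {m:.2f}, variance {v:.2f}")
```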
Key Issue: Heteroscedasticity in \(X_1\) Imputations
- Look at the variance term for \(X_{1ij(mis)}\): \[ Var(X_{1ij}|Y_{ij}, X_{2j}) = \left( \frac{1}{\sigma_{r_{1(W)}}^2} + \frac{\boldsymbol{\beta_{1j}^2}}{\sigma_{\epsilon}^2} \right)^{-1} \]
- The variance (spread) of the imputation distribution depends on the person-specific random slope squared (\(\beta_{1j}^2\))
- Implication: Imputations for \(X_1\) should have different variances for different people. This is heteroscedastic
- Problem: Methods assuming multivariate normality for imputation (e.g., standard FCS, current ML in SEM) cannot capture this heteroscedasticity easily and are prone to bias (especially underestimating slope variance)
Analysis Example: Setup (Diary Study)
- Model: Random coefficient model for POSAFFECT \[ POSAFFECT_{ij} = \beta_{0j} + \beta_{1j}(PAIN_{ij} - \mu_{1j}) + \beta_{2}(SLEEP_{ij} - \mu_{2}) \] \[ + \beta_{3}(\mu_{1j} - \mu_{1}) + \beta_{4}(PAINACCEPT_{j} - \mu_{4}) + \beta_{5}FEMALE_{j} + \epsilon_{ij} \]
- Predictors:
- L1: PAIN (random slope, latent group-mean centered), SLEEP (fixed slope, grand-mean centered)
- L2: Mean PAIN (\(\mu_{1j}\)), PAINACCEPT, FEMALE (binary)
- Factored Regression: Partially factored; the predictor model \(f(\ldots)\) includes SLEEP, PAINACCEPT, and FEMALE\(^*\) (latent response for the binary FEMALE)
Section 8.4: Multilevel Interaction Effects
Example: Employee Empowerment Study
- Data: \(N=630\) employees from \(J=105\) workgroups/teams (\(n_j=6\))
- Structure: 2-Level Hierarchy
- Level 1: Employees
- Level 2: Workgroups/Teams
- Outcome: Employee Empowerment (\(Y\)). ICC \(\approx 0.11\)
- Variables: LMX (Leader-Member Exchange), Gender (MALE), Team Cohesion, Team Leadership Climate
Cross-Level Interaction: Concept
- Definition: The influence (slope) of a Level-1 predictor (\(X_1\)) on \(Y\) is moderated by a Level-2 predictor (\(X_2\))
- Often involves random slopes for \(X_1\), as the interaction term helps explain why slopes vary across groups
- Example:
- L1 Predictor: LMX (within-team relationship quality)
- L2 Moderator: CLIMATE (team leadership climate)
- Interaction: Does the LMX-Empowerment relationship depend on team Climate?
Model Specification
\[
Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \mu_{1j}) + \beta_{2}(X_{2ij}^* - \mu_{2j}) + \beta_{3}(X_{3j} - \mu_{3})
\] \[
+ \beta_{4}(X_{4j} - \mu_{4}) + \boldsymbol{\beta_{5}(X_{1ij} - \mu_{1j})(X_{4j} - \mu_{4})} + \epsilon_{ij}
\]
- \(Y_{ij}\): Empowerment
- \(X_{1ij}\): LMX (L1 predictor, random slope \(\beta_{1j}\), latent group-mean centered)
- \(X_{2ij}^*\): MALE (L1 covariate, latent group-mean centered)
- \(X_{3j}\): COHESION (L2 covariate, grand-mean centered)
- \(X_{4j}\): CLIMATE (L2 moderator, grand-mean centered)
- \(\beta_5\): Cross-level interaction coefficient
Factored Regression Specification
- Partially Factored: Preferred due to latent mean centering \[ f(Y | Xs, Interaction) \times f(X_1, X_2^*, X_3, X_4) \]
- Predictor Model \(f(...)\):
- L1 vars (\(X_1, X_2^*\)) decomposed into within/between components
- Within-Cluster: Models \((X_{1ij}, X_{2ij}^*)\) around latent group means \((\mu_{1j}, \mu_{2j})\) using \(\Sigma_{(W)}\). \(X_2^*\) is latent response for MALE
- Between-Cluster: Models \((\mu_{1j}, \mu_{2j}, X_{3j}, X_{4j})\) around grand means using \(\Sigma_{(B)}\)
Distribution of Missing Values
- Similar to Random Coefficient Model:
- Missing \(Y\): Drawn from focal model, involves \(\beta_{0j}, \beta_{1j}\)
- Missing Predictors (e.g., \(X_1\) = LMX):
- Conditional distribution depends on focal and predictor models
- Heteroscedasticity: Variance depends on random slope (\(\beta_{1j}^2\)) and interaction term involving \(\beta_5\)
- Requires methods that handle this (Bayesian/Model-Based MI). Metropolis-Hastings useful
Example: Educational Study Revisited (3 Levels)
- Data: Cluster-randomized trial from Sec 8.2, but now using longitudinal data
- Structure: 3-Level Hierarchy
- Level 1: Measurement Occasions (\(t=1...7\), monthly)
- Level 2: Students (\(i\))
- Level 3: Schools (\(j\))
- Outcome: Problem Solving score at each occasion (\(PROBSOLVE_{tij}\))
- Missing Data: Increased over time (~20% by final wave). Planned missingness in control group
Longitudinal Growth Curve Model
- Concept: Type of MLM where repeated measures are modeled as a function of time
- Time Predictor (MONTH): Codes passage of time (L1 predictor)
- Example coding: Relative to final assessment (\(MONTH = -6, -5, ..., 0\)). Intercept is end-of-year score
- Time Variation: MONTH only varies within students (L1). It is constant across students and schools (no L2 or L3 variance)
Level-1 Model (Within-Student)
Models individual change trajectories.
\[
PROBSOLVE_{tij} = \beta_{0ij} + \beta_{1ij}(MONTH_{tij}) + \epsilon_{tij}
\]
- \(\beta_{0ij}\): Student \(i\)’s expected score at MONTH=0 (end-of-year)
- \(\beta_{1ij}\): Student \(i\)’s linear rate of change per month
- Both \(\beta_{0ij}\) and \(\beta_{1ij}\) are random effects varying across students (L2) and schools (L3)
- \(\epsilon_{tij}\): Time-specific residual (\(\sim N(0, \sigma_{\epsilon}^2)\))
Level-2 Model (Between-Student)
Models student-level variation in intercepts (\(\beta_{0ij}\)) and slopes (\(\beta_{1ij}\))
\[
\beta_{0ij} = \beta_{0j} + \beta_{2}(STANMATH_{ij} - \mu_{2}) + \beta_{3}(FRLUNCH_{ij}^* - \mu_{3}) + b_{0ij}
\] \[
\beta_{1ij} = \beta_{1j} + b_{1ij}
\]
- \(\beta_{0j}, \beta_{1j}\): School \(j\)’s average intercept & slope
- \(STANMATH_{ij}, FRLUNCH_{ij}^*\): Student-level (L2) covariates predicting intercept (grand-mean centered)
- \(b_{0ij}, b_{1ij}\): Student-level random effects (deviations from school means). \(\sim N_2(\mathbf{0}, \Sigma_{b(L2)})\)
Level-3 Model (Between-School)
Models school-level variation in intercepts (\(\beta_{0j}\)) and slopes (\(\beta_{1j}\))
\[
\beta_{0j} = \beta_{0} + \beta_{4}(TEACHEXP_{j} - \mu_{4}) + \beta_{5}(CONDITION_{j}) + b_{0j}
\] \[
\beta_{1j} = \beta_{1} + \beta_{6}(CONDITION_{j}) + b_{1j}
\]
- \(\beta_0, \beta_1\): Grand mean intercept & slope (for control group, CONDITION=0)
- \(TEACHEXP_j, CONDITION_j\): School-level (L3) covariates
- \(\beta_5\): Intercept difference for intervention group (at MONTH=0)
- \(\beta_6\): Slope difference for intervention group (Treatment \(\times\) Time interaction)
- \(b_{0j}, b_{1j}\): School-level random effects. \(\sim N_2(\mathbf{0}, \Sigma_{b(L3)})\)
Factored Regression Specification
- Partially Factored: \(f(Y | Xs) \times f(Xs)\)
- Decomposition: L1 predictors decomposed into L1(within-L2), L2(within-L3), and L3 components \[ X_{1tij} = \mu_1 + (\mu_{1j}-\mu_1) + (\mu_{1ij}-\mu_{1j}) + (X_{1tij}-\mu_{1ij}) \]
- (Note: This formula is conceptual; specific models depend on variance components)
Factored Regression: Predictor Models
- MONTH Predictor: Only L1 variation. Modeled as deviation from grand mean. No L2/L3 random effects needed if complete \[ MONTH_{tij} = \mu_{1} + r_{1tij(W)} \]
- Level-2 Predictor Model: Models L2 vars (STANMATH, FRLUNCH*) around L3 latent means (\(\mu_{2j}, \mu_{3j}\)) using \(\Sigma_{(L2)}\) \[ X_{ij(L2)} \sim N_2(\boldsymbol{\mu}_j, \Sigma_{(L2)}) \]
- Level-3 Predictor Model: Models L3 latent means (\(\mu_{2j}, \mu_{3j}\)) + L3 vars (TEACHEXP, CONDITION*) around grand means using \(\Sigma_{(L3)}\) \[ X_{j(L3)} \sim N_4(\boldsymbol{\mu}, \Sigma_{(L3)}) \]
Multiple Imputation Strategies
Recap: Agnostic vs. Model-Based MI
- Agnostic Imputation: Imputation model differs from analysis model (e.g., Joint Modeling, FCS)
- Multilevel extensions exist
- Generally suitable for random intercept models
- Model-Based Imputation: Imputation model is tailored to (or is the same as) the analysis model (e.g., from Bayesian estimation)
- Essential for models with random coefficients, interactions, or other nonlinearities to avoid bias
Caution: Single-Level MI on Multilevel Data
- Applying standard (single-level) JM or FCS to multilevel data is problematic
- Why? These methods ignore the data hierarchy (nesting)
- They produce imputed values with no between-cluster variation
- Leads to biased estimates (e.g., attenuated L2 effects, incorrect SEs)
- Use multilevel versions of JM/FCS or model-based approaches instead
Fixed Effect Imputation: Concept
- An alternative strategy in some situations
- Method:
- Create dummy variables (\(D_k\)) for each Level-2 group (\(k=1...J\))
- Include these dummy variables as predictors in a single-level imputation model
- Effectively treats group membership as a fixed effect during imputation
Fixed Effect Imputation: When to Consider?
- May be useful when:
- The number of clusters (\(J\)) is very small (makes MLM estimation hard)
- Level-2 groups are not considered a random sample from a larger population
- Between-cluster differences are viewed as nuisance variation to be controlled, not phenomena of interest
Fixed Effect Imputation: Model
Example for imputing outcome \(Y\) in a random intercept context:
\[
Y_{ij} = \sum_{k=1}^{J} \gamma_{k} D_{kj} + \gamma_{J+1} X_{1ij} + \epsilon_{ij}
\]
- \(D_{kj}=1\) if unit \(i\) is in group \(k\), 0 otherwise
- \(\gamma_k\) is the estimated intercept for group \(k\)
- Uses absolute coding (all \(J\) dummies, no overall intercept \(\beta_0\))
- Note: Excludes L2 predictors (e.g., \(X_{2j}\)) because the dummy codes account for all between-group variance in \(Y\)
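A minimal sketch of the dummy-code strategy on simulated data. OLS stands in for a proper Bayesian draw of the \(\gamma\) coefficients, so this only illustrates the structure of the imputation model, not a full MI implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated two-level data with ~15% of outcomes missing
J, n_j = 10, 8
school = np.repeat(np.arange(J), n_j)
x1 = rng.normal(size=J * n_j)
y = 50.0 + rng.normal(0, 2, J)[school] + 3.0 * x1 + rng.normal(0, 4, J * n_j)
mis = rng.uniform(size=y.size) < 0.15

D = np.eye(J)[school]              # J dummy columns, one per cluster
X = np.column_stack([D, x1])       # absolute coding: no overall intercept
gamma, *_ = np.linalg.lstsq(X[~mis], y[~mis], rcond=None)
sigma = np.std(y[~mis] - X[~mis] @ gamma)

# Impute missing Y as prediction + noise from the fixed effect model
y_imp = y.copy()
y_imp[mis] = X[mis] @ gamma + rng.normal(0.0, sigma, size=mis.sum())
```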
Fixed Effect Imputation: Limitations
- Computationally simple
- Potential Biases:
- Can overcompensate for group differences, exaggerating between-group variation (Lüdtke et al., 2017). Bias worse with low ICC or small \(n_j\).
- May produce positively biased standard errors and inaccurate confidence intervals (Andridge, 2011; van Buuren, 2011)
- Practical Limitation: Difficult to extend beyond random intercept models. Preserving random slopes would require many product terms (\(D_{kj} \times X_{1ij}\))
Joint Model (JM) Imputation
JM Imputation: Overview
- Framework: Uses a multivariate normal distribution for continuous & latent response variables, decomposed into within- & between-cluster parts
- Approach: Often uses an “empty” multivariate model (all variables as outcomes) for imputation
- Handles: Missing data at L1/L2, categorical variables (via latent response)
- Standard JM Limitation: Assumes common within-cluster covariance (\(\Sigma_{(W)}\)) across groups. Suitable for random intercept models, but biased for random slope models
JM Model: Within-Cluster
Models L1 scores as correlated deviations around latent group means \(\boldsymbol{\mu}_j\)
\[
\mathbf{Y}_{ij(W)} = \begin{pmatrix} PROBSOLVE_{ij} \\ PRETEST_{ij} \\ STANMATH_{ij} \\ FRLUNCH_{ij}^* \end{pmatrix} = \boldsymbol{\mu}_j + \mathbf{r}_{ij(W)}
\]
- Assumes \(\mathbf{Y}_{ij(W)} \sim N_4(\boldsymbol{\mu}_j, \boldsymbol{\Sigma_{(W)}})\)
- \(\boldsymbol{\Sigma_{(W)}}\) is the common within-cluster covariance matrix
- \(FRLUNCH^*\) is latent response variable (variance fixed to 1)
- Defines posterior predictive distribution for L1 missing values
JM Model: Between-Cluster
Models latent group means (\(\boldsymbol{\mu}_j\)) and L2 variables
\[
\mathbf{Y}_{j(B)} = \begin{pmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \mu_{4j} \\ TEACHEXP_j \\ CONDITION_j^* \end{pmatrix} = \boldsymbol{\mu} + \mathbf{r}_{j(B)}
\]
- Assumes \(\mathbf{Y}_{j(B)} \sim N_6(\boldsymbol{\mu}, \boldsymbol{\Sigma_{(B)}})\)
- \(\boldsymbol{\Sigma_{(B)}}\) is the between-cluster covariance matrix
- \(CONDITION^*\) is latent response variable (variance fixed to 1)
- Defines posterior predictive distribution for L2 missing values & latent means
JM MCMC Algorithm
Generates imputations using Gibbs sampling:
- Initialize: Parameters (\(\boldsymbol{\mu}, \Sigma_{(W)}, \Sigma_{(B)}\)), latent means (\(\boldsymbol{\mu}_j\)), missing values
- Iterate (t=1…T):
- Estimate grand means \(\boldsymbol{\mu}\)
- Estimate latent group means \(\boldsymbol{\mu}_j\)
- Estimate between-cluster covariance \(\boldsymbol{\Sigma_{(B)}}\)
- Estimate within-cluster covariance \(\boldsymbol{\Sigma_{(W)}}\)
- Impute missing values (using conditional MVN distributions derived from current parameters)
- Repeat, saving M imputed data sets (via M parallel chains or burn-in/thinning within a single chain)
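The final step draws the missing elements of each row from the conditional multivariate normal given that row's observed elements. A sketch using the standard MVN conditioning formulas, with an illustrative \(\boldsymbol{\mu}_j\) and \(\Sigma_{(W)}\):

```python
import numpy as np

rng = np.random.default_rng(9)

def draw_missing_given_observed(y, mis, mu, Sigma, rng):
    """Draw missing elements of one row from the conditional MVN
    N(mu, Sigma) given its observed elements (standard conditioning
    formulas). Illustrative sketch of the JM imputation step."""
    o, m = ~mis, mis
    S_oo = Sigma[np.ix_(o, o)]
    S_mo = Sigma[np.ix_(m, o)]
    S_mm = Sigma[np.ix_(m, m)]
    w = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu[m] + w @ (y[o] - mu[o])
    cond_cov = S_mm - w @ S_mo.T
    return rng.multivariate_normal(cond_mean, cond_cov)

# One student's within-cluster vector (PROBSOLVE, PRETEST, STANMATH),
# with STANMATH missing; mu_j and Sigma_W are current MCMC values.
mu_j = np.array([55.0, 50.0, 0.0])
Sigma_W = np.array([[16.0, 8.0, 2.0],
                    [8.0, 25.0, 3.0],
                    [2.0, 3.0, 1.0]])
y_row = np.array([58.0, 54.0, np.nan])
mis = np.isnan(y_row)
y_row[mis] = draw_missing_given_observed(y_row, mis, mu_j, Sigma_W, rng)
```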
JM Extension: Random Within-Cluster Covariances
- Motivation: Standard JM assumes common \(\Sigma_{(W)}\), problematic for random slopes
- Solution (Yucel, 2011): Allow \(\Sigma_{(W)}\) to vary across L2 units (\(j\)) \[ \mathbf{Y}_{ij(W)} \sim N_k(\boldsymbol{\mu}_j, \boldsymbol{\Sigma_{j(W)}}) \]
- Between-Cluster Model: Remains the same
- Modeling \(\Sigma_{j(W)}\): Treat each \(\Sigma_{j(W)}\) as drawn from a common Wishart distribution (multivariate generalization of \(\chi^2\)).
- Wishart defined by pooled degrees of freedom & scale matrix
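A sketch of the idea using SciPy's Wishart distribution with illustrative values; dividing the scale matrix by the degrees of freedom centers the cluster-specific draws on a common matrix, since a Wishart with parameters \(\nu\) and \(S\) has mean \(\nu S\).

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(13)

# Common (pooled) scale matrix and degrees of freedom, illustrative values
scale = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
nu = 10.0

# Cluster-specific within-cluster covariance matrices Sigma_j(W):
# E[Wishart(nu, S)] = nu * S, so divide the scale by nu to center on `scale`
Sigma_j = [wishart.rvs(df=nu, scale=scale / nu, random_state=rng)
           for _ in range(3)]   # three clusters' draws
```

Larger pooled degrees of freedom pull the cluster-specific matrices closer to the common scale, which is how the model "borrows strength" across clusters.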
Random Covariance MCMC
- Similar to standard JM MCMC, but adds steps within the loop:
- Estimate pooled scale matrix (average \(\Sigma_{j(W)}\) structure)
- Estimate pooled degrees of freedom (related to average \(n_j\))
- Estimate cluster-specific \(\boldsymbol{\Sigma_{j(W)}}\) based on its data and borrowing strength via the Wishart prior
- Allows imputation even if variables are fully missing within some clusters
Fully Conditional Specification (FCS / MICE) Imputation
FCS Imputation: Overview
- Concept: Extends single-level FCS (MICE) to multilevel data (van Buuren, 2011)
- Method: Imputes variables one at a time using univariate regression models, conditional on all other variables. Cycles through variables iteratively
- Handles: L1, L2, L3 variables. Different model types (linear, logistic, etc.) per variable
- Standard FCS Limitation: Like JM, generally limited to random intercept models due to how random slopes are handled (or not handled)
Standard FCS: L1 Imputation Models
Uses random intercept regressions for L1 variables. Example:
- Continuous \(Y_{ij}\) (e.g., PROBSOLVE): \[ Y_{ij}^{(t)} = \gamma_{01j} + \gamma_{11}X_{1ij}^{(t-1)} + ... + r_{1ij} \]
- Binary \(Y_{ij}\) (e.g., FRLUNCH): \[ ln\left(\frac{Pr(Y_{ij}^{(t)}=1)}{1-Pr(Y_{ij}^{(t)}=1)}\right) = \gamma_{03j} + \gamma_{13}X_{1ij}^{(t)} + ... \]
- Impute \(Y_{ij(mis)}^{(t)}\) by drawing from posterior predictive distribution (Normal or Binomial). \((t)\) indicates iteration
Standard FCS: L2 Imputation Model
Uses single-level regression with cluster means (\(\bar{X}\)) of L1 vars as predictors
\[
X_{2j}^{(t)} = \gamma_{04} + \gamma_{14}\bar{Y}_{1j}^{(t)} + \gamma_{24}\bar{X}_{1j}^{(t)} + ... + \gamma_{54}X_{L2,j} + r_{04j}
\]
- \(\bar{Y}_{1j}^{(t)}, \bar{X}_{1j}^{(t)}\) are arithmetic averages of (imputed) L1 variables within cluster \(j\) at iteration \(t\)
- Impute \(X_{2j(mis)}^{(t)}\) by drawing from its posterior predictive distribution
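A minimal sketch of this level-2 step on simulated data: form arithmetic cluster means of a (currently imputed) L1 variable, then impute the L2 variable as prediction plus noise. OLS again stands in for a Bayesian parameter draw.

```python
import numpy as np

rng = np.random.default_rng(17)

# Simulated data: one imputed L1 variable and one incomplete L2 variable
J, n_j = 25, 10
school = np.repeat(np.arange(J), n_j)
y1 = rng.normal(50, 6, J * n_j)                   # imputed L1 variable
ybar = np.bincount(school, weights=y1) / np.bincount(school)

x2 = 5.0 + 0.1 * ybar + rng.normal(0, 1, J)       # L2 variable
mis = rng.uniform(size=J) < 0.2                   # ~20% missing at L2

# Single-level regression of X2 on the arithmetic cluster means
X = np.column_stack([np.ones(J), ybar])
g, *_ = np.linalg.lstsq(X[~mis], x2[~mis], rcond=None)
sd = np.std(x2[~mis] - X[~mis] @ g)
x2[mis] = X[mis] @ g + rng.normal(0.0, sd, size=mis.sum())
```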
Standard FCS: Limitations
- Common Slope Assumption: Standard specification doesn’t allow within- vs. between-cluster associations to differ (unlike JM)
- Workaround: Add cluster means as predictors in L1 models (mimics JM)
- Assumes Equal Cluster Sizes (\(n_j\)): Using arithmetic means (\(\bar{X}\)) in L2 models ignores differential reliability due to varying \(n_j\). Biases tend to be small unless the ICC or \(n_j\) is very small (Grund et al., 2017)
FCS with Latent Variables: Overview
- Alternative Formulation: Addresses limitations of standard FCS (Enders et al., 2018; Keller & Enders, 2021)
- Advantages:
- Equivalent to Joint Model (JM) specification
- Naturally handles unequal cluster sizes (\(n_j\))
- Uses latent response variables for categorical data
- Uses latent group means (\(\mu_j\)) instead of arithmetic means (\(\bar{X}\))
FCS Latent: Within-Cluster Regressions
Reparameterizes within-cluster MVN distribution (\(\Sigma_{(W)}\)) as round-robin regressions
\[
Y_{1ij}^{(t)} = \mu_{1j} + \gamma_{11(W)}(Y_{2ij}^{(t-1)}-\mu_{2j}) + ... + r_{1ij(W)}
\] \[
Y_{2ij}^{(t)} = \mu_{2j} + \gamma_{12(W)}(Y_{3ij}^{(t-1)}-\mu_{3j}) + ... + r_{2ij(W)}
\] … (one equation per L1 variable)
- Regressors are centered at latent group means (\(\mu_j\))
- Intercept is the latent group mean of the variable being imputed
- Uses latent response variables (\(Y^*\)) for binary/ordinal data
FCS Latent: Between-Cluster Regressions
Reparameterizes between-cluster MVN distribution (\(\Sigma_{(B)}\)) as round-robin regressions
\[
\mu_{1j}^{(t)} = \mu_{1} + \gamma_{11(B)}(\mu_{2j}^{(t-1)}-\mu_{2}) + ... + \gamma_{51(B)}(Y_{6j}^{*(t-1)}-\mu_{6}) + r_{1j(B)}
\] … (one equation per latent mean \(\mu_j\) and L2 variable \(Y_k\))
- Models latent means (\(\mu_j\)) and L2 variables (\(Y_k\), possibly latent \(Y_k^*\))
- Regressors are centered at grand means (\(\mu\))
FCS Latent: Imputing Latent Means (\(\mu_j\))
- Latent means (\(\mu_j\)) are treated like missing data and updated each iteration
- Drawn from complex conditional posterior distribution: \[ f(\mu_{1j}|...) \propto \left[ \prod_{i=1}^{n_j} N(E(Y_{1ij}|...), \sigma_{Y_{1(W)}}^2) \right] \times N(E(\mu_{1j}|...), \sigma_{r_{1(B)}}^2) \]
- Distribution depends on L1 data within cluster \(j\) (weighted by \(n_j\)) AND L2 model
- Explicitly accounts for unequal cluster sizes \(n_j\)
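For a continuous variable with an empty between-cluster model, this reduces to a normal-normal update whose data side is weighted by \(n_j\); a sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(19)

# One latent group mean draw (normal-normal update). The L1 data enter
# through their cluster sum, weighted by n_j; the L2 model supplies the
# prior mean. Illustrative values, empty between-cluster model.
y_cluster = np.array([51.0, 55.0, 48.0, 53.0])   # cluster j's L1 scores
sig2_w, sig2_b, grand_mean = 16.0, 9.0, 50.0

n_j = y_cluster.size
prec = n_j / sig2_w + 1.0 / sig2_b
mean = (y_cluster.sum() / sig2_w + grand_mean / sig2_b) / prec
mu_j = rng.normal(mean, np.sqrt(1.0 / prec))
# Small n_j -> posterior shrinks toward the grand mean of 50
```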
FCS Limitation: Random Coefficients
- Standard FCS approaches are generally not suitable for models with random coefficients (slopes)
- Why? The way FCS imputes the L1 predictor involved in the random slope is often incompatible with the analysis model
Problem: Reverse Random Coefficient Imputation
- Consider analysis model: \(Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + \epsilon_{ij}\)
- A seemingly plausible FCS imputation model for \(X_{ij}\) might be: \[ X_{ij} = \gamma_{0j} + \gamma_{1j}Y_{ij} + r_{ij} \] (i.e., a random coefficient model predicting \(X\) from \(Y\))
- Incompatibility: These two models are logically inconsistent unless the random slope variance (\(\sigma^2_{\beta_1}\)) is zero. They cannot arise from the same underlying joint distribution
Issue: Heteroscedasticity Ignored
- Recall the correct conditional distribution for \(X_{1ij(mis)}\) from Bayesian perspective: \[ Var(X_{1ij}|Y_{ij}, ...) = \left( \frac{1}{\sigma_{r_{1(W)}}^2} + \frac{\boldsymbol{\beta_{1j}^2}}{\sigma_{\epsilon}^2} \right)^{-1} \]
- The variance depends on the cluster-specific \(\beta_{1j}^2\) (heteroscedastic)
- The reverse regression model incorrectly assumes a constant residual variance (\(\sigma_r^2\)) for \(X_{ij}\) across all clusters \(j\). It fails to capture the necessary heteroscedasticity
Consequences & Recommendation
- Bias: Using standard FCS (incl. reverse random coefficient models) when the analysis model has random slopes leads to bias, particularly underestimation of the random slope variance.
- Connection: Similar issue to “just-another-variable” imputation for interaction models (Sec 5.4). Random slope models are a type of interaction (\(X_{1ij} \times \beta_{1j}\))
- Recommendation: For random coefficient models, use Bayesian estimation or Model-Based MI derived from the correctly specified factored regression model. Avoid standard FCS
Maximum Likelihood (ML) Estimation
ML Overview
- Currently arguably less capable than Bayesian/MI for complex multilevel missing data problems (e.g., random slopes with missing predictors)
- Handles a more limited set of scenarios effectively
ML: Incomplete Outcomes Only
- Handled by standard mixed model software (e.g., lme4, nlme, SAS PROC MIXED, SPSS MIXED)
- If missingness depends only on observed predictors (MAR), analyzing the observed \(Y\) values gives valid ML estimates
- No imputation needed; equivalent to complete-data estimation with unbalanced cluster sizes (\(n_j\))
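A sketch using statsmodels' MixedLM (a Python analogue of the packages above) on simulated data: rows with missing \(Y\) are dropped and the mixed model is fit to the observed outcomes, which yields valid ML estimates under MAR given the model's predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(23)

# Simulated two-level data with ~20% of outcomes missing
J, n_j = 30, 20
school = np.repeat(np.arange(J), n_j)
x1 = rng.normal(size=J * n_j)
y = 50 + rng.normal(0, 2, J)[school] + 3 * x1 + rng.normal(0, 4, J * n_j)
y[rng.uniform(size=y.size) < 0.2] = np.nan

# Drop incomplete rows; clusters become unbalanced but estimation is valid
df = pd.DataFrame({"y": y, "x1": x1, "school": school}).dropna()
fit = sm.MixedLM.from_formula("y ~ x1", groups="school", data=df).fit()
print(fit.summary())
```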
ML: Incomplete Predictors - Challenges
- Many standard MLM packages default to listwise deletion if predictors are missing
- Assumes MCAR (often unrealistic)
- Reduces sample size, potentially discarding entire clusters
- Need specialized ML approaches
ML Approach 1: HLM Software
- Method: Shin & Raudenbush (2007, 2013) approach implemented in HLM software
- Assumptions: Incomplete predictors are multivariate normal (MVN)
- Limitations:
- No capacity for incomplete categorical predictors
- No capacity for random slopes between incomplete L1 variables
- Mechanism: Reparameterizes MLM into within/between MVN components (like JM). Uses EM algorithm to estimate means/covariances, transforms back to regression parameters
ML Approach 2: Multilevel SEM
- Method: Use Full Information Maximum Likelihood (FIML) within a Structural Equation Modeling framework
- Assumption: Typically assumes incomplete variables are MVN
- Capability: Can specify models with incomplete random slope predictors
- Limitation: Prone to substantial bias when predictors in random slopes are missing (Enders et al., 2018, 2020). Fails to account for heteroscedasticity
- Recommendation: Best suited for random intercept models when predictors are missing (assuming MVN)
Multilevel SEM Framework
- Recap SEM (Ch 3): Models individual data vector \(\mathbf{Y}_i \sim N(\boldsymbol{\mu}(\theta), \boldsymbol{\Sigma}(\theta))\). FIML uses available data for each case
- Multilevel SEM: Unit of analysis is the cluster (\(j\)). \(\mathbf{Y}_j\) is the vector of all L1 observations for cluster \(j\)
- \(\mathbf{Y}_j = (Y_{1j}, Y_{2j}, ..., Y_{n_j, j})'\)
- Assumes \(\mathbf{Y}_j\) follows a multivariate normal distribution with a structured mean vector \(\boldsymbol{\mu}_j(\theta)\) and covariance matrix \(\boldsymbol{\Sigma}_j(\theta)\) derived from the MLM parameters \(\theta\)
Summary
- Multilevel Data: Characterized by hierarchical nesting (e.g., measurements within persons, students within schools), leading to variance and covariance at multiple levels
- Models Discussed: The chapter covers random intercept, random coefficient (random slope), cross-level interaction, and three-level (e.g., growth curve) models. Centering strategies (latent group-mean, grand-mean) are crucial
- Core Missing Data Challenge: Handling missing predictors, especially Level-1 predictors involved in random slopes or interactions, is complex due to induced heteroscedasticity in their conditional distributions