Missing Data Methods

A Brief Introduction

Jonathan Templin

About This Lecture

Note

This lecture is a broad overview only. Missing data is a deep field with a rich methodological literature. Our goal today is to build intuition, introduce vocabulary, and demonstrate two modern workflows — not to provide exhaustive technical coverage.

“The problem of missing data is one of the most pervasive in the practice of statistics.” — Little & Rubin (2002)

Roadmap

  1. Why missing data matters — it’s not just inconvenience
  2. Taxonomy — MCAR, MAR, MNAR
  3. The old (wrong) ways — what not to do and why
  4. Modern Method 1: Full Information Maximum Likelihood (FIML) for path models
  5. Modern Method 2: Multiple Imputation by Chained Equations (MICE)
  6. Practical guidance and wrap-up

Why Missing Data Matters

Missing Data Is Everywhere

Real data is almost never complete. Common sources:

  • Survey non-response (certain groups less likely to respond)
  • Attrition in longitudinal studies
  • Instrument or sensor failure
  • Routing in adaptive questionnaires
  • Merging data from multiple sources

The danger: Most software defaults to complete-case analysis — silently dropping rows with any missing values.
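For example, base R's lm() silently applies listwise deletion (its default na.action is na.omit). The tiny data frame below is made up for illustration:

```r
# Two variables, each with some missing values
dat <- data.frame(y = c(1.2, 2.4, NA, 3.1, 4.0, NA),
                  x = c(0.5, NA, 1.0, 1.5, 2.0, 2.5))

fit <- lm(y ~ x, data = dat)
nobs(fit)  # only 3 of 6 rows were used; the rest were dropped without warning
```

Nothing in the printed model summary flags that half the sample vanished — you have to ask.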

Two Consequences of Mishandling Missing Data

  • Bias — estimates that are systematically wrong
  • Inefficiency — throwing away information, inflating standard errors

A Simple Motivating Example

Even a moderate amount of non-random missingness visibly biases our estimate.

Taxonomy of Missing Data

The Three Mechanisms (Rubin, 1976)

Let \(Y\) be the full data matrix and \(R\) be the missingness indicator (1 = observed, 0 = missing).

Mechanism: MCAR
Definition: \(P(R \mid Y) = P(R)\)
Intuition: Missingness is a random coin flip, unrelated to any data values

The Three Mechanisms (Rubin, 1976)

Mechanism: MAR
Definition: \(P(R \mid Y) = P(R \mid Y_{obs})\)
Intuition: Missingness depends on observed data, not on the missing values themselves

The Three Mechanisms (Rubin, 1976)

Mechanism: MNAR
Definition: \(P(R \mid Y) \neq P(R \mid Y_{obs})\); missingness still depends on \(Y_{mis}\) even after conditioning on \(Y_{obs}\)
Intuition: Missingness depends on the missing values themselves

Important

MCAR ⊂ MAR. MCAR is a special case of MAR. MNAR is the hardest — and most realistic — case.

MCAR — Missing Completely at Random

Definition: The probability of missingness is the same for all units, regardless of any data values.

Proportion missing: 0.3 
Mean of x for observed y:   -0.037 
Mean of x for missing y:    0.015 

Under MCAR, observed and missing groups look similar on all covariates.

Test: Little’s MCAR test (naniar::mcar_test() in R) — though low power in small samples.
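A quick sketch of running Little's test with the naniar package (airquality is a built-in R dataset with missing Ozone and Solar.R values, used here purely for illustration):

```r
library(naniar)

# Little's (1988) chi-squared test of MCAR. A small p-value suggests the
# data are NOT MCAR; a large p-value is consistent with MCAR but is weak
# evidence in small samples.
mcar_test(airquality)
```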

MAR — Missing at Random

Definition: Missingness depends on observed variables, but NOT on the variable that is missing.

Proportion missing: 0.487 
Mean of x for observed y:   -0.509 
Mean of x for missing y:    0.493 

X predicts missingness — but once we condition on X, missingness is random.

Key implication: Methods that use the full covariate structure (FIML, MICE) can recover unbiased estimates under MAR.
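A sketch of how MAR missingness like this can be generated (the seed and coefficients are illustrative, not the slide's exact values):

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# MAR: the probability that y is missing depends only on the OBSERVED x
p_miss <- plogis(1.5 * x)              # larger x -> more likely to be missing
y_obs  <- ifelse(runif(n) < p_miss, NA, y)

mean(is.na(y_obs))                     # proportion missing, near 0.5
tapply(x, is.na(y_obs), mean)          # observed vs missing groups differ on x
```

Within any narrow band of x, whether y is missing is again a coin flip — that is what "random, conditional on the observed data" means.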

MNAR — Missing Not at Random

Definition: Missingness depends on the unobserved values themselves, even after conditioning on all observed data.

Proportion missing: 0.473 
Mean of OBSERVED y:   -1.266 
Mean of TRUE y:       -0.06 

Warning

MNAR is the hardest case. Neither FIML nor MICE provides a general solution. Addressing MNAR requires sensitivity analyses and explicit models for the missingness mechanism.

Summary: The Three Mechanisms

The Old (Wrong) Ways

Approach 1: Listwise (Complete-Case) Deletion

What it does: Drop any observation with a missing value on any variable in the model.

Original N: 10 
After listwise deletion N: 5 
Percent retained: 50 %

Problems:

  • Requires MCAR for unbiased estimates — rarely justified
  • Loses substantial power; in wide datasets with many variables, near-total loss of data is possible
  • Dropping rows changes the sample — who is left?

Approach 2: Pairwise Deletion

What it does: Use all available observations for each pairwise calculation (e.g., each cell of a correlation matrix).

     x1    x2    x3
x1 1.00 0.510 0.070
x2 0.51 1.000 0.384
x3 0.07 0.384 1.000

Problem: Each correlation uses a different subset of the data.

  • The resulting correlation matrix may not be positive semi-definite — no single complete dataset could have produced it.
  • SEM/path models will fail or produce nonsensical results.

Approach 3: Mean Imputation

What it does: Replace each missing value with the variable’s observed mean.

True SD:          1.949 
Observed SD:      1.949 
After mean imp SD: 1.629 

Problems:

  • Artificially deflates variance — attenuates all correlations and regression coefficients
  • Distorts the distribution (spike at the mean)
  • Standard errors are wrong — the imputed values are treated as real observations
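The variance deflation is easy to demonstrate with a fresh simulation (numbers here are not the slide's):

```r
set.seed(2)
y <- rnorm(200, mean = 0, sd = 2)

y_mcar <- y
y_mcar[sample(200, 60)] <- NA          # 30% missing completely at random

# Mean imputation: every missing value becomes the observed mean
y_imp <- ifelse(is.na(y_mcar), mean(y_mcar, na.rm = TRUE), y_mcar)

sd(y_mcar, na.rm = TRUE)               # SD of the observed values
sd(y_imp)                              # always smaller: imputed points sit at the mean
```

Because the imputed points contribute zero deviation from the mean, the imputed-data SD is necessarily smaller than the observed-data SD.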

Approach 4: Single (Regression) Imputation

What it does: Predict missing values from a regression on observed variables. Slightly better than mean imputation, but still flawed.

True residual SD:     1.421 
Imputed residual SD:  1.158 

Problem: Single imputation replaces each missing value with a point estimate — no residual noise.

  • This understates uncertainty and inflates correlations (imputed values lie perfectly on the regression line).

Full Information Maximum Likelihood (FIML)

How FIML Works (Conceptually)

Under FIML (also called direct ML or raw ML):

  • Each observation contributes to the likelihood using only its observed variables
  • No data are deleted or imputed
  • All parameters are estimated in a single step — every case contributes whatever information its observed variables carry

How FIML Works (Conceptually)

The likelihood for observation \(i\) with observed variables \(Y_i^{obs}\):

\[\ell_i(\theta) = -\frac{1}{2} \left[ \log |\Sigma_i| + (Y_i^{obs} - \mu_i)^\top \Sigma_i^{-1} (Y_i^{obs} - \mu_i) \right]\]

where \(\mu_i\) and \(\Sigma_i\) are the model-implied mean vector and covariance matrix restricted to the observed variables for case \(i\).

Tip

Key advantage: FIML uses the covariance structure to “borrow strength” from observed variables to recover information about missing ones — all within the model itself.
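The casewise log-likelihood above can be sketched directly in R. The helper below is illustrative (not lavaan's internal code), and it omits the additive constant \(-\tfrac{p_i}{2}\log 2\pi\), matching the formula on the previous slide:

```r
# Log-likelihood contribution of one case, using only its observed variables
fiml_loglik_case <- function(y, mu, Sigma) {
  obs <- !is.na(y)
  yo  <- y[obs]
  muo <- mu[obs]
  So  <- Sigma[obs, obs, drop = FALSE]  # model-implied covariance, observed block
  drop(-0.5 * (log(det(So)) + t(yo - muo) %*% solve(So) %*% (yo - muo)))
}

# A case with one variable missing still contributes information:
fiml_loglik_case(c(0.4, NA, -0.2), mu = rep(0, 3), Sigma = diag(3))  # -0.1
```

Summing these contributions over all cases gives the full FIML log-likelihood that the optimizer maximizes.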

Path Analysis Example: Simulating the Data

We simulate a simple three-variable path model with a known structure:

\[X_1 \rightarrow X_2 \rightarrow Y, \quad X_1 \rightarrow Y\]

True path coefficients:
  x2 ~ x1 : 0.50
  y  ~ x1 : 0.30
  y  ~ x2 : 0.55
N = 500 | No missing values
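A sketch of simulating data with this structure (the seed is arbitrary; the slide's exact code is not shown):

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.50 * x1 + rnorm(n)                # x2 ~ x1, true slope 0.50
y  <- 0.30 * x1 + 0.55 * x2 + rnorm(n)    # y ~ x1 + x2, true slopes 0.30, 0.55
dat <- data.frame(x1, x2, y)

round(coef(lm(y ~ x1 + x2, data = dat)), 2)  # estimates should be near the truth
```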

Introducing MAR Missingness

We introduce missingness in \(Y\) and \(X_2\) that depends on observed \(X_1\):

Missing in y:  140 ( 28 %)
Missing in x2: 88 ( 17.6 %)
Complete cases: 291 ( 58.2 %)

Over 40% of the sample is incomplete. Complete-case analysis would throw away a large fraction of observations.
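One way to generate this kind of MAR missingness (the data are regenerated here so the snippet runs on its own; all coefficients are illustrative):

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.50 * x1 + rnorm(n)
y  <- 0.30 * x1 + 0.55 * x2 + rnorm(n)
dat <- data.frame(x1, x2, y)

# Missingness in y and x2 depends only on the fully observed x1 (MAR)
dat$y [runif(n) < plogis(-1.0 + 1.2 * x1)] <- NA
dat$x2[runif(n) < plogis(-1.8 + 0.8 * x1)] <- NA

colMeans(is.na(dat))        # per-variable missingness rates
mean(complete.cases(dat))   # fraction of fully observed cases
```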

Fitting the Path Model: Complete Cases vs. FIML

  lhs rhs        method true_value   est    se   bias
1  x2  x1 Complete Case       0.50 0.603 0.056  0.103
2   y  x1 Complete Case       0.30 0.108 0.052 -0.192
3   y  x2 Complete Case       0.55 0.661 0.046  0.111
4  x2  x1          FIML       0.50 0.564 0.042  0.064
5   y  x1          FIML       0.30 0.115 0.049 -0.185
6   y  x2          FIML       0.55 0.654 0.044  0.104
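The two fits in the table can be sketched with lavaan as follows. This is a minimal sketch under the simulation assumptions above, not the slide's exact code:

```r
library(lavaan)

# Regenerate illustrative data with MAR missingness driven by x1
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.50 * x1 + rnorm(n)
y  <- 0.30 * x1 + 0.55 * x2 + rnorm(n)
dat <- data.frame(x1, x2, y)
dat$y [runif(n) < plogis(-1.0 + 1.2 * x1)] <- NA
dat$x2[runif(n) < plogis(-1.8 + 0.8 * x1)] <- NA

model <- "
  x2 ~ x1
  y  ~ x1 + x2
"
fit_cc   <- sem(model, data = dat)                  # default: listwise deletion
fit_fiml <- sem(model, data = dat,
                missing = "fiml", fixed.x = FALSE)  # casewise (full information) ML

parameterEstimates(fit_fiml)[1:3, c("lhs", "op", "rhs", "est", "se")]
```

Setting fixed.x = FALSE lets lavaan model the exogenous variable too, so cases with a missing x2 still contribute under FIML.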

FIML Results: Visual Comparison

FIML: Key Practical Notes

  • Requires normality — technically assumes multivariate normal data; robust versions (estimator = "MLR") handle modest violations
  • Works within the model — missingness is handled at the model level, no pre-processing needed
  • Assumes MAR — if data are MNAR, FIML estimates are still biased
  • Best for SEM/path models — when you’re already in the lavaan/SEM framework, FIML is the natural choice
  • No imputed datasets — results cannot be used outside the model; if you need imputed data for downstream analyses, use MICE

Multiple Imputation by Chained Equations (MICE)

The MICE Algorithm (Overview)

Three stages:

  1. Impute — Create \(m\) complete datasets by drawing from the conditional distribution of each missing variable given all others. Each variable gets its own imputation model.

  2. Analyze — Fit the model of interest to each of the \(m\) imputed datasets separately.

  3. Pool — Combine the \(m\) sets of estimates using Rubin’s Rules to produce a single set of estimates and standard errors that reflect both within- and between-imputation uncertainty.
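The three stages map directly onto three mice functions. A minimal sketch (nhanes is a small example dataset bundled with mice; the model is illustrative):

```r
library(mice)

# Stage 1: impute -- create m complete datasets
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Stage 2: analyze -- fit the substantive model to each imputed dataset
fits <- with(imp, lm(bmi ~ age + chl))

# Stage 3: pool -- combine estimates and SEs via Rubin's Rules
summary(pool(fits))
```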

The Key Idea in the MICE Algorithm

Why \(m\) datasets and not one?

  • Single imputation underestimates uncertainty.
  • The variation between imputed datasets captures our uncertainty about the missing values.
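Rubin's Rules make this explicit. With per-dataset estimates \(\hat{Q}_j\) and their variances \(U_j\), \(j = 1, \dots, m\):

\[\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2\]

\[T = \bar{U} + \left(1 + \frac{1}{m}\right)B\]

The total variance \(T\) adds the between-imputation variance \(B\) (our uncertainty about the missing values) to the average within-imputation variance \(\bar{U}\) — exactly the component single imputation throws away.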

The Chained Equations (FCS) Approach

MICE imputes each variable in turn, conditioning on all others:

Initialize: fill in starting values for all missing data

For each iteration:
  For each variable Xj with missing values:
    1. Fit model: Xj ~ all other variables (observed + currently imputed)
    2. Draw from the posterior predictive distribution
    3. Replace missing values of Xj with the draws

Repeat for many cycles until convergence
Return the imputed values from the final cycle

Tip

Each variable can have its own imputation model (linear regression, logistic regression, PMM, etc.) — this is the “chained” part. This makes MICE very flexible for mixed variable types.
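To see this flexibility, here is a sketch using nhanes2 (bundled with mice), which mixes continuous and categorical columns; mice chooses a sensible default method per variable type:

```r
library(mice)

# nhanes2: age (factor, complete), bmi and chl (numeric), hyp (binary factor)
imp <- mice(nhanes2, m = 5, seed = 1, printFlag = FALSE)

# pmm for numeric columns, logreg for the binary factor, "" for complete columns
imp$method
```

You can also override any entry of the method vector by hand (e.g., method = c(bmi = "norm")) if the defaults are not appropriate.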

MICE in R: Setup and Imputation

Variables with missing data:
 x1  x2   y 
  0  88 140 
Class: mids
Number of multiple imputations:  20 
Imputation methods:
   x1    x2     y 
   "" "pmm" "pmm" 
PredictorMatrix:
   x1 x2 y
x1  0  1 1
x2  1  0 1
y   1  1 0

Checking MICE Convergence

MICE: Analyzing the Imputed Datasets

=== Pooled: x2 ~ x1 ===
         term     estimate  std.error  statistic      p.value
1 (Intercept) -0.004470189 0.04073208 -0.1097461 9.126792e-01
2          x1  0.555040295 0.04107206 13.5138181 2.021284e-33

MICE: Pooled Results for the Outcome Model

=== Pooled: y ~ x1 + x2 ===
         term    estimate  std.error statistic      p.value
1 (Intercept) -0.05851817 0.03617842 -1.617488 1.080476e-01
2          x1  0.12193906 0.04559310  2.674507 8.690785e-03
3          x2  0.64206688 0.03971654 16.166233 5.077378e-40

=== Fraction of Missing Information (FMI) ===
         term       fmi    lambda
1 (Intercept) 0.2985006 0.2884462
2          x1 0.3630472 0.3509408
3          x2 0.1908664 0.1840055

Three-Way Comparison: Complete Case vs. FIML vs. MICE

     Path True    CC  FIML  MICE Bias_CC Bias_FIML Bias_MICE
1 x2 ~ x1 0.50 0.603 0.564 0.555   0.103     0.064     0.055
2  y ~ x1 0.30 0.108 0.115 0.122  -0.192    -0.185    -0.178
3  y ~ x2 0.55 0.661 0.654 0.642   0.111     0.104     0.092

Both FIML and MICE substantially reduce bias relative to complete-case analysis — and the estimates from the two modern methods are quite similar to each other.

Practical Guidance & Wrap-Up

Which Method Should You Use?

Situation Recommendation
Fitting a SEM / path model FIML via missing = "fiml" in lavaan
Regression, multilevel, GLM — mixed data types MICE (mice package)
Need to explore imputed data, check distributions MICE (can inspect imputed datasets)
MNAR suspected Sensitivity analysis; selection or pattern-mixture models
< 5% missing, MCAR plausible Listwise may be acceptable — but still check

Which Method Should You Use?

Important

Never use mean imputation in primary analyses reported in a peer-reviewed paper. If a reviewer or collaborator suggests it, cite Little & Rubin (2002) and explain why.

Key Assumptions to Check

  1. Justify MAR — ask: conditional on what I observed, is there any remaining reason values would be missing? More covariates → MAR more plausible.

  2. Check MICE convergence — trace plots (the plot() method for mids objects) should show no trends across iterations.

  3. Inspect imputed values — do imputed values look plausible? PMM helps here by constraining imputations to the observed range.

Key Assumptions to Check

  4. Report what you did — many papers are vague about missing data handling. Report: % missing per variable, assumed mechanism, method used, and number of imputed datasets if MICE.

  5. Sensitivity to MNAR — if results hinge on MAR, conduct at least one sensitivity analysis relaxing this assumption.

Summary

  • Missing data has three mechanisms: MCAR, MAR, and MNAR — the mechanism determines which methods are valid.
  • Naive methods (listwise, pairwise, mean imputation) are biased or inefficient under anything other than MCAR, and should be avoided for primary analyses.
  • FIML is the preferred approach within SEM/path models — uses all available data without imputation.
  • MICE is a general-purpose multiple imputation framework — flexible, transparent, and widely applicable.
  • Both methods assume MAR — sensitivity analyses are needed if MNAR is plausible.
  • This was a broad overview — see the references for deeper treatment.