A Brief Introduction
Note
This lecture is a broad overview only. Missing data is a deep field with a rich methodological literature. Our goal today is to build intuition, introduce vocabulary, and demonstrate two modern workflows — not to provide exhaustive technical coverage.
“The problem of missing data is one of the most pervasive in the practice of statistics.” — Little & Rubin (2002)
Real data is almost never complete. Common sources: survey non-response, participant dropout in longitudinal studies, skip patterns, equipment or data-entry failures.
The danger: Most software defaults to complete-case analysis — silently dropping rows with any missing values.
Even a moderate amount of non-random missingness visibly biases our estimate.
Let \(Y\) be the full data matrix and \(R\) be the missingness indicator (1 = observed, 0 = missing).
| Mechanism | Definition | Intuition |
|---|---|---|
| MCAR | \(P(R \mid Y) = P(R)\) | Missingness is a random coin flip, unrelated to any data |
| MAR | \(P(R \mid Y) = P(R \mid Y_{obs})\) | Missingness depends on observed data, not on the missing values themselves |
| MNAR | \(P(R \mid Y)\) depends on \(Y_{mis}\) even given \(Y_{obs}\) | Missingness depends on the missing values themselves |
Important
MCAR ⊂ MAR. MCAR is a special case of MAR. MNAR is the hardest — and most realistic — case.
Definition: The probability of missingness is the same for all units, regardless of any data values.
```
Proportion missing: 0.3
Mean of x for observed y: -0.037
Mean of x for missing y: 0.015
```
Under MCAR, observed and missing groups look similar on all covariates.
Test: Little’s MCAR test (naniar::mcar_test() in R) — though it has low power in small samples.
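An MCAR pattern like the one above is easy to reproduce in a few lines of base R (the sample size and 30% rate here are illustrative choices, not the lecture's exact simulation):

```r
set.seed(42)
n <- 1000
x <- rnorm(n)            # a covariate
y <- 0.5 * x + rnorm(n)  # outcome related to x

# MCAR: every y has the same 30% chance of being missing,
# regardless of x or of y itself
miss <- rbinom(n, 1, 0.3) == 1
y_obs <- ifelse(miss, NA, y)

mean(miss)      # proportion missing, close to 0.30
mean(x[!miss])  # mean of x for observed y
mean(x[miss])   # mean of x for missing y -- similar under MCAR
```

Because the coin flip ignores the data, the observed and missing groups have essentially the same distribution of x.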
Definition: Missingness depends on observed variables, but NOT on the variable that is missing.
```
Proportion missing: 0.487
Mean of x for observed y: -0.509
Mean of x for missing y: 0.493
```
X predicts missingness — but once we condition on X, missingness is random.
Key implication: Methods that use the full covariate structure (FIML, MICE) can recover unbiased estimates under MAR.
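The MAR pattern above can be simulated by letting the missingness probability depend on the observed covariate (the logistic link and coefficient here are illustrative assumptions):

```r
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# MAR: the probability that y is missing depends on observed x
# (via a logistic model), but not on y itself given x
p_miss <- plogis(1.5 * x)          # larger x -> more likely missing
miss   <- rbinom(n, 1, p_miss) == 1

mean(miss)      # roughly half missing
mean(x[!miss])  # negative: observed cases tend to have smaller x
mean(x[miss])   # positive: missing cases tend to have larger x
```

The two group means of x now diverge sharply, yet conditional on x the missingness is pure noise, which is exactly what FIML and MICE exploit.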
Definition: Missingness depends on the unobserved values themselves, even after conditioning on all observed data.
```
Proportion missing: 0.473
Mean of OBSERVED y: -1.266
Mean of TRUE y: -0.06
```
Warning
MNAR is the hardest case. Neither FIML nor MICE provides a general solution. Addressing MNAR requires sensitivity analyses and explicit models for the missingness mechanism.
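One simple sensitivity analysis is delta adjustment: fill in missing values under a MAR-style rule, then shift the imputations by a range of offsets and watch how the estimate moves. A toy base-R sketch (the mechanism, the coefficient 2, and the delta grid are all illustrative assumptions):

```r
set.seed(1)
n <- 1000
y <- rnorm(n)
# MNAR: larger y values are more likely to be missing
miss  <- rbinom(n, 1, plogis(2 * y)) == 1
y_obs <- ifelse(miss, NA, y)

mean(y_obs, na.rm = TRUE)  # observed mean is biased well below 0
mean(y)                    # true mean is near 0

# Delta adjustment: impute the observed mean plus a shift delta,
# then report the estimate across a range of plausible deltas
for (delta in c(0, 0.5, 1, 1.5)) {
  y_imp <- ifelse(miss, mean(y_obs, na.rm = TRUE) + delta, y_obs)
  cat("delta =", delta, "-> overall mean =", round(mean(y_imp), 3), "\n")
}
```

If conclusions survive across the delta grid, they are robust to at least this form of MNAR; if they flip, the MAR assumption is doing real work.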
What it does: Drop any observation with a missing value on any variable in the model.
```
Original N: 10
After listwise deletion N: 5
Percent retained: 50%
```
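In R this is what na.omit() does; a small data frame makes the loss visible (the values below are illustrative, not the lecture's exact example):

```r
# Toy data frame with NAs scattered across variables
df <- data.frame(
  x1 = c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10),
  x2 = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y  = c(1, NA, 3, 4, 5, 6, 7, NA, 9, 10)
)

complete <- na.omit(df)  # listwise deletion: drop rows with ANY NA
nrow(df)                          # 10
nrow(complete)                    # 5
100 * nrow(complete) / nrow(df)   # 50% retained
```

Note that no single variable is missing more than 20% of its values, yet half the rows are lost: missingness spread across variables compounds.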
Problems:

- Discards information: losses compound when missingness is spread across many variables.
- Estimates are biased unless the data are MCAR.
- Different models fit to the same data can end up using different subsamples.
What it does: Use all available observations for each pairwise calculation (e.g., each cell of a correlation matrix).
```
     x1    x2    x3
x1 1.00 0.510 0.070
x2 0.51 1.000 0.384
x3 0.07 0.384 1.000
```
Problem: Each correlation uses a different subset of the data, so the entries need not be mutually consistent — the resulting matrix can fail to be positive definite.
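In R, pairwise deletion is the `use = "pairwise.complete.obs"` option to cor(); a sketch with simulated data (structure and missingness rates are illustrative):

```r
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.4 * x2 + rnorm(n)
dat <- data.frame(x1, x2, x3)

# Knock out different values in different variables
dat$x1[sample(n, 40)] <- NA
dat$x2[sample(n, 40)] <- NA
dat$x3[sample(n, 40)] <- NA

# Each correlation uses whichever cases observe BOTH variables
cor(dat, use = "pairwise.complete.obs")
```

Each off-diagonal entry here is based on a different, partially overlapping subsample, which is exactly why the matrix as a whole can misbehave.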
What it does: Replace each missing value with the variable’s observed mean.
```
True SD: 1.949
Observed SD: 1.949
After mean imp SD: 1.629
```
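The shrinkage shown above is mechanical and easy to reproduce (sample size, SD, and missingness rate below are illustrative):

```r
set.seed(11)
y <- rnorm(1000, sd = 2)
miss  <- rbinom(1000, 1, 0.3) == 1
y_obs <- ifelse(miss, NA, y)

# Replace every NA with the observed mean
y_mean_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

sd(y)                    # true SD, about 2
sd(y_obs, na.rm = TRUE)  # observed SD, similar under MCAR
sd(y_mean_imp)           # shrunken: imputed values add zero spread
```

Every imputed value sits exactly at the mean, so the imputed sample is artificially concentrated there.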
Problems:

- Shrinks the variance and attenuates covariances with other variables.
- Distorts the distribution by piling mass at a single point.
- Standard errors are too small because imputed values are treated as real data.
What it does: Predict missing values from a regression on observed variables. Slightly better than mean imputation, but still flawed.
```
True residual SD: 1.421
Imputed residual SD: 1.158
```
Problem: Single imputation replaces each missing value with a point estimate — no residual noise.
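The missing-noise problem points directly at the fix MICE builds in: add a random residual to each prediction. A base-R sketch contrasting the two (all simulation settings are illustrative):

```r
set.seed(21)
n <- 1000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
miss  <- rbinom(n, 1, 0.3) == 1
y_obs <- ifelse(miss, NA, y)

fit  <- lm(y_obs ~ x)  # fit on the complete cases
pred <- predict(fit, newdata = data.frame(x = x))

# Deterministic regression imputation: predictions only
y_det <- ifelse(miss, pred, y_obs)
# Stochastic version: add a draw from the residual distribution
y_sto <- ifelse(miss, pred + rnorm(n, 0, sigma(fit)), y_obs)

sd(y)      # truth
sd(y_det)  # too small: imputations sit exactly on the regression line
sd(y_sto)  # variability restored
```

Stochastic regression imputation fixes the variance but, as a single imputation, still understates uncertainty — the motivation for drawing multiple datasets.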
Under FIML (also called direct ML or raw ML), each case contributes to the likelihood using only the variables it actually observed.
The likelihood for observation \(i\) with observed variables \(Y_i^{obs}\):
\[\ell_i(\theta) = -\frac{1}{2} \left[ \log |\Sigma_i| + (Y_i^{obs} - \mu_i)^\top \Sigma_i^{-1} (Y_i^{obs} - \mu_i) \right]\]
where \(\mu_i\) and \(\Sigma_i\) are the model-implied mean and covariance for the observed subset of variables for case \(i\).
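That casewise contribution can be evaluated directly in base R. The helper below is a sketch (the function name is mine, and the \(-\tfrac{k}{2}\log 2\pi\) constant, dropped from the expression above, is included):

```r
# Log-likelihood contribution of one case, using only the variables
# that case observed, given model-implied moments mu and Sigma
loglik_case <- function(y, mu, Sigma) {
  obs <- !is.na(y)  # which variables this case observed
  yo  <- y[obs]
  mo  <- mu[obs]
  So  <- Sigma[obs, obs, drop = FALSE]
  k   <- sum(obs)
  -0.5 * (k * log(2 * pi) + log(det(So)) +
          t(yo - mo) %*% solve(So) %*% (yo - mo))
}

mu    <- c(0, 0, 0)
Sigma <- diag(3)
loglik_case(c(1, NA, -1), mu, Sigma)  # uses only variables 1 and 3
```

Summing these contributions over all cases gives the full-information likelihood; no row is ever dropped, each simply contributes a smaller observed subset.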
Tip
Key advantage: FIML uses the covariance structure to “borrow strength” from observed variables to recover information about missing ones — all within the model itself.
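In lavaan, switching FIML on is a single argument. The sketch below simulates illustrative stand-in data (not the lecture's exact dataset) so the call is self-contained:

```r
library(lavaan)

set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.3 * x1 + 0.55 * x2 + rnorm(n)
# MAR missingness in y depending on observed x1 (illustrative)
y[rbinom(n, 1, plogis(x1)) == 1] <- NA
dat <- data.frame(x1, x2, y)

model <- "
  x2 ~ x1
  y  ~ x1 + x2
"

# missing = "fiml" switches lavaan from listwise deletion
# to the casewise (full-information) likelihood
fit <- sem(model, data = dat, missing = "fiml")
coef(fit)
```

The analysis model is unchanged; only the estimation of the likelihood differs, which is why FIML integrates so cleanly into SEM workflows.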
We simulate a simple three-variable path model with a known structure:
\[X_1 \rightarrow X_2 \rightarrow Y, \quad X_1 \rightarrow Y\]
True path coefficients:
```
x2 ~ x1 : 0.50
y  ~ x1 : 0.30
y  ~ x2 : 0.55
```
N = 500 | No missing values
We introduce missingness in \(Y\) and \(X_2\) that depends on observed \(X_1\):
```
Missing in y:    140 (28%)
Missing in x2:    88 (17.6%)
Complete cases:  291 (58.2%)
```
Over 40% of the sample is incomplete. Complete-case analysis would throw away a large fraction of observations.
```
  lhs rhs        method true_value   est    se   bias
1  x2  x1 Complete Case       0.50 0.603 0.056  0.103
2   y  x1 Complete Case       0.30 0.108 0.052 -0.192
3   y  x2 Complete Case       0.55 0.661 0.046  0.111
4  x2  x1          FIML       0.50 0.564 0.042  0.064
5   y  x1          FIML       0.30 0.115 0.049 -0.185
6   y  x2          FIML       0.55 0.654 0.044  0.104
```
FIML assumes multivariate normality; robust corrections (estimator = "MLR" in lavaan) handle modest violations.

Multiple imputation proceeds in three stages:
Impute — Create \(m\) complete datasets by drawing from the conditional distribution of each missing variable given all others. Each variable gets its own imputation model.
Analyze — Fit the model of interest to each of the \(m\) imputed datasets separately.
Pool — Combine the \(m\) sets of estimates using Rubin’s Rules to produce a single set of estimates and standard errors that reflect both within- and between-imputation uncertainty.
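Rubin's Rules themselves are only a few lines of arithmetic. Given \(m\) point estimates and their squared standard errors, a generic sketch (the function name is mine, not mice's internal code):

```r
# Pool m estimates (q) and their squared SEs (u) via Rubin's Rules
pool_rubin <- function(q, u) {
  m     <- length(q)
  qbar  <- mean(q)                 # pooled point estimate
  ubar  <- mean(u)                 # within-imputation variance
  b     <- var(q)                  # between-imputation variance
  total <- ubar + (1 + 1 / m) * b  # total variance
  c(estimate = qbar,
    se       = sqrt(total),
    lambda   = (1 + 1 / m) * b / total)  # share due to missingness
}

# e.g. five imputed-dataset estimates of the same coefficient
pool_rubin(q = c(0.52, 0.48, 0.55, 0.50, 0.49),
           u = c(0.04, 0.05, 0.04, 0.05, 0.04)^2)
```

The between-imputation term is what a single imputation throws away: it is precisely the extra uncertainty caused by not knowing the missing values.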
Why \(m\) datasets and not one? A single imputed dataset treats filled-in values as if they were observed, understating uncertainty. With \(m\) datasets, the variation between imputations enters the pooled standard errors.
MICE imputes each variable in turn, conditioning on all others:
```
Initialize: fill in starting values for all missing data
For each iteration (cycle):
  For each variable Xj with missing values:
    1. Fit model: Xj ~ all other variables (observed + currently imputed)
    2. Draw from the posterior predictive distribution
    3. Replace missing values of Xj with the draws
Repeat for many cycles until convergence
Return the imputed values from the final cycle
```
Tip
Each variable can have its own imputation model (linear regression, logistic regression, PMM, etc.) — this is the “chained” part. This makes MICE very flexible for mixed variable types.
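A minimal end-to-end MICE workflow in R looks like the sketch below; the simulated data are illustrative stand-ins for the lecture's dataset:

```r
library(mice)

set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.3 * x1 + 0.55 * x2 + rnorm(n)
# MAR missingness in x2 and y, driven by observed x1 (illustrative)
x2[rbinom(n, 1, plogis(x1 - 1)) == 1] <- NA
y[rbinom(n, 1, plogis(x1)) == 1]      <- NA
dat <- data.frame(x1, x2, y)

# Impute: m = 20 datasets, predictive mean matching (PMM)
imp  <- mice(dat, m = 20, method = "pmm", printFlag = FALSE)
# Analyze: fit the substantive model to each imputed dataset
fits <- with(imp, lm(y ~ x1 + x2))
# Pool: combine the 20 sets of estimates via Rubin's Rules
pool(fits)
```

The three calls map one-to-one onto the impute / analyze / pool stages above; variables with no missing values (here x1) are simply skipped by the imputation step.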
```
Variables with missing data:
 x1  x2   y
  0  88 140

Class: mids
Number of multiple imputations: 20
Imputation methods:
   x1    x2     y
   "" "pmm" "pmm"
PredictorMatrix:
   x1 x2 y
x1  0  1 1
x2  1  0 1
y   1  1 0
```
```
=== Pooled: x2 ~ x1 ===
         term     estimate  std.error  statistic      p.value
1 (Intercept) -0.004470189 0.04073208 -0.1097461 9.126792e-01
2          x1  0.555040295 0.04107206 13.5138181 2.021284e-33

=== Pooled: y ~ x1 + x2 ===
         term    estimate  std.error statistic      p.value
1 (Intercept) -0.05851817 0.03617842 -1.617488 1.080476e-01
2          x1  0.12193906 0.04559310  2.674507 8.690785e-03
3          x2  0.64206688 0.03971654 16.166233 5.077378e-40

=== Fraction of Missing Information (FMI) ===
         term       fmi    lambda
1 (Intercept) 0.2985006 0.2884462
2          x1 0.3630472 0.3509408
3          x2 0.1908664 0.1840055
```
```
     Path True    CC  FIML  MICE Bias_CC Bias_FIML Bias_MICE
1 x2 ~ x1 0.50 0.603 0.564 0.555   0.103     0.064     0.055
2  y ~ x1 0.30 0.108 0.115 0.122  -0.192    -0.185    -0.178
3  y ~ x2 0.55 0.661 0.654 0.642   0.111     0.104     0.092
```
Both FIML and MICE substantially reduce bias relative to complete-case analysis — and the estimates from the two modern methods are quite similar to each other.
| Situation | Recommendation |
|---|---|
| Fitting a SEM / path model | FIML via missing = "fiml" in lavaan |
| Regression, multilevel, GLM — mixed data types | MICE (mice package) |
| Need to explore imputed data, check distributions | MICE (can inspect imputed datasets) |
| MNAR suspected | Sensitivity analysis; selection or pattern-mixture models |
| < 5% missing, MCAR plausible | Listwise may be acceptable — but still check |
Important
Never use mean imputation in primary analyses reported in a peer-reviewed paper. If a reviewer or collaborator suggests it, cite Little & Rubin (2002) and explain why.
1. Justify MAR — ask: conditional on what I observed, is there any remaining reason values would be missing? More covariates → MAR more plausible.
2. Check MICE convergence — trace plots should show no trends across iterations.
3. Inspect imputed values — do imputed values look plausible? PMM helps here by constraining imputations to the observed range.
4. Report what you did — many papers are vague about missing data handling. Report: % missing per variable, assumed mechanism, method used, number of imputed datasets if MICE.
5. Sensitivity to MNAR — if results hinge on MAR, conduct at least one sensitivity analysis relaxing this assumption.
