Bayesian Estimation of Multidimensional Latent Variable Models

Author

Lecture 8

Today’s Lecture Objectives

  1. Show how to estimate multidimensional latent variable models using MCMC (Stan)
  2. Describe why Stan needs to use the Cholesky decomposition for covariance matrices of latent variables

Example Data: Simulated Two Dimensions

Today we will be using the simulated data we started class with and used in Lecture 7

  • 10 items
  • 1000 examinees
  • 2 dimensions
  • Both continuous (for CFA) and categorical (for IRT) data
  • Q-matrix:
Qmatrix
       theta1 theta2
item1       1      0
item2       1      0
item3       1      0
item4       1      0
item5       1      0
item6       0      1
item7       0      1
item8       0      1
item9       0      1
item10      0      1
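As a point of reference, here is a minimal R sketch that builds this Q-matrix (the object name Qmatrix matches the printout above; the construction itself is illustrative):

# Build the 10 x 2 Q-matrix: items 1-5 measure theta1, items 6-10 measure theta2
Qmatrix <- matrix(
  0,
  nrow = 10, ncol = 2,
  dimnames = list(paste0("item", 1:10), c("theta1", "theta2"))
)
Qmatrix[1:5,  "theta1"] <- 1   # items 1-5 load only on theta1
Qmatrix[6:10, "theta2"] <- 1   # items 6-10 load only on theta2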

Our Example: Multivariate Normal Distribution

For our example, we will assume the set of traits follows a multivariate normal distribution

\[ f\left(\boldsymbol{\theta}_p \right) = \left(2 \pi \right)^{-\frac{D}{2}} \det\left(\boldsymbol{\Sigma}_\theta \right)^{-\frac{1}{2}}\exp\left[-\frac{1}{2}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right)^T\boldsymbol{\Sigma}_\theta^{-1}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right) \right] \] Where:

  • \(\pi \approx 3.14\)
  • \(D\) is the number of latent variables (dimensions)
  • \(\boldsymbol{\Sigma}_\theta\) is the covariance matrix of the latent variables
    • \(\boldsymbol{\Sigma}_\theta^{-1}\) is the inverse of the covariance matrix
  • \(\boldsymbol{\mu}_\theta\) is the mean vector of the latent variables
  • \(\det\left( \cdot\right)\) is the matrix determinant function
  • \(\left(\cdot \right)^T\) is the matrix transpose operator

Alternatively, we can specify \(\boldsymbol{\theta}_p \sim N_D\left( \boldsymbol{\mu}_\theta, \boldsymbol{\Sigma}_\theta \right)\); however, we cannot always estimate \(\boldsymbol{\mu}_\theta\) and \(\boldsymbol{\Sigma}_\theta\) (some elements must be fixed to identify the model)
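As a minimal sketch, the density above can be evaluated directly in R; the trait vector, mean vector, and covariance matrix below are illustrative values:

theta_p  <- c(0.5, -0.3)                            # one examinee's latent traits
mu_theta <- c(0, 0)                                 # mean vector
Sigma_theta <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # covariance matrix

D   <- length(theta_p)
dev <- theta_p - mu_theta

# Log of the multivariate normal density, following the formula above
logf <- -(D / 2) * log(2 * pi) -
  (1 / 2) * log(det(Sigma_theta)) -
  (1 / 2) * drop(t(dev) %*% solve(Sigma_theta) %*% dev)

# For comparison: mvtnorm::dmvnorm(theta_p, mu_theta, Sigma_theta, log = TRUE)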

Observed Item Model

For the IRT analyses, we will use a two-parameter logistic (2PL) item response model where:

\[ P\left(Y_{pi} = 1 \mid \theta \right) = \frac{\exp(\mu_{i}+ \sum_{d=1}^D q_{id}\lambda_{id}\theta_{pd})}{1+\exp(\mu_{i}+ \sum_{d=1}^D q_{id}\lambda_{id}\theta_{pd})}\]

For the CFA analyses, we will use a linear factor model where:

\[ Y_{pi} = \mu_{i}+ \sum_{d=1}^D q_{id}\lambda_{id}\theta_{pd} + \epsilon_{pi}\]

Where:

  • \(Y_{pi}\) is the response of person \(p\) to item \(i\)
  • \(\theta_{pd}\) is the score of person \(p\) on dimension \(d\)
  • \(\mu_{i}\) is the intercept of item \(i\)
  • \(\lambda_{id}\) is the factor loading of item \(i\) on dimension \(d\)
  • \(q_{id}\) is the Q-matrix value for item \(i\) on dimension \(d\)
  • \(\epsilon_{pi}\) is the residual error of person \(p\) on item \(i\) (CFA model only)
  • \(D\) is the number of dimensions
  • \(I\) is the number of items
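As a small numeric sketch, both linear predictors can be formed with the Qmatrix object built earlier; the item intercepts, loadings, and trait values below are illustrative:

mu      <- rep(-1, 10)                       # item intercepts (illustrative)
lambda  <- matrix(1.2, nrow = 10, ncol = 2)  # loadings before applying the Q-matrix
theta_p <- c(0.5, -0.3)                      # one examinee's latent traits

# Linear predictor: mu_i + sum_d q_id * lambda_id * theta_pd
eta <- mu + (Qmatrix * lambda) %*% theta_p

# 2PL IRT model: response probability for each item
prob_correct <- exp(eta) / (1 + exp(eta))

# CFA model: eta is the conditional mean of Y_pi; epsilon_pi adds residual error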

Stan versus JAGS

Bayesian analyses can be implemented in many different software packages

  • I focus today on Stan (but I will leave JAGS code in the folder)
  • Stan:
    • Hamiltonian Monte Carlo (a more efficient type of Metropolis algorithm)
    • Can directly model only continuous parameters (no discrete parameters, so DCMs and mixture models are harder to implement)
    • Has two R interfaces
      • rstan (older; no longer developed as actively)
      • cmdstanr (newer; actively developed)
    • Compiles the model to C++ code (so it is fast)
    • Has built-in LKJ prior for correlation matrices
  • JAGS
    • Uses a number of different sampling methods (Gibbs and Metropolis-Hastings based)
    • Generally slower than Stan (runs as a separate standalone program)
    • Can model continuous and discrete parameters
    • Difficult to estimate models with standardized factors and correlations

LKJ Priors for Correlation Matrices

From Stan’s Functions Reference, for a correlation matrix \(\textbf{R}_\theta\)

Correlation Matrix Properties:

  • Positive definite (which implies a determinant greater than zero: \(\det\left(\textbf{R}_\theta \right) > 0\))
  • Symmetric
  • Diagonal values are all one

The LKJ prior, with hyperparameter \(\eta\), has density proportional to the determinant of the correlation matrix raised to the power \(\eta - 1\)

\[\text{LKJ}\left(\textbf{R}_\theta \mid \eta \right) \propto \det\left(\textbf{R}_\theta \right)^{(\eta-1)} \] Where:

  • With \(\eta=1\), the density is uniform over all correlation matrices
  • With \(\eta>1\), the identity matrix is the mode (correlations are pulled toward zero)
  • With \(0<\eta<1\), the density has a trough at the identity matrix (correlations are pushed away from zero)

For this example, we set \(\eta=1\), giving a uniform prior over all correlation matrices
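A quick R sketch of the kernel \(\det\left(\textbf{R}_\theta\right)^{\eta-1}\) for a \(2 \times 2\) correlation matrix shows these three behaviors (the correlation and \(\eta\) values are illustrative):

lkj_kernel <- function(rho, eta) {
  R_theta <- matrix(c(1, rho, rho, 1), nrow = 2)  # 2 x 2 correlation matrix
  det(R_theta)^(eta - 1)                          # LKJ density kernel
}

lkj_kernel(rho = 0.8, eta = 1)  # = 1: flat in rho, so eta = 1 is uniform
lkj_kernel(rho = 0.8, eta = 5)  # < 1: eta > 1 penalizes large correlations
lkj_kernel(rho = 0.0, eta = 5)  # = 1: the identity matrix is the mode

In Stan, this prior is typically applied to a Cholesky factor of the correlation matrix via lkj_corr_cholesky(), which is where the next section picks up.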

Cholesky Decomposition

The functions we are using do not use the correlation matrix directly

The problem:

  • We need the inverse and the determinant of the correlation matrix to evaluate the multivariate normal density
    • In their general form, inverses and determinants are numerically unstable to compute

The solution:

  • Factor \(\textbf{R}_\theta\) into a triangular form using the Cholesky decomposition

\[\textbf{R}_\theta = \textbf{L}_\theta \textbf{L}_\theta^T\]
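A short R sketch of the decomposition; note that R's built-in chol() returns the upper-triangular factor, so the lower-triangular \(\textbf{L}_\theta\) is its transpose (the correlation value is illustrative):

R_theta <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # illustrative correlation matrix

L_theta <- t(chol(R_theta))  # chol() gives U with t(U) %*% U = R_theta
L_theta %*% t(L_theta)       # reproduces R_theta (up to rounding)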

Cholesky Decomposition Math

  • \(\det\left(\textbf{R}_\theta \right) = \det\left(\textbf{L}_\theta \textbf{L}_\theta^T\right) = \det\left(\textbf{L}_\theta\right) \det\left(\textbf{L}_\theta^T\right) = \det\left(\textbf{L}_\theta\right)^2 = \left(\prod_{d=1}^D l_{d,d}\right)^2\) (on the log scale, the product becomes a sum of logs; see the sketch after this list)
  • For the inverse, we work with the term in the exponent of the MVN pdf:

\[-\frac{1}{2}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right)^T\textbf{R}_\theta^{-1}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right) = -\frac{1}{2}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right)^T\left(\textbf{L}_\theta \textbf{L}_\theta^T\right)^{-1}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right) \]

\[ -\frac{1}{2}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right)^T\textbf{L}_\theta^{-T} \textbf{L}_\theta^{-1}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right)\] Then we compute \(\textbf{z} = \textbf{L}_\theta^{-1}\left(\boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta \right)\) by solving the triangular system \(\textbf{L}_\theta \textbf{z} = \boldsymbol{\theta}_p - \boldsymbol{\mu}_\theta\) via substitution, with no explicit inverse needed

  • The term in the exponent is then \(-\frac{1}{2}\textbf{z}^T \textbf{z}\)
  • The triangular solve takes far fewer operations than forming \(\textbf{R}_\theta^{-1}\) explicitly, which minimizes rounding error (see the sketch below)
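Continuing the sketch above, both the log-determinant and the exponent term can be computed from \(\textbf{L}_\theta\) without ever forming \(\textbf{R}_\theta^{-1}\); the triangular solve uses R's forwardsolve():

theta_p  <- c(0.5, -0.3)
mu_theta <- c(0, 0)
dev      <- theta_p - mu_theta

# Log-determinant: 2 * sum of the logs of the diagonal of L_theta
log_det_R <- 2 * sum(log(diag(L_theta)))
log(det(R_theta))                               # same value, computed the direct way

# Solve the triangular system L_theta z = (theta_p - mu_theta) by substitution
z <- forwardsolve(L_theta, dev)

# Exponent term: -1/2 * z'z equals -1/2 * dev' R_theta^{-1} dev
-0.5 * sum(z^2)
-0.5 * drop(t(dev) %*% solve(R_theta) %*% dev)  # direct (less stable) version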

Most algorithms using MVN distributions use some variant of this process (perhaps with a different factorization method such as QR)