Most estimation routines do one of three things:
Minimize Something: Typically found with names that have “least” in the title. Forms of least squares include “Generalized”, “Ordinary”, “Weighted”, “Diagonally Weighted”, “WLSMV”, and “Iteratively Reweighted.” Typically the estimator of last resort due to less desirable properties than ML estimators.
Maximize Something: Typically found with names that have “maximum” in the title. Forms include “Maximum likelihood”, “Marginal Maximum Likelihood”, “Residual Maximum Likelihood” (REML), “Robust ML”. Typically the gold standard of estimators.
Use Simulation to Sample from Something: More recent advances in estimation use sampling techniques. Names include “Bayesian Markov Chain Monte Carlo”, “Gibbs Sampling”, “Metropolis-Hastings”, “Hamiltonian Monte Carlo”, and just “Monte Carlo”. Used for complex models where ML is not available or for methods where prior values are needed.
Provided several assumptions (also called regularity conditions) are met, maximum likelihood estimators have good statistical properties:
Asymptotic Consistency: as the sample size increases, the estimator converges in probability to its true value
Asymptotic Normality: as the sample size increases, the distribution of the estimator is normal (with variance given by the inverse information matrix)
Efficiency: Asymptotically, no other consistent estimator has a smaller standard error
Because they have such nice and well-understood properties, MLEs are commonly used in statistical estimation
For a single random variable \(x \sim N\left( \mu_x, \sigma^2_x \right)\), the univariate normal probability density function (PDF) is
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2_x}} \exp \left(- \frac{\left(x-\mu_x \right)^2}{2\sigma^2_x} \right)\]
For values of \(x\), \(\mu_x\), and \(\sigma^2_x\), \(f(x)\) gives the height of the curve (the likelihood; like a relative frequency)
Turning this around, we can treat the data as fixed and the parameters as unknown; the same function is then read as a likelihood function. For example, with an observed value \(x = 112\) and a known variance \(\sigma^2_x = 189.6\), the likelihood for \(\mu_x\) is
\[L \left(\mu_x \mid x = 112, \sigma_x^2 = 189.6 \right) = \frac{1}{\sqrt{2\pi \times 189.6}} \exp \left(- \frac{\left(112-\mu_x \right)^2}{2 \times 189.6} \right) \]
\[L \left( \mu_x, \sigma^2_x \mid x_1, \ldots, x_N \right) = L \left( \mu_x, \sigma^2_x \mid x_1\right) \times L \left( \mu_x, \sigma^2_x \mid x_2\right) \times \ldots \times L \left( \mu_x, \sigma^2_x \mid x_N\right)\]
\[ = \prod_{p=1}^N f\left(x_p\right) = \prod_{p=1}^N \frac{1}{\sqrt{2\pi\sigma^2_x}} \exp \left(- \frac{\left(x_p-\mu_x \right)^2}{2\sigma^2_x} \right) = \]
\[\left( 2\pi \sigma_x^2 \right)^{- \frac{N}{2}} \exp \left( - \sum_{p=1}^N \frac{\left(x_p - \mu_x \right)^2}{2\sigma_x^2} \right) \]
\(x_{n+1} = x_n - \frac{f\left(x_n\right)}{f'\left(x_n\right)}\)
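As a minimal sketch (not from the lecture materials, using hypothetical simulated data), Newton-Raphson can be applied in R to the score function of the univariate normal log likelihood for \(\mu_x\), holding \(\sigma^2_x\) fixed at a known value:

```r
# Newton-Raphson for the MLE of mu in x ~ N(mu, sigma2) with sigma2 treated as known
# The update applies x_{n+1} = x_n - f(x_n)/f'(x_n) to the score function:
#   score(mu)   = sum(x - mu) / sigma2   (first derivative of the log likelihood)
#   hessian(mu) = -N / sigma2            (second derivative of the log likelihood)
set.seed(1)
x <- rnorm(n = 100, mean = 112, sd = sqrt(189.6))   # hypothetical data
sigma2 <- 189.6

mu <- 100                                   # starting value
for (iter in 1:20) {
  score   <- sum(x - mu) / sigma2
  hessian <- -length(x) / sigma2
  step    <- score / hessian
  mu      <- mu - step                      # Newton-Raphson update
  if (abs(step) < 1e-8) break               # convergence check
}
mu        # matches the closed-form MLE, mean(x)
mean(x)
```

Because this log likelihood is quadratic in \(\mu_x\), the update converges in a single step; for less well-behaved likelihoods, multiple iterations (and good starting values) are needed.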
In R, numerical optimization of the likelihood can be carried out with the optim function
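For example, a minimal sketch (with hypothetical simulated data, not the lecture's own script) of using optim to maximize the univariate normal log likelihood for both \(\mu_x\) and \(\sigma^2_x\):

```r
# Negative log likelihood of x ~ N(mu, sigma2); optim() minimizes by default
negLogLik <- function(par, x) {
  mu     <- par[1]
  sigma2 <- exp(par[2])                     # optimize log(sigma2) to keep sigma2 > 0
  -sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

set.seed(2)
x <- rnorm(n = 200, mean = 112, sd = sqrt(189.6))   # hypothetical data

fit <- optim(par = c(mu = 100, logSigma2 = log(100)), fn = negLogLik,
             x = x, method = "BFGS", hessian = TRUE)
fit$par["mu"]                   # MLE of mu (close to mean(x))
exp(fit$par["logSigma2"])       # MLE of sigma2
sqrt(diag(solve(fit$hessian)))  # approximate SEs for (mu, log sigma2) from the information matrix
```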
ML with the Multivariate Normal Distribution
\[f \left( \boldsymbol{x}_p \right) = \frac{1}{\left(2 \pi \right)^{\frac{V}{2}} \mid \boldsymbol{\Sigma} \mid^{\frac{1}{2}}} \exp \left[ - \frac{\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)}{2} \right] \]
\[\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{x_1}^2 & \sigma_{x_1x_2} & \ldots & \sigma_{x_1x_V} \\ \sigma_{x_2x_1} & \sigma_{x_2}^2 & \ldots& \sigma_{x_2x_V} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{x_Vx_1} & \sigma_{x_Vx_2} & \ldots& \sigma_{x_V}^2 \\ \end{bmatrix} \]
\[ \boldsymbol{\mu} = \begin{bmatrix} \mu_{x_1} \\ \mu_{x_2} \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} ; \boldsymbol{\Sigma} =\begin{bmatrix} \sigma_{x_1}^2 & \sigma_{x_1, x_2} \\ \sigma_{x_1, x_2} & \sigma_{x_2}^2 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \\ \end{bmatrix} \]
\[ f \left( \boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N \right) = f \left( \boldsymbol{x}_1\right) \times f \left( \boldsymbol{x}_2\right) \times \ldots \times f \left( \boldsymbol{x}_N\right) = \prod_{p=1}^N f \left( \boldsymbol{x}_p\right) = \]
\[ \prod_{p=1}^N \frac{1}{\left(2 \pi \right)^{\frac{V}{2}} \mid \boldsymbol{\Sigma} \mid^{\frac{1}{2}}} \exp \left[ - \frac{\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)}{2} \right] = \]
\[ \left(2\pi \right)^{-\frac{NV}{2}} \mid \boldsymbol{\Sigma} \mid^{-\frac{N}{2}} \exp \left[ - \sum_{p=1}^{N} \frac{\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)}{2} \right] \]
From the previous slide:
\[ L\left( \boldsymbol{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma} \right) = \left(2\pi \right)^{-\frac{NV}{2}} \mid \boldsymbol{\Sigma} \mid^{-\frac{N}{2}} \exp \left[ - \sum_{p=1}^{N} \frac{\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)}{2} \right] \]
For this function, there is one mean vector (\(\boldsymbol{\mu}\)), one covariance matrix (\(\boldsymbol{\Sigma}\)), and one data matrix (\(\boldsymbol{X}\))
If we observe the data but do not know the mean vector and/or covariance matrix, then we call this the sample likelihood function
Rather than providing the height of the curve at any value of \(x\), the sample likelihood function provides the likelihood for any values of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\)
\[ \log \left( L \right) = \log \left[ \left(2\pi \right)^{-\frac{NV}{2}} \mid \boldsymbol{\Sigma} \mid^{-\frac{N}{2}} \exp \left( - \sum_{p=1}^{N} \frac{\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)}{2} \right) \right] = \]
\[-\frac{NV}{2} \log \left( 2\pi \right) - \frac{N}{2} \log \left( \left| \boldsymbol{\Sigma} \right| \right) - \sum_{p=1}^N \frac{\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{x}_p^T - \boldsymbol{\mu }\right)}{2} \]
\[\hat{\boldsymbol{\mu}} = \begin{bmatrix} 100 \\ 10.35 \end{bmatrix}\]
\[ \log \left( L \right) = \log \left( 5.494 \times 10^{-55} \right) = -124.9385\]
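As a rough sketch of how such a value can be computed (assuming the mvtnorm package is available; the data and parameter values here are hypothetical, not the lecture's example):

```r
# Evaluate the MVN log likelihood for candidate values of mu and Sigma
library(mvtnorm)   # assumed available: provides rmvnorm() and dmvnorm()

set.seed(3)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
X <- rmvnorm(n = 500, mean = c(0, 0), sigma = Sigma)     # hypothetical N x V data matrix

mvnLogLik <- function(mu, Sigma, X) {
  N <- nrow(X); V <- ncol(X)
  dev  <- sweep(X, 2, mu)                                # rows are (x_p - mu)^T
  quad <- rowSums((dev %*% solve(Sigma)) * dev)          # quadratic form for each person
  -N * V / 2 * log(2 * pi) - N / 2 * log(det(Sigma)) - sum(quad) / 2
}

mvnLogLik(mu = c(0, 0), Sigma = Sigma, X = X)
sum(dmvnorm(X, mean = c(0, 0), sigma = Sigma, log = TRUE))   # identical value, as a check
```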
\[D = \Delta \left(-2 \log \left( L \right) \right) = -2 \times \left( \log \left( L_{H_0}\right) - \log \left( L_{H_A}\right) \right) \]
\[W = \frac{\hat{\theta} - \theta_0}{SE\left(\hat{\theta}\right)}\]
Information criteria are used to compare models
As an example, the Akaike Information Criterion (AIC) is defined as follows, where \(p\) is the number of estimated parameters:
\[AIC = -2 \log \left( L \right) + 2p\]
The model with the lowest value is preferred
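For example, under hypothetical values of \(\log \left( L \right) = -124.9385\) and \(p = 2\):
\[AIC = -2\left(-124.9385\right) + 2\left(2\right) = 253.877\]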
\[ LR_{RS} = \frac{-2 \left( \log \left( L \right)_{restricted} - \log \left( L \right)_{full} \right)}{c_{LR}}\]
\[c_{LR} = \left| \frac{\left(q_{restricted} \right) \left(c_{restricted} \right)-\left(q_{full} \right) \left(c_{full} \right)}{\left(q_{restricted} - q_{full} \right)} \right| \]
At the core of both Bayesian and ML-based methods is the likelihood function
\[f\left(\beta_0, \sigma^2_e \mid \boldsymbol{y}_p \right) = \frac{f\left( \boldsymbol{y}_p \mid \beta_0, \sigma^2_e \right) f \left(\beta_0 \right)f \left(\sigma^2_e \right)}{f \left(\boldsymbol{y}_p \right)} \]
\[\beta_0^* \sim Q \left( \beta_0^* \mid \beta_0 \right); \sigma_e^{2^*} \sim Q \left( \sigma_e^{2^*} \mid \sigma_e^{2} \right)\]
\[r_{MHG} = \frac{f\left( \boldsymbol{y}_p \mid \beta_0^*, \sigma^{2^*}_e \right) f \left(\beta_0^* \right)f \left(\sigma^{2^*}_e \right)Q \left( \beta_0 \mid \beta_0^* \right)Q \left( \sigma_e^{2} \mid \sigma_e^{2^*} \right)}{f\left( \boldsymbol{y}_p \mid \beta_0, \sigma^2_e \right) f \left(\beta_0 \right)f \left(\sigma^2_e \right)Q \left( \beta_0^* \mid \beta_0 \right)Q \left( \sigma_e^{2^*} \mid \sigma_e^{2} \right)} \]
\[f\left(\beta_0, \sigma^2_e \mid \boldsymbol{y}_p \right) = \frac{f\left( \boldsymbol{y}_p \mid \beta_0, \sigma^2_e \right) f \left(\beta_0 \right)f \left(\sigma^2_e \right)}{f \left(\boldsymbol{y}_p \right)} \]
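As a minimal sketch of these ideas (not the lecture's implementation), a random-walk Metropolis-Hastings sampler in R for an intercept-only model \(y_p = \beta_0 + e_p\), with hypothetical priors \(\beta_0 \sim N(0, 100^2)\) and \(\sigma^2_e \sim \text{Uniform}(0, 10000)\); because the proposals are symmetric, the \(Q\) terms cancel in the acceptance ratio:

```r
# Random-walk Metropolis-Hastings for an intercept-only model y_p = beta0 + e_p
set.seed(4)
y <- rnorm(100, mean = 10, sd = 2)            # hypothetical data

logPost <- function(beta0, sigma2) {          # log of the (unnormalized) posterior
  if (sigma2 <= 0) return(-Inf)               # outside the support of the prior
  sum(dnorm(y, mean = beta0, sd = sqrt(sigma2), log = TRUE)) +   # log likelihood
    dnorm(beta0, mean = 0, sd = 100, log = TRUE) +               # log prior for beta0
    dunif(sigma2, min = 0, max = 1e4, log = TRUE)                # log prior for sigma2_e
}

nIter <- 5000
draws <- matrix(NA, nrow = nIter, ncol = 2,
                dimnames = list(NULL, c("beta0", "sigma2")))
beta0 <- 0; sigma2 <- 1                       # starting values
for (t in 1:nIter) {
  beta0Star  <- rnorm(1, mean = beta0,  sd = 0.5)   # symmetric random-walk proposals
  sigma2Star <- rnorm(1, mean = sigma2, sd = 0.5)
  logr <- logPost(beta0Star, sigma2Star) - logPost(beta0, sigma2)
  if (log(runif(1)) < logr) {                 # accept with probability min(1, r)
    beta0 <- beta0Star; sigma2 <- sigma2Star
  }
  draws[t, ] <- c(beta0, sigma2)
}
colMeans(draws[-(1:1000), ])                  # posterior means after discarding burn-in
```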
With the basics of estimation methods covered, we now turn our focus to the estimation of multidimensional measurement models
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.
Perhaps the most classic example of estimation via MML is that for CFA models.
\[\boldsymbol{Y}_p \sim N \left(\boldsymbol{\mu} , \boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi} \right)\]
Joreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
More specifically, the log likelihood of the parameters given the full sample of data is:
\[\log L\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \right) = -\frac{NV}{2} \log \left( 2\pi \right) - \frac{N}{2} \log \left( \left| \boldsymbol{\Sigma} \right| \right) - \sum_{p=1}^N \frac{\left( \boldsymbol{Y}_p^T - \boldsymbol{\mu }\right)^T \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{Y}_p^T - \boldsymbol{\mu }\right)}{2} \]
For the CFA model, this becomes:
\[\log L\left(\boldsymbol{\mu}, \boldsymbol{\Lambda}_{\boldsymbol{Q}}, \boldsymbol{\Phi}, \boldsymbol{\Psi} \right) = -\frac{NV}{2} \log \left( 2\pi \right) - \frac{N}{2} \log \left( \left| \boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi} \right| \right) - \sum_{p=1}^N \frac{\left( \boldsymbol{Y}_p^T - \boldsymbol{\mu }\right)^T \left(\boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi} \right)^{-1}\left( \boldsymbol{Y}_p^T - \boldsymbol{\mu }\right)}{2} \]
This full sample log likelihood is built by summing the per-observation log likelihoods, as in the expression above
As shown in Kaplan (2009; Chapter 2), the CFA model likelihood function can be written as:
\[\log L\left(\boldsymbol{\mu}, \boldsymbol{\Lambda}_{\boldsymbol{Q}}, \boldsymbol{\Phi}, \boldsymbol{\Psi} \right) = -\frac{N}{2} \left[ \log \left| \boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi} \right| + \text{tr}\left(\left( \boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi}\right)^{-1} S \right)\right] \]
Often, this gets rephrased from a likelihood function to a loss function (which is minimized):
\[F_{ML} = \log \left| \boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi} \right| + \text{tr} \left[\left( \boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi}\right)^{-1} S \right] - \log \left| S \right| - V\]
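As a minimal sketch (hypothetical loadings, unique variances, and sample covariance matrix; not the lecture's data), \(F_{ML}\) can be evaluated directly in R for a one-factor model:

```r
# Evaluate the ML fit function for a one-factor CFA with hypothetical values
computeFML <- function(lambda, phi, psi, S) {
  SigmaModel <- lambda %*% phi %*% t(lambda) + diag(psi)   # model-implied covariance matrix
  V <- nrow(S)
  log(det(SigmaModel)) + sum(diag(solve(SigmaModel) %*% S)) - log(det(S)) - V
}

# Hypothetical sample covariance (here, correlation) matrix S for V = 3 items
S <- matrix(c(1.00, 0.45, 0.40,
              0.45, 1.00, 0.48,
              0.40, 0.48, 1.00), nrow = 3)

lambda <- matrix(c(0.7, 0.65, 0.6), ncol = 1)   # Lambda_Q (3 x 1 loading matrix)
phi    <- matrix(1, nrow = 1, ncol = 1)         # Phi (factor variance fixed to 1)
psi    <- c(0.51, 0.58, 0.64)                   # diagonal of Psi (unique variances)

computeFML(lambda, phi, psi, S)   # smaller values indicate better fit; 0 means perfect fit
```

Minimizing this function over the free parameters (e.g., with optim) yields the ML estimates.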
Under MVN data, the limited information likelihood function is equivalent to the full information likelihood function when there is no missing data
Estimation of IRT models with MML is more difficult than CFA
\[ P \left( \boldsymbol{Y}_{pi} = \boldsymbol{y}_{pi} \right) = \int_{\boldsymbol{\theta}_p} P \left( \boldsymbol{Y}_{pi} = \boldsymbol{y}_{pi} \mid \boldsymbol{\theta}_p \right) P \left( \boldsymbol{\theta}_p \right) d \boldsymbol{\theta}_p \]
\[ L\left( \boldsymbol{\Lambda}_{\boldsymbol{Q}}, \boldsymbol{\Phi}, \boldsymbol{\Psi} \right) = \prod_{p=1}^N \int_{\boldsymbol{\theta}_p} \left[ \prod_{i=1}^I P \left( Y_{pi} = y_{pi} \mid \boldsymbol{\theta}_p \right) \right] P \left( \boldsymbol{\theta}_p \right) d \boldsymbol{\theta}_p \]
Integrals are evaluated numerically (i.e., by computer) using an approximation; here, the midpoint quadrature rule:
\[\int_A^B f(x)dx \approx \sum_{r=1}^R f(m_r) \left(\frac{B-A}{R} \right) \]
Where:
\(R\) is the number of equal-width subintervals (rectangles) into which \([A, B]\) is divided
\(m_r\) is the midpoint of the \(r\)th subinterval, each of width \(\frac{B-A}{R}\)
It can be shown that
\[\lim_{R\rightarrow \infty} \sum_{r=1}^R f(m_r) \left(\frac{B-A}{R} \right) =\int_A^B f(x)dx \]
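As a minimal sketch (with hypothetical item parameters, not the lecture's example), the midpoint rule can be used in R to approximate the marginal response probability for a single binary item:

```r
# Midpoint-rule approximation of P(Y = 1) = integral of P(Y = 1 | theta) f(theta) d theta
A <- -6; B <- 6; R <- 100                     # integration range and number of rectangles
width <- (B - A) / R
m <- A + width * (1:R) - width / 2            # midpoints of the R subintervals

a <- 1.2; b <- -0.5                           # hypothetical discrimination and difficulty
pCorrect <- function(theta) plogis(a * (theta - b))   # 2PL item response function

sum(pCorrect(m) * dnorm(m) * width)           # quadrature approximation of the marginal probability
integrate(function(th) pCorrect(th) * dnorm(th), -Inf, Inf)$value   # adaptive-quadrature check
```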
In IRT (or whenever an observed variable is not normally distributed), we must use this marginalized likelihood for full information ML estimation
First applied to IRT by Bock and Aitkin (1981), the Expectation-Maximization (E-M) algorithm was built to reduce the computational burden of MML estimation
Still requires integration
The E-M algorithm is an iterative algorithm that uses the observed data to estimate the latent variables
The algorithm begins with initial values for the model parameters
The algorithm then iterates between two steps until convergence is reached: an E-step, in which the expected values of the latent variables are computed given the data and the current parameter estimates, and an M-step, in which the likelihood is maximized with respect to the model parameters, treating those expected values as known
The algorithm is guaranteed not to decrease the likelihood at any iteration, and it converges to a local (not necessarily global) maximum of the likelihood function
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
In the E-step, the expected values of the latent variables are computed from their posterior distribution, given the data and the current parameter estimates:
\[ f \left( \boldsymbol{\theta}_p \mid \boldsymbol{Y}_p \right) = \frac{f \left(\boldsymbol{Y}_p \mid \boldsymbol{\theta}_p \right)f\left( \boldsymbol{\theta}_p \right)}{f \left(\boldsymbol{Y}_p \right)}\]
\[E\left(\boldsymbol{\theta}_p \mid \boldsymbol{Y}_p \right) = \int_{\boldsymbol{\theta}_p} \boldsymbol{\theta}_p \left(\frac{f \left(\boldsymbol{Y}_p \mid \boldsymbol{\theta}_p \right)f\left( \boldsymbol{\theta}_p \right)}{f \left(\boldsymbol{Y}_p \right)} \right) d\boldsymbol{\theta}_p\]
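As a minimal sketch (hypothetical 2PL item parameters and a single hypothetical response pattern), this posterior expectation can be approximated on a quadrature grid in R:

```r
# Posterior expected value of theta_p for one response pattern, via a quadrature grid
a <- c(1.0, 1.5, 0.8)       # hypothetical discriminations
b <- c(-1.0, 0.0, 1.0)      # hypothetical difficulties
y <- c(1, 1, 0)             # observed responses for person p

theta <- seq(-6, 6, length.out = 201)     # equally spaced quadrature grid
lik <- rep(1, length(theta))
for (i in seq_along(y)) {                 # f(Y_p | theta_p) under local independence
  p_i <- plogis(a[i] * (theta - b[i]))
  lik <- lik * p_i^y[i] * (1 - p_i)^(1 - y[i])
}
post <- lik * dnorm(theta)                # proportional to f(theta_p | Y_p)
post <- post / sum(post)                  # normalize (grid widths cancel with equal spacing)
sum(theta * post)                         # E(theta_p | Y_p): the EAP estimate
```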
The E-M algorithm is also computationally intensive for models with more than just a few latent variables, so some shortcuts have been developed:
Monte Carlo integration is a method of numerical integration that uses random sampling to approximate the integral
\[\int_\Omega f\left(\boldsymbol{x}\right)d\boldsymbol{x} \approx V \frac{1}{N} \sum_{i=1}^N f(\boldsymbol{x}_i) \]
Where:
\(\Omega\) is the domain of integration and \(V\) is its volume (its length, for a one-dimensional interval)
\(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\) are points sampled uniformly at random from \(\Omega\)
Monte Carlo integration does not give exact results (and needs the same random number seed to be replicable)
Implications for model comparisons with likelihoods (i.e., likelihood ratio tests)
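A minimal sketch of the idea in R (integrating the standard normal density over a hypothetical domain \(\Omega = [-6, 6]\), where the true value is essentially 1):

```r
# Monte Carlo integration of the standard normal density over Omega = [-6, 6]
set.seed(5)
N <- 100000
x <- runif(N, min = -6, max = 6)     # uniform draws over Omega
V <- 6 - (-6)                        # "volume" (length) of Omega
V * mean(dnorm(x))                   # approximate integral; the result varies with the seed
```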
The Metropolis-Hastings Robbins-Monro (MH-RM) algorithm is a hybrid of the E-M algorithm and MCMC estimation: Metropolis-Hastings sampling replaces the numerical integration of the E-step, and Robbins-Monro stochastic approximation is used for the parameter updates
In cases where observed variables are categorical (either binary or polytomous), limited information estimation can be used
\[Y_{pi} = 1 \text{ if } \tilde{Y}_{pi} > \tau_i; \quad Y_{pi} = 0 \text{ otherwise}\]
The tetrachoric correlation is a measure of the association between two binary variables (say \(Y_1\) and \(Y_2\)).
The correlation comes from mapping the binary variables \((Y_1 , Y_2)\) onto two “underlying” continuous variables \((\tilde{Y}_1, \tilde{Y}_2)\).
Each of the continuous variables is bisected by a threshold (\(\tau_1\) and \(\tau_2\)) which transforms the continuous response into a categorical outcome.
The distribution of \((\tilde{Y}_1,\tilde{Y}_2)\) is \(N\left(\boldsymbol{\mu}= \mathbf{0}, \boldsymbol{\Xi}=\left[ \begin{array}{cc} 1 & \rho\\ \rho & 1 \end{array}\right] \right)\)
The polychoric correlation is a similar statistic when \(Y\) has more than two categories
For the margins:
\(P(Y_1=0) = P(\tilde{Y}_1<\tau_1)=\)
\[ \int_{-\infty}^{\tau_1} \int_{-\infty}^{\infty} \frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(- \frac{\tilde{Y}_1^2 - 2 \rho \tilde{Y}_1 \tilde{Y}_2 + \tilde{Y}_2^2}{2\left(1-\rho^2\right)} \right) d\tilde{Y}_2 d\tilde{Y}_1\]
\(P(Y_1=0, Y_2=0) = P(\tilde{Y}_1<\tau_1, \tilde{Y}_2<\tau_2)=\)
\[ \int_{-\infty}^{\tau_1} \int_{-\infty}^{\tau_2} \frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(- \frac{\tilde{Y}_1^2 - 2 \rho \tilde{Y}_1 \tilde{Y}_2 + \tilde{Y}_2^2}{2\left(1-\rho^2\right)} \right) d\tilde{Y}_2 d\tilde{Y}_1\]
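These bivariate normal probabilities are easy to evaluate numerically; a minimal sketch in R (hypothetical thresholds and correlation, assuming the mvtnorm package is available):

```r
# Bivariate normal probabilities underlying the tetrachoric correlation
library(mvtnorm)                      # assumed available: provides pmvnorm()
tau1 <- 0.2; tau2 <- -0.3; rho <- 0.4
Xi <- matrix(c(1, rho, rho, 1), nrow = 2)

pnorm(tau1)                           # P(Y1 = 0) = P(Ytilde1 < tau1), the margin
pmvnorm(lower = c(-Inf, -Inf),        # P(Y1 = 0, Y2 = 0), the joint cell probability
        upper = c(tau1, tau2),
        mean  = c(0, 0), corr = Xi)
```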
In limited information estimation, the model is fit to the polychoric (or tetrachoric) correlation matrix; the model-implied correlation matrix is a function of the model parameters
\[\boldsymbol{\rho} = f\left(\boldsymbol{\Lambda}_{\boldsymbol{Q}} \boldsymbol{\Phi} \boldsymbol{\Lambda}_{\boldsymbol{Q}}^T + \boldsymbol{\Psi} \right)\]
In lavaan and Mplus, this method goes by the name WLSMV (mean- and variance-adjusted weighted least squares); a brief lavaan sketch follows this list
Works much faster than full information ML when you have small samples or many factors to estimate (because no numerical integration is required)
Does assume missing data are missing completely at random, whereas full information assumes only missing at random (conditionally random)
Model coefficients will be on the probit scale instead of logit scale
Model comparisons use rescaled \(\chi^2\) values (instead of likelihood ratio tests)
Estimation of factor scores becomes an issue
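As a minimal sketch of a WLSMV analysis in lavaan (using lavaan's built-in HolzingerSwineford1939 data, with three items artificially dichotomized here purely for illustration; the model and variable choices are not from the lecture):

```r
# Limited information (WLSMV) estimation of a one-factor model in lavaan
library(lavaan)

dat <- HolzingerSwineford1939                      # built-in lavaan example data
dat$x1c <- as.numeric(dat$x1 > median(dat$x1))     # artificially dichotomized items
dat$x2c <- as.numeric(dat$x2 > median(dat$x2))
dat$x3c <- as.numeric(dat$x3 > median(dat$x3))

model <- 'visual =~ x1c + x2c + x3c'

fit <- cfa(model, data = dat,
           ordered   = c("x1c", "x2c", "x3c"),     # declare items as ordinal (probit link)
           estimator = "WLSMV",                    # diagonally weighted LS with robust corrections
           std.lv    = TRUE)                       # identify the model via the factor variance
summary(fit, fit.measures = TRUE)
```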
Instead of MML, we can also use Bayesian estimation methods
See my course on the topic https://jonathantemplin.com/bayesian-psychometric-modeling-fall-2022/
Today’s lecture was all about estimation
Multidimensional Measurement Models (Fall 2023): Lecture 6