Generalized Measurement Models: Modeling Observed Data
Lecture 4b
Today’s Lecture Objectives
Show different modeling specifications for different types of item response data
Show how parameterization differs for standardized latent variables vs. marker item scale identification
Example Data: Conspiracy Theories
Today’s example is from a bootstrap resample of 177 undergraduate students at a large state university in the Midwest. The survey consisted of 10 questions about their beliefs in various conspiracy theories that were being passed around the internet in the early 2010s. Additionally, gender was included in the survey. All item responses were on a 5-point Likert scale with:
Strongly Disagree
Disagree
Neither Agree or Disagree
Agree
Strongly Agree
Please note, the purpose of this survey was to study individual beliefs regarding conspiracies. The questions can provoke some strong emotions given the world we live in currently. All questions were approved by university IRB prior to their use.
Our purpose in using this instrument is to provide a context that we all may find relevant as many of these conspiracy theories are still prevalent today.
Conspiracy Theory Questions 1-5
Questions:
The U.S. invasion of Iraq was not part of a campaign to fight terrorism, but was driven by oil companies and Jews in the U.S. and Israel.
Certain U.S. government officials planned the attacks of September 11, 2001 because they wanted the United States to go to war in the Middle East.
President Barack Obama was not really born in the United States and does not have an authentic Hawaiian birth certificate.
The current financial crisis was secretly orchestrated by a small group of Wall Street bankers to extend the power of the Federal Reserve and further their control of the world’s economy.
Vapor trails left by aircraft are actually chemical agents deliberately sprayed in a clandestine program directed by government officials.
Conspiracy Theory Questions 6-10
Questions:
Billionaire George Soros is behind a hidden plot to destabilize the American government, take control of the media, and put the world under his control.
The U.S. government is mandating the switch to compact fluorescent light bulbs because such lights make people more obedient and easier to control.
Government officials are covertly building a 12-lane "NAFTA superhighway" that runs from Mexico to Canada through America’s heartland.
Government officials purposely developed and spread drugs like crack-cocaine and diseases like AIDS in order to destroy the African American community.
God sent Hurricane Katrina to punish America for its sins.
Data Visualization: Q1-Q5
Data Visualization: Q6-Q10
Conspiracy Theories: Assumed Latent Variable
For today’s lecture, we will assume each of the 10 items measures a single latent variable representing a person’s tendency to believe in conspiracy theories
We will denote this latent variable as \(\theta_p\) for each person
\(p\) is the index for person (with \(p=1, \ldots, P\))
Modeling Observed Variables with Normal Distributions
Observed Variables with Normal Distributions
A psychometric model posits that one or more hypothesized latent variables predict a person’s response to observed items
Our hypothesized latent variable: Tendency to Believe in Conspiracies (\(\theta_p\))
One variable: Unidimensional
Each observed variable (item response) is included in the model
Today, we will assume each response follows a normal distribution
This is the assumption underlying confirmatory factor analysis (CFA) models
This assumption is tenuous at best
Normal Distribution: Linear Regression
As we saw in linear models, when an outcome variable (here \(Y_p\)) is assumed to follow a (conditional) normal distribution, this places a linear regression-style model on the outcome:
For example, take the following linear regression: \[Y_p = \beta_0 + \beta_1 X_p + e_p,\] with \(e_p \sim N\left(0, \sigma^2_e \right)\)
The measurement model has the same form for each item \(i\), with the latent variable \(\theta_p\) in place of the observed predictor \(X_p\):
\[
Y_{pi} = \mu_i + \lambda_i \theta_p + e_{pi}; \quad e_{pi} \sim N\left(0, \psi_i^2 \right)
\]
The parameters of the model use different notation from typical linear regression models and have different names (they are called item parameters)
\(\mu_i\): Item intercept
The expected score on the item when \(\theta_p = 0\)
Similar to \(\beta_0\)
\(\lambda_i\): Factor loading or item discrimination
The change in the expected score of an item for a one-unit increase in \(\theta_p\)
Similar to \(\beta_1\)
\(\psi^2_i\): Unique variance (Note: In Stan, we will have to specify \(\psi_i\); the unique standard deviation)
The variance of the residuals (the observed score minus the expected score)
Similar to residual variance \(\sigma^2_e\)
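To make the roles of the three item parameters concrete, here is a minimal R sketch that simulates responses from the model. The parameter values, the sample size, and the choice of a standardized \(\theta_p\) are assumptions made up for illustration; none of them come from the conspiracy data.

```r
# Simulate Y_pi = mu_i + lambda_i * theta_p + e_pi for three hypothetical items.
# All numeric values are illustrative assumptions, not estimates.
set.seed(1)
nObs   = 5000                 # number of simulated people
mu     = c(2.0, 2.5, 3.0)     # item intercepts
lambda = c(0.8, 0.5, 1.0)     # factor loadings
psi    = c(0.7, 0.9, 0.6)     # unique standard deviations
nItems = length(mu)

theta = rnorm(nObs, mean = 0, sd = 1)   # latent variable (assumed standardized)

# expected score for each person (rows) on each item (columns)
expected = matrix(mu, nrow = nObs, ncol = nItems, byrow = TRUE) + outer(theta, lambda)

# residuals: one column per item, each with its own unique standard deviation
residuals = sapply(psi, function(s) rnorm(nObs, mean = 0, sd = s))

Y = expected + residuals

# with theta known, regressing an item on theta approximately recovers
# that item's intercept and loading
item1Fit = coef(lm(Y[, 1] ~ theta))
```

In a real analysis \(\theta_p\) is not observed, which is why the model has to be estimated with all person and item parameters unknown.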
Model Specification
The set of equations on the previous slide formed step #1 of the Measurement Model Analysis Steps:
Specify Model
The next step is:
Specify scale identification method for latent variables
We will initially assume \(\theta_p \sim N(0,1)\), which allows us to estimate all item parameters of the model
This is what we call a standardized latent variable
They are like Z-scores
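One way to see why the standardized latent variable identifies the item parameters is to write out the moments the model implies for the observed items. The following is a short derivation sketch, assuming \(\theta_p \sim N(0,1)\), residuals independent of \(\theta_p\), and residuals independent across items:

\[
\begin{aligned}
E\left(Y_{pi}\right) &= \mu_i + \lambda_i E\left(\theta_p\right) = \mu_i \\
\operatorname{Var}\left(Y_{pi}\right) &= \lambda_i^2 \operatorname{Var}\left(\theta_p\right) + \psi_i^2 = \lambda_i^2 + \psi_i^2 \\
\operatorname{Cov}\left(Y_{pi}, Y_{pj}\right) &= \lambda_i \lambda_j \operatorname{Var}\left(\theta_p\right) = \lambda_i \lambda_j \quad (i \neq j)
\end{aligned}
\]

Because the mean and variance of \(\theta_p\) are fixed, every \(\mu_i\), \(\lambda_i\), and \(\psi_i\) can be estimated. Note that only products \(\lambda_i \lambda_j\) and squares \(\lambda_i^2\) appear, so the sign of the loadings as a set is not determined by these moments.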
Implementing Normal Outcomes in Stan
There are a few changes needed to make Stan estimate psychometric models with normal outcomes:
The model (predictor) matrix cannot be used
This is because the latent variable will be sampled, so the model matrix cannot be formed as a constant
The data will be imported as a matrix
More than one outcome means more than one column vector of data
The parameters will be specified as vectors of each type
Each item will have its own set of parameters
Implications for the use of prior distributions
Stan’s model Block
model {
  lambda ~ multi_normal(meanLambda, covLambda); // Prior for item discrimination/factor loadings
  mu ~ multi_normal(meanMu, covMu);             // Prior for item intercepts
  psi ~ exponential(psiRate);                   // Prior for unique standard deviations
  theta ~ normal(0, 1);                         // Prior for latent variable (with mean/sd specified)

  for (item in 1:nItems){
    Y[,item] ~ normal(mu[item] + lambda[item]*theta, psi[item]);
  }
}
The loop here conducts the model, separately, for each item
Assumption of conditional independence enables this
Non-independence would need multivariate normal model
The item mean is set by the conditional mean of the model
The item SD is set by the unique standard deviation parameter (psi)
The loop puts each item’s parameters into the equation
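As a rough illustration of what the loop accumulates, here is an R sketch of the same data log-likelihood computed item by item and then in a single vectorized call. The inputs are small made-up values (assumptions for illustration), not the lecture data, and R's dnorm stands in for Stan's normal density.

```r
# Item-by-item log-likelihood under conditional independence.
# All values are illustrative assumptions.
set.seed(2)
nObs   = 4
nItems = 3
theta  = rnorm(nObs)
mu     = c(2.0, 2.5, 3.0)
lambda = c(0.8, 0.5, 1.0)
psi    = c(0.7, 0.9, 0.6)
Y = sapply(1:nItems, function(i) rnorm(nObs, mu[i] + lambda[i]*theta, psi[i]))

# loop version: add each item's contribution separately
logLik = 0
for (item in 1:nItems){
  logLik = logLik + sum(dnorm(Y[, item],
                              mean = mu[item] + lambda[item]*theta,
                              sd = psi[item], log = TRUE))
}

# vectorized version: same quantity from full matrices of means and SDs
condMean  = matrix(mu, nObs, nItems, byrow = TRUE) + outer(theta, lambda)
condSD    = matrix(psi, nObs, nItems, byrow = TRUE)
logLikAll = sum(dnorm(Y, mean = condMean, sd = condSD, log = TRUE))
```

The two totals agree because, given \(\theta_p\), the items are independent, so the joint log-likelihood is the sum of the per-item log-likelihoods.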
Stan’s parameters {} Block
parameters {
  vector[nObs] theta;                // the latent variables (one for each person)
  vector[nItems] mu;                 // the item intercepts (one for each item)
  vector[nItems] lambda;             // the factor loadings/item discriminations (one for each item)
  vector<lower=0>[nItems] psi;       // the unique standard deviations (one for each item)
}
Here, the parameterization of \(\lambda\) (factor loadings/discrimination parameters) can lead to problems in estimation
The issue: \(\lambda_i \theta_p = (-\lambda_i)(-\theta_p)\)
Depending on the random starting values of each of these parameters (per chain), a given chain may converge to a different region
To demonstrate (later), we will start with a different random number seed
Currently using 09102022: works fine
Change to 25102022: big problems
Our fix will be to set starting values for all \(\lambda_i\) and \(\theta_p\)
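The sign issue can be checked numerically. The R sketch below uses a single hypothetical item with made-up values (assumptions for illustration) and evaluates the log-likelihood plus the \(N(0,1)\) log-prior for \(\theta_p\) at \((\lambda, \theta)\) and at the reflected point \((-\lambda, -\theta)\).

```r
# Reflection check: (lambda, theta) and (-lambda, -theta) give the same value.
# One hypothetical item; all values are illustrative assumptions.
set.seed(3)
nObs   = 50
theta  = rnorm(nObs)
mu     = 2.5
lambda = 0.8
psi    = 0.7
y = rnorm(nObs, mean = mu + lambda*theta, sd = psi)

# log-likelihood of the item plus the standard normal log-prior for theta
logDensity = function(lambda, theta) {
  sum(dnorm(y, mean = mu + lambda*theta, sd = psi, log = TRUE)) +
    sum(dnorm(theta, mean = 0, sd = 1, log = TRUE))
}

original  = logDensity(lambda, theta)
reflected = logDensity(-lambda, -theta)
```

The two values are equal, so the data cannot distinguish the two solutions. If the prior for \(\lambda\) is also symmetric around zero, the posterior has two mirror-image modes, and a chain can settle in either one.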
Stan’s data {} Block
data {
  int<lower=0> nObs;                 // number of observations
  int<lower=0> nItems;               // number of items
  matrix[nObs, nItems] Y;            // item responses in a matrix

  vector[nItems] meanMu;             // prior mean vector for item intercepts
  matrix[nItems, nItems] covMu;      // prior covariance matrix for item intercepts

  vector[nItems] meanLambda;         // prior mean vector for factor loadings
  matrix[nItems, nItems] covLambda;  // prior covariance matrix for factor loadings

  vector[nItems] psiRate;            // prior rate parameter for unique standard deviations
}
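On the R side, the data block is filled by a named list whose element names match the Stan declarations. The sketch below shows one possible way to build it. The response matrix is a simulated stand-in for conspiracyItems, and the hyperparameter values (zero means, variance of 1000, rate of 0.01) are assumptions chosen only to illustrate diffuse priors, not necessarily the values used in the lecture.

```r
# Stand-in for the lecture's conspiracyItems data (simulated, NOT the real responses)
set.seed(4)
conspiracyItems = matrix(sample(1:5, size = 177*10, replace = TRUE),
                         nrow = 177, ncol = 10)

nObs   = nrow(conspiracyItems)
nItems = ncol(conspiracyItems)

# hyperparameter values below are illustrative assumptions
modelCFA_data = list(
  nObs       = nObs,
  nItems     = nItems,
  Y          = conspiracyItems,
  meanMu     = rep(0, nItems),        # prior means for item intercepts
  covMu      = diag(1000, nItems),    # diagonal prior covariance for item intercepts
  meanLambda = rep(0, nItems),        # prior means for factor loadings
  covLambda  = diag(1000, nItems),    # diagonal prior covariance for factor loadings
  psiRate    = rep(0.01, nItems)      # exponential rates for unique SDs
)
```

Each vector has length nItems and each covariance matrix is nItems by nItems, matching the sizes declared in the data block.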
Choosing Prior Distributions for Parameters
There is not uniform agreement about the choices of prior distributions for item parameters
We will use uninformative priors on each to begin
After first model analysis, we will discuss these choices and why they were made
The init option is used to set starting values for the parameters
This is especially important for the item discrimination parameters \(\lambda_i\) and the latent variables \(\theta_p\)
The init function used in this lecture, function() list(lambda=rnorm(nItems, mean=10, sd=2)), tells Stan to sample starting values for \(\lambda_i\) from a normal distribution with mean 10 and standard deviation 2
This ensures starting values for \(\lambda_i\) will most likely be positive
For \(\theta_p\), we will need more work…
Initializing Latent Variables
As we expect the latent variable to be highly related to the sum score, we can use the sum score to help us initialize \(\theta_p\) so we end up in the correct mode of the posterior distribution
We will use the standardized sum score as a starting value for \(\theta_p\) (starting value denoted \(\theta_p^*\))
# create standardized sum scores for latent variable initialization values
sumScores = rowSums(conspiracyItems)
initTheta = (sumScores - mean(sumScores))/sd(sumScores)
We will use the standardized sum score as the starting value for \(\theta_p\) in the init option
In my example, I set the standard deviation of the normal distribution to zero (meaning the function just returns the mean) – you can also choose to make this a small value allowing for some variability
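A minimal sketch of an init function that combines both ideas is below. The name initFunction is hypothetical, the response matrix is a simulated stand-in for conspiracyItems, and the exact form used in the lecture's R syntax may differ.

```r
# Stand-in for the lecture's conspiracyItems data (simulated, NOT the real responses)
set.seed(5)
conspiracyItems = matrix(sample(1:5, size = 177*10, replace = TRUE),
                         nrow = 177, ncol = 10)
nObs   = nrow(conspiracyItems)
nItems = ncol(conspiracyItems)

# standardized sum scores as starting values for theta
sumScores = rowSums(conspiracyItems)
initTheta = (sumScores - mean(sumScores))/sd(sumScores)

# hypothetical init function: returns a named list of starting values
initFunction = function() list(
  lambda = rnorm(nItems, mean = 10, sd = 2),       # most likely positive starts
  theta  = rnorm(nObs, mean = initTheta, sd = 0)   # sd = 0 returns the mean exactly
)

startingValues = initFunction()
```

With sd = 0, rnorm returns initTheta unchanged, so every chain starts \(\theta_p\) at the standardized sum score. A function of this form can be passed as the init argument of $sample().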
The posterior distribution of the person parameters (the latent variable; for a single person): \[
f(\theta_p \mid \boldsymbol{Y}) \propto f\left(\boldsymbol{Y} \mid \theta_p \right)
f\left(\theta_p \right)
\]
Here:
\(f(\theta_p \mid \boldsymbol{Y})\) is the posterior distribution of the latent variable conditional on the observed data
\(f\left(\boldsymbol{Y} \mid \theta_p \right)\) is the model (data) likelihood
\(f\left(\theta_p \right)\) is the prior distribution of the latent variable
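For the normal-outcome model, this posterior can be written out explicitly. The following is a sketch that treats the item parameters as known (in the full model they are sampled along with \(\theta_p\), so this is the conditional posterior of \(\theta_p\) given the item parameters). Here \(I\) denotes the number of items, a symbol introduced only for this illustration, and \(\boldsymbol{Y}\) holds the responses of person \(p\).

\[
f(\theta_p \mid \boldsymbol{Y}) \propto \left[ \prod_{i=1}^{I} N\left(Y_{pi} \mid \mu_i + \lambda_i \theta_p, \psi_i^2 \right) \right] N\left(\theta_p \mid 0, 1 \right)
\]

Because both pieces are normal in \(\theta_p\), the result is again normal:

\[
\theta_p \mid \boldsymbol{Y} \sim N\left(m_p, v\right), \qquad
v = \left(1 + \sum_{i=1}^{I} \frac{\lambda_i^2}{\psi_i^2} \right)^{-1}, \qquad
m_p = v \sum_{i=1}^{I} \frac{\lambda_i \left(Y_{pi} - \mu_i\right)}{\psi_i^2}
\]

Flipping the sign of every \(\lambda_i\) flips the sign of \(m_p\), which is the same reflection seen in \(\lambda_i \theta_p = (-\lambda_i)(-\theta_p)\).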
New Samples Syntax
Trying the same model with a different random number seed:
Stan allows starting values to be set via cmdstanr
Documentation is very lacking, but with some trial and a lot of error, I will show you how
Alternatively:
Restrict \(\lambda\) to be positive
This truncates the MVN prior distribution at zero
Can also choose a prior that has a strictly positive range (like log-normal)
Note: The restriction on the space of \(\lambda\) will not permit truly negative values
Not ideal, as negative \(\lambda\) values are informative: they signal a problem with the data
parameters {
  vector[nObs] theta;                // the latent variables (one for each person)
  vector[nItems] mu;                 // the item intercepts (one for each item)
  vector<lower=0>[nItems] lambda;    // the factor loadings/item discriminations (one for each item)
  vector<lower=0>[nItems] psi;       // the unique standard deviations (one for each item)
}
Setting Starting Values in Stan
Starting values (initial values) are the first values used when an MCMC chain starts
In Stan, by default, parameters are randomly started between -2 and 2
Bounded parameters are transformed so they are unbounded in the algorithm
What we need:
Randomly start all \(\lambda\) parameters so that they converge to the \(\lambda_i\theta_p\) mode
As opposed to the \((-\lambda_i)(-\theta_p)\) mode
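To see why the default can land in either mode, here is an R sketch that mimics the default initialization described above. It is only an imitation written in R, not Stan's internal code: values are drawn uniformly between -2 and 2 on the unconstrained scale, and a lower-bounded parameter such as \(\psi_i\) is mapped back with the exponential function.

```r
# Imitation of the default starting values (an illustrative sketch, not Stan internals)
set.seed(6)
nItems = 10

# draws on the unconstrained scale, uniform between -2 and 2
lambdaUnconstrained = runif(nItems, min = -2, max = 2)
psiUnconstrained    = runif(nItems, min = -2, max = 2)

# lambda is unbounded, so its starting value is the draw itself
lambdaStart = lambdaUnconstrained

# psi has lower = 0, so it lives on the log scale; exponentiate to return to psi
psiStart = exp(psiUnconstrained)

# count how many loadings start with a negative sign
nNegativeLambda = sum(lambdaStart < 0)
```

Under this scheme each \(\lambda_i\) is as likely to start negative as positive, which is how a chain can drift toward the \((-\lambda_i)(-\theta_p)\) mode. The \(\psi_i\) starting values are always positive.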
cmdstanr Syntax for Initial Values
Add the init option to the $sample() function of the cmdstanr object:
# set starting values for some of the parameters
modelCFA_samples2fixed = modelCFA_stan$sample(
  data = modelCFA_data,
  seed = 25102022,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 2000,
  iter_sampling = 2000,
  init = function() list(lambda=rnorm(nItems, mean=10, sd=2))
)
The init option can be specified as a function, here, randomly starting each \(\lambda\) following a normal distribution
Initialization Process
See the lecture R syntax for information on how to confirm starting values are set