Latent Class & Latent Profile Analysis
Packages used: poLCA (latent class analysis) and tidyLPA (latent profile analysis)
Latent Class Analysis
Consider a single binary test item, \(X\), with success probability \(\pi\). The Bernoulli likelihood is:
\[f(x_i) = \pi^{x_i} \left(1-\pi\right)^{(1-x_i)}\]
With \(\pi = 0.75\):
If \(X = 1\), the likelihood is: \[f(x_i=1) = (0.75)^{1}(1-0.75)^{0} = 0.75\]
If \(X = 0\), the likelihood is: \[f(x_i=0) = (0.75)^{0}(1-0.75)^{1} = 0.25\]
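The two cases above can be reproduced in a couple of lines of base R (`bern_lik` is an illustrative helper, not a package function):

```r
# Bernoulli likelihood for a single binary item
bern_lik <- function(x, p) p^x * (1 - p)^(1 - x)

bern_lik(1, 0.75)  # 0.75
bern_lik(0, 0.75)  # 0.25
```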
For discrete outcome variables, the likelihood equals the probability of the event occurring. For \(J\) independent binary items, the joint probability of a response pattern is:
\[P(X_1=x_1, X_2=x_2,\ldots,X_J=x_J) = \prod_{j=1}^{J} \pi_j^{x_j} \left(1-\pi_j\right)^{\left(1-x_j\right)}\]
A finite mixture model combines \(G\) class-specific densities \(f(\textbf{X}|g)\) with mixing proportions \(\eta_g\):
\[f(\textbf{X}) = \sum_{g=1}^G \eta_g f(\textbf{X}|g)\]
A latent class model for \(J\) binary indicators (\(j = 1,\ldots,J\)) with \(C\) classes (\(c = 1,\ldots,C\)):
\[f(\mathbf{x}_i) = \displaystyle {\sum_{c=1}^{C} \eta_c} \prod_{j=1}^{J} \pi_{jc}^{x_{ij}} \left(1-\pi_{jc}\right)^{1-x_{ij}}\]
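This density can be evaluated directly for one response pattern. A base-R sketch, with made-up class shares and item probabilities (not the fitted values shown later):

```r
# Density of one binary response pattern x under a C-class LCA
# eta: class shares; pi_mat: J x C matrix of item response probabilities
lca_density <- function(x, eta, pi_mat) {
  per_class <- sapply(seq_along(eta), function(c)
    prod(pi_mat[, c]^x * (1 - pi_mat[, c])^(1 - x)))
  sum(eta * per_class)
}

eta <- c(0.5, 0.5)                        # illustrative class shares
pi_mat <- cbind(c(.75, .78, .43, .71),    # class 1 item probabilities
                c(.21, .07, .02, .05))    # class 2 item probabilities
lca_density(c(1, 1, 1, 1), eta, pi_mat)   # mostly driven by class 1
```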
poLCA Package
The poLCA package (Linzer & Lewis, 2011) fits LCA models.
install.packages("poLCA")
Key arguments to poLCA():

| Argument | Description |
|---|---|
| formula | cbind(item1, item2, ...) ~ 1 |
| data | Data frame; item values must start at 1, not 0 |
| nclass | Number of latent classes to estimate |
| nrep | Number of random starting points (≥ 10 recommended) |
| verbose | FALSE to suppress iteration output |
LCA Example #1
| Variable | Description |
|---|---|
| u1 | Math item 1 (0/1) |
| u2 | Math item 2 (0/1) |
| u3 | Math item 3 (0/1) |
| u4 | Math item 4 (0/1) |
First rows of the recoded data (1/2 coding) and a frequency table for one item:
  u1 u2 u3 u4
1  2  2  2  2
2  2  2  2  2
3  2  2  2  2
4  2  2  2  2
5  2  2  2  2
6  2  2  2  2
 1  2
67 75
Important
poLCA requires each item’s values to be positive integers starting at 1. For binary items originally coded 0/1, add 1 to recode them as 1/2 before fitting.
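A minimal sketch of the recode, using a tiny made-up data frame (the column names simply mirror the example items):

```r
# 0/1 items must become 1/2 before poLCA will accept them
dat <- data.frame(u1 = c(0, 1, 1), u2 = c(1, 0, 1))
dat_recode <- dat + 1   # adds 1 to every column at once
dat_recode               # values are now 1/2
```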
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$u1
Pr(1) Pr(2)
class 1: 0.7914 0.2086
class 2: 0.2466 0.7534
$u2
Pr(1) Pr(2)
class 1: 0.9317 0.0683
class 2: 0.2197 0.7803
$u3
Pr(1) Pr(2)
class 1: 0.9821 0.0179
class 2: 0.5684 0.4316
$u4
Pr(1) Pr(2)
class 1: 0.9477 0.0523
class 2: 0.2925 0.7075
Estimated class population shares
0.4134 0.5866
Predicted class memberships (by modal posterior prob.)
0.4577 0.5423
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 142
number of estimated parameters: 9
residual degrees of freedom: 6
maximum log-likelihood: -331.7637
AIC(2): 681.5273
BIC(2): 708.1298
G^2(2): 8.965682 (Likelihood ratio/deviance statistic)
X^2(2): 9.459244 (Chi-square goodness of fit)
nrep = 10 runs the algorithm 10 times with different random starts; multiple starts reduce the chance of stopping at a local maximum.
Three types of information from an LCA model:
From poLCA output:
Estimated class population shares
0.5866 0.4134
Predicted class memberships (modal posterior prob.)
0.5423 0.4577
Note
“Population shares” come from the model. “Predicted memberships” assign each person to their most probable class — these will differ slightly from the model-estimated shares.
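The distinction is easy to see with a small made-up posterior matrix: column means give model-style shares, while modal assignment first rounds each person to a single class:

```r
# Illustrative posterior probabilities for 4 people, 2 classes
post <- rbind(c(.8, .2), c(.6, .4), c(.3, .7), c(.1, .9))
shares <- colMeans(post)                           # model-style shares
modal  <- tabulate(max.col(post), 2) / nrow(post)  # modal memberships
shares  # 0.45 0.55
modal   # 0.50 0.50 -- slightly different, as the note describes
```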
From poLCA output (Pr(2) = probability of correct response):
Conditional item response probabilities by class
$u1 Pr(1) Pr(2) $u2 Pr(1) Pr(2)
Class 1: 0.247 0.753 Class 1: 0.220 0.780
Class 2: 0.791 0.209 Class 2: 0.932 0.068
$u3 Pr(1) Pr(2) $u4 Pr(1) Pr(2)
Class 1: 0.568 0.432 Class 1: 0.292 0.708
Class 2: 0.982 0.018 Class 2: 0.948 0.052
Pr(2) is the probability of a correct response (original code = 1).

| Item | Class 1 \(\pi_{j1}\) | Class 2 \(\pi_{j2}\) |
|---|---|---|
| u1 | 0.753 | 0.209 |
| u2 | 0.780 | 0.068 |
| u3 | 0.432 | 0.018 |
| u4 | 0.708 | 0.052 |
Assessing Model Fit
\[\chi^2_p = \sum_r \frac{(O_r - E_r)^2}{E_r}\]
For \(J\) binary items, there are \(2^J\) possible response patterns
Limitation: invalid when many cells have small expected frequencies
[1] 9.459244
[1] 8.965682
[1] 0.1493499
[1] 0.1755173
Results for the Macready & Dayton 2-class solution:
Pearson chi-square (df = 6): 9.459 p = 0.149
G-squared (df = 6): 8.966 p = 0.176
Neither test is significant — the 2-class model fits the data adequately.
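The reported p-values follow directly from the chi-square distribution with 6 residual degrees of freedom; the statistics below are taken from the poLCA output above:

```r
# Upper-tail chi-square p-values for the 2-class fit statistics
X2 <- 9.459244; G2 <- 8.965682; df <- 6
p_X2 <- pchisq(X2, df, lower.tail = FALSE)  # ~0.149
p_G2 <- pchisq(G2, df, lower.tail = FALSE)  # ~0.176
```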
\[\log L = \sum_{k=1}^N \log \left( {\sum_{c=1}^{C} \eta_c} \prod_{j=1}^{J} \pi_{jc}^{x_{kj}} \left(1-\pi_{jc}\right)^{1-x_{kj}} \right)\]
\[AIC = 2q - 2\log L \qquad BIC = q\log(N) - 2\log L\]
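Plugging in the 2-class values reported above (q = 9 parameters, N = 142, log L = -331.7637) reproduces the AIC and BIC from the poLCA output:

```r
# AIC = 2q - 2logL, BIC = q*log(N) - 2logL
logL <- -331.7637; q <- 9; N <- 142
aic <- 2 * q - 2 * logL        # 681.53
bic <- q * log(N) - 2 * logL   # 708.13
```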
Classes Parameters LogL AIC BIC
1 1 4 -375 759 770
2 2 9 -332 682 708
3 3 14 -329 687 728
Fit statistics for the Macready & Dayton data:
| Classes | Parameters | Log L | AIC | BIC |
|---|---|---|---|---|
| 1 | 4 | −373.04 | 754.08 | 766.16 |
| 2 | 9 | −331.76 | 681.53 | 708.13 |
| 3 | 14 | −331.49 | 690.97 | 733.10 |
\[E = 1 - \frac{-\displaystyle\sum_{i=1}^N \sum_{c=1}^C \hat{\alpha}_{ic} \log \hat{\alpha}_{ic}}{N \log C}\]
Relative Entropy (2-class): 0.754
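A direct implementation of the entropy formula, applied to illustrative posterior matrices (not the fitted values above):

```r
# Relative entropy: 1 = perfect class separation, 0 = no separation
rel_entropy <- function(alpha) {
  a <- alpha[alpha > 0]   # drop zeros, treating 0 * log(0) as 0
  1 - (-sum(a * log(a))) / (nrow(alpha) * log(ncol(alpha)))
}

rel_entropy(rbind(c(.99, .01), c(.01, .99)))  # near 1: well separated
rel_entropy(matrix(0.5, 4, 2))                # exactly 0: no separation
```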
Latent Profile Analysis
\[f(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{-(x_i - \mu)^2}{2\sigma^2}\right)\]
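This is the density that base R's dnorm() computes; a quick sanity check with arbitrary values:

```r
# Univariate normal density, written out from the formula above
f_norm <- function(x, mu, s2) exp(-(x - mu)^2 / (2 * s2)) / sqrt(2 * pi * s2)
f_norm(1.2, 0, 1)  # matches dnorm(1.2)
```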
\[f(\textbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{(\textbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\textbf{x}-\boldsymbol{\mu})}{2}\right)\]
Tip
If you suspect within-class correlations, try variances = "equal", covariances = "equal" (Model 2) and compare fit statistics to Model 1.
A latent profile model for \(J\) continuous variables \((j = 1,\ldots,J)\) with \(C\) classes (\(c = 1,\ldots,C\)):
\[f(\mathbf{x}_i) = \displaystyle {\sum_{c=1}^{C} \eta_c} \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi\sigma^2_{jc}}} \exp\left(\frac{-(x_{ij} - \mu_{jc})^2}{2\sigma^2_{jc}}\right)\]
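A base-R sketch of this mixture density under Model 1 (diagonal covariances, variances shared across classes); all numbers are illustrative, not estimates from the iris fit below:

```r
# LPA density for one observation x
# eta: class shares; mu: J x C matrix of class means;
# sigma2: length-J vector of shared within-class variances
lpa_density <- function(x, eta, mu, sigma2) {
  per_class <- sapply(seq_along(eta), function(c)
    prod(dnorm(x, mean = mu[, c], sd = sqrt(sigma2))))
  sum(eta * per_class)
}

mu <- cbind(c(5.0, 3.4), c(5.9, 2.8))  # 2 variables, 2 classes
lpa_density(c(5.1, 3.5), eta = c(.5, .5), mu = mu, sigma2 = c(.25, .11))
```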
tidyLPA Package
The tidyLPA package (Rosenberg et al., 2018) fits LPA models, wrapping the mclust package with a cleaner interface.
install.packages("tidyLPA")

| Function | Description |
|---|---|
| estimate_profiles() | Fits one or more LPA models |
| get_fit() | Extracts fit statistics (log L, AIC, BIC, entropy) |
| get_estimates() | Extracts class-specific means and variances |
| get_data() | Returns data with posterior probabilities and class assignments |
| plot_profiles() | Creates a class profile plot |
| Model | Variances | Covariances |
|---|---|---|
| 1 (default) | Equal across classes | Zero (diagonal) |
| 2 | Equal across classes | Equal across classes |
| 3 | Varying across classes | Zero (diagonal) |
| 4 | Varying across classes | Varying across classes |
| 5 | Equal across classes | Varying across classes |
| 6 | Varying across classes | Equal across classes |
LPA Example: Fisher’s Iris Data
| Variable | Measurement |
|---|---|
| x1 | Sepal length (cm) |
| x2 | Sepal width (cm) |
| x3 | Petal length (cm) |
| x4 | Petal width (cm) |
# A tibble: 3 × 20
Model Classes LogLik parameters n AIC AWE BIC CAIC CLC KIC
<dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 -489. 13 150 1004. 1145. 1043. 1056. 980. 1020.
2 1 3 -361. 18 150 759. 955. 813. 831. 725. 780.
3 1 4 -356. 23 150 758. 1010. 827. 850. 714. 784.
# ℹ 9 more variables: SABIC <dbl>, ICL <dbl>, Entropy <dbl>, prob_min <dbl>,
# prob_max <dbl>, n_min <dbl>, n_max <dbl>, BLRT_val <dbl>, BLRT_p <dbl>
Note
variances = "equal" and covariances = "zero" specify Model 1 — equal variances across classes, no within-class covariances. This is the standard LPA assumption.
Fit statistics for the Iris LPA (Model 1):
| Classes | Parameters | Log L | AIC | BIC | Entropy |
|---|---|---|---|---|---|
| 2 | 13 | −488.92 | 1003.83 | 1042.97 | 0.991 |
| 3 | 18 | −361.43 | 758.85 | 813.04 | 0.957 |
| 4 | 23 | −310.12 | 666.23 | 735.48 | 0.945 |
# A tibble: 24 × 8
Category Parameter Estimate se p Class Model Classes
<chr> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 Means x1 5.01 0.0508 0 1 1 3
2 Means x2 3.43 0.0535 0 1 1 3
3 Means x3 1.46 0.0218 0 1 1 3
4 Means x4 0.246 0.0155 7.87e-57 1 1 3
5 Variances x1 0.235 0.0261 2.38e-19 1 1 3
6 Variances x2 0.107 0.0143 6.57e-14 1 1 3
7 Variances x3 0.187 0.0254 1.82e-13 1 1 3
8 Variances x4 0.0379 0.00572 3.39e-11 1 1 3
9 Means x1 5.92 0.0750 0 2 1 3
10 Means x2 2.75 0.0443 0 2 1 3
# ℹ 14 more rows
# A tibble: 6 × 10
model_number classes_number x1 x2 x3 x4 CPROB1 CPROB2 CPROB3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 5.1 3.5 1.4 0.2 1 5.37e-20 3.72e-44
2 1 3 4.9 3 1.4 0.2 1 5.82e-19 5.84e-44
3 1 3 4.7 3.2 1.3 0.2 1 1.63e-20 7.19e-46
4 1 3 4.6 3.1 1.5 0.2 1 4.45e-19 4.35e-44
5 1 3 5 3.6 1.4 0.2 1 1.94e-20 1.25e-44
6 1 3 5.4 3.9 1.7 0.4 1 4.69e-16 8.18e-37
# ℹ 1 more variable: Class <dbl>
Model-estimated class counts and proportions (from the posterior probabilities):
FINAL CLASS COUNTS AND PROPORTIONS
Class 1: 50.000 (0.333)
Class 2: 54.888 (0.366)
Class 3: 45.112 (0.301)
Class-specific means from get_estimates():
| Variable | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| x1 (Sepal Length) | 5.01 | 5.92 | 6.68 |
| x2 (Sepal Width) | 3.43 | 2.75 | 3.02 |
| x3 (Petal Length) | 1.46 | 4.33 | 5.61 |
| x4 (Petal Width) | 0.25 | 1.35 | 2.07 |
Class 1 Class 2 Class 3
1.000 0.970 0.966
Concluding Remarks
| Feature | LCA | LPA |
|---|---|---|
| Indicator type | Binary / categorical | Continuous |
| Within-class distribution | Bernoulli | Normal |
| Class parameters | Item probabilities (\(\pi_{jc}\)) | Means (\(\mu_{jc}\)) and variances (\(\sigma^2_{jc}\)) |
| R package | poLCA | tidyLPA |
Use multiple random starts (nrep) to avoid local maxima.