Maximum Likelihood Estimation with Missing Data

Author

Jonathan Templin

Published

March 5, 2025

Introduction

This document demonstrates Maximum Likelihood Estimation (MLE) in the presence of missing data, following Chapter 3 of Enders (2022). The examples utilize the lavaan package for structural equation modeling (SEM) and demonstrate the handling of missing data using FIML (Full Information Maximum Likelihood).

Loading Required Packages

We begin by loading necessary packages. If they are not installed, the script will install them first.

if (!require(mvtnorm)) install.packages("mvtnorm"); library(mvtnorm)

Loading required package: mvtnorm

if (!require(lavaan)) install.packages("lavaan"); library(lavaan)

Loading required package: lavaan

This is lavaan 0.6-19
lavaan is FREE software! Please report any bugs.

if (!require(semPlot)) install.packages("semPlot"); library(semPlot)

Loading required package: semPlot

Simulating Data for the Example

In this section, we generate simulated data for the analysis. The data consist of two groups: men and women, with predefined means, correlations, and standard deviations for a set of variables. We then combine these datasets and prepare them for further analysis.

Setting the Seed

Setting a seed ensures reproducibility so that the same random numbers are generated each time the script is run.

set.seed(20250223)

Defining Sample Sizes

We define the sample sizes for the two groups:

Nmen = 121
Nwomen = 229

Defining Mean Vectors

The mean vectors for men and women are specified for the seven variables:

MeanMEN = c(5.081333347, 9.723030258, 51.94404048, 52.66452548, 34.04606078, 77.06181845, 14.75389904)
MeanWOMEN = c(4.80418631, 10.60486174, 50.34834542, 48.13359134, 30.61321679, 71.77082955, 13.75449003)

These vectors contain the expected means for each variable in the dataset.

Defining the Correlation Matrix

We specify the correlations among the seven variables:

Corr = matrix(c(
  1.0,  .15, .12, .48, .44, .47, .44,
  .15, 1.0, .06, .25, .20, .23, .23,
  .12, .06, 1.0, .40, .32, .19, .14,
  .48, .25, .40, 1.0, .87, .61, .54,
  .44, .20, .32, .87, 1.0, .56, .51,
  .47, .23, .19, .61, .56, 1.0, .70,
  .44, .23, .14, .54, .51, .70, 1.0
), nrow = 7, byrow = TRUE)

Defining Standard Deviations and Constructing the Covariance Matrix

The standard deviations are stored in a vector and used to compute the covariance matrix:

SD = c(1.2, 6.0, 15.2, 16.6, 10.9, 10.5, 2.8)
SDdiag = diag(SD)

# Create the covariance matrix
Cov = SDdiag %*% Corr %*% SDdiag

Generating Multivariate Normal Data

Using the rmvnorm function, we generate normally distributed data for both groups:

xmen = rmvnorm(Nmen, MeanMEN, Cov)
xwomen = rmvnorm(Nwomen, MeanWOMEN, Cov)

Adding Group Identifiers

To distinguish between men and women, we append a group identifier:

xmen = cbind(0, xmen)  # 1 for men
xwomen = cbind(1, xwomen)  # 0 for women

Combining and Formatting the Dataset

The two datasets are combined into a single data frame, and column names are assigned:

allx = as.data.frame(rbind(xmen, xwomen))
names(allx) = c("female", "hsl", "cc", "use", "msc", "mas", "mse", "perf")

At this point, we have a dataset with 350 observations (121 men and 229 women), each with values on seven variables. The variable female acts as a binary indicator (0 = male, 1 = female).

Introducing Missingness in Key Variables

This section introduces missing data into three variables (cc, perf, and use) based on other variables in the dataset. The probability of missingness is modeled using logistic regression equations that depend on female, msc, mas, and mse.

Introducing Missingness in `cc`

We first define a logistic regression model for missingness in cc, using coefficients that determine how each predictor influences the probability of missing data.

M_CC_intercept = -3
M_CC_female = 1
M_CC_msc = .01
M_CC_mas = .01
M_CC_mse = .01

missingCCmodelLogit = 
  M_CC_intercept +
  M_CC_female*allx[,"female"] +
  M_CC_msc*allx[,"msc"] +
  M_CC_mas*allx[,"mas"] +
  M_CC_mse*allx[,"mse"]

missingCCmodelProb = exp(missingCCmodelLogit)/(1+exp(missingCCmodelLogit))

Using these probabilities, we introduce missing values in cc:

makeMissingCC = which(runif(Nmen+Nwomen) < missingCCmodelProb)
allx$cc[makeMissingCC] = NA
allx$cc

  [1] 20.68877296  1.65602752 13.73681033 16.69345233 14.66009497          NA
  [7] 14.58032616 13.96476009          NA          NA -0.83703882 21.18673647
 [13]          NA 26.68356242 14.61730439  9.60319763 17.60507955  2.41473267
 [19] 12.19443765 10.68450120 11.30622665 -1.35699819 11.91915130          NA
 [25] -0.70991074          NA  7.22494813 13.52459626          NA  6.80542846
 [31] -9.00976198          NA 10.82767888  5.63044279 22.51437445 12.83848468
 [37]  4.31661597          NA 16.23420693          NA  2.29597163  2.54519198
 [43]  6.11027746          NA 10.13889582          NA  7.72287315 12.39414602
 [49]          NA  7.45660786 -1.65879328          NA  8.17387658  5.80435467
 [55] 12.22399190  7.63844283          NA 14.18856745  2.96468040          NA
 [61]  4.17224558  9.58928727  4.17705061  7.21909165  7.20645800 15.78044896
 [67]          NA          NA 28.14369310 -3.95927529 18.44520928 18.70995946
 [73] 13.74913339  4.68597920  7.99430875 10.37992360          NA          NA
 [79]          NA  2.71711412 12.93064679          NA  8.63599807          NA
 [85] 14.32572241 13.47978601 18.43434851 14.90355313  3.66127219          NA
 [91]  2.82502647 14.83554489  9.98063535  8.24816285  6.08310985          NA
 [97]          NA  1.39457305  5.62523811 11.75549884  3.93349207 20.45183338
[103] 24.25621031 10.84340723 -2.84450014 13.05872326  9.04898555  2.53536260
[109]  8.64944912 -0.01670160 10.71754742 10.36362085  1.80423382  7.53049158
[115] 15.03613391          NA  4.83098135  7.69556621 -1.08588238          NA
[121] 20.19254523 13.22546293          NA          NA          NA  4.97476537
[127]  5.15342649  2.61963061          NA  7.31788397          NA 15.51772754
[133]  2.54117098          NA 17.52726534 21.71285310  8.84326442          NA
[139] 25.82702399 11.90511587          NA -3.02333067 12.46731186          NA
[145] 13.34106674  2.02597485 14.53714471          NA  9.33414912 -0.84084674
[151]          NA 23.63180122 13.19510328  3.48271022 21.38718621  9.49092808
[157]  3.07328202 21.56445345  1.26162964 -4.94984000          NA          NA
[163]  3.03042720 12.09593584 19.29345513          NA  7.27376992          NA
[169]  5.83638107 14.02980526  9.97739131          NA 13.30099567 14.14158735
[175]          NA  3.17356989          NA  6.02834603 11.30359147 -0.08038210
[181]          NA          NA 11.75849005 11.76825913 10.67281282  7.45948021
[187]  5.40098562 24.51226796 15.11396647  3.69062700 17.22218008 12.21719678
[193]  3.26261970          NA 15.81342167 26.30823576 18.21626595  3.83694005
[199]          NA          NA          NA -2.96432020          NA 15.70318920
[205]  4.21512995          NA 16.67881390 11.74920261 11.81829496 10.91841027
[211] 13.58045227          NA 14.13052072          NA  3.57643827          NA
[217] 18.84649533  8.00503778  7.26123490 16.27713426  7.90041344          NA
[223]          NA 21.47220079 11.62737839  4.16300013          NA  4.43626219
[229]          NA          NA          NA 12.83767154  4.77358290  4.60727157
[235]  6.78529494 16.17905186  5.47012243 10.71751969          NA  9.88217712
[241]  5.19128754          NA 20.12108506 16.74394103 11.74419815          NA
[247]          NA  6.77343122          NA  4.19443908          NA          NA
[253]          NA  4.95320079  1.28836215 12.35997231 15.69213505 13.12414624
[259]  6.05560820 -1.04684220  2.52473359 17.33472307          NA          NA
[265]  3.34179491          NA          NA  3.35071208 14.17605790 16.54374229
[271]          NA  4.89561025          NA 10.02568251          NA  5.64409116
[277]  4.16025019 11.95087151 11.61832686          NA          NA 13.58890687
[283]  9.37765839  8.58732735 14.58023686 -1.03078572  7.35680729          NA
[289]          NA 22.91332725  2.29257932  4.04884083          NA  2.22596863
[295] 14.09941692          NA  6.81444476          NA 14.13808192          NA
[301] 11.92672483 10.95158820  6.59646306          NA 12.60915360 13.38213625
[307]          NA  4.57794368  7.63830798  9.80753680          NA  6.79913995
[313] 14.20393139  2.77927947          NA          NA 17.03332922  0.05252297
[319]          NA 11.43643877 13.19939049          NA 14.49636806          NA
[325]          NA          NA          NA 10.90855017  6.06223292  5.37893438
[331] 12.12554920  6.86094626          NA          NA  6.69547733          NA
[337]          NA  6.69898494          NA          NA          NA -4.94986051
[343] 10.25306957          NA 15.51230101 17.86009476  2.72476148 12.32847625
[349] 14.49991310  3.67101654

Introducing Missingness in `perf`

Next, we define a logistic model for missingness in perf:

M_perf_intercept = -3
M_perf_female = .5
M_perf_msc = .001
M_perf_mas = .02
M_perf_mse = .01

missingPERFmodelLogit = 
  M_perf_intercept +
  M_perf_female*allx[,"female"] +
  M_perf_msc*allx[,"msc"] +
  M_perf_mas*allx[,"mas"] +
  M_perf_mse*allx[,"mse"]

missingPERFmodelProb = exp(missingPERFmodelLogit)/(1+exp(missingPERFmodelLogit))

We then apply this model to introduce missing values in perf:

makeMissingPerf = which(runif(Nmen+Nwomen) < missingPERFmodelProb)
allx$perf[makeMissingPerf] = NA
allx$perf

  [1] 11.776558 20.532909        NA 12.014992        NA 15.273834        NA
  [8] 13.903625 17.686551        NA 13.284943 13.757807 17.605380 17.324344
 [15] 13.413876 17.298330 13.005129 17.396852 14.690130 16.918552        NA
 [22] 18.221247 15.433579 11.323378 16.682367 14.087046 14.028023 18.004349
 [29]        NA 16.864636  5.860316 13.793872        NA 13.611799        NA
 [36] 18.231649 15.850543 15.598164 14.521375 13.505225 14.861601 10.177396
 [43]  9.842593 15.922507 14.214737 18.223424 16.622598 14.881453 12.707240
 [50] 11.150810        NA 19.506276 14.662785 13.761915        NA 13.669431
 [57]        NA 16.006301 15.177665 14.089205        NA 13.496658 11.571074
 [64] 15.938117  9.607525 13.643152  8.961143 13.948205 14.282404 13.523190
 [71] 15.268891 11.201664 18.009158 12.677790 12.430438 16.673858        NA
 [78]        NA 14.445838 14.317263 15.647369 18.784013 12.242757 15.474515
 [85]        NA 13.104960 14.187725 15.377330 13.985991        NA 11.988029
 [92] 13.584972 12.582491        NA 19.975444 13.277893 17.974763 14.165004
 [99] 18.884884 15.960231 12.222217  9.544307 18.087247 16.232268 13.009203
[106] 20.680876 13.612148 12.005250 18.583906 14.768090        NA 14.539714
[113] 12.164501 12.539997        NA 18.912096 12.058256        NA 16.756301
[120] 13.215008 16.670072        NA 12.035312  8.173500        NA 14.686185
[127]  7.700439 11.923834        NA 13.403689        NA 15.293682 11.679634
[134]        NA 15.935398 14.185794 12.636334        NA 14.557324        NA
[141]        NA        NA        NA 10.457553 15.338546  8.624914  9.955488
[148] 16.992809 11.534040 13.853268 15.345128 14.993545        NA 10.702232
[155]        NA 11.251722        NA 14.760888 15.828686 12.465003 15.285171
[162] 18.726137 11.666261        NA 12.036275 15.671001 14.410638 13.231104
[169] 12.169025 12.926418  9.326221 13.238511 13.200097  7.254682        NA
[176]        NA 17.942116 14.779456 14.966809 12.469036        NA  9.130764
[183] 11.250993 12.101902  8.763037 15.089698  9.307276 14.986260        NA
[190]  7.933676 12.651794 13.923861  9.560389 16.752783        NA        NA
[197] 13.199872 15.446361 11.081655        NA        NA 15.735516 10.914531
[204] 16.065137        NA 15.413576 15.431361  6.269605 14.785674        NA
[211]        NA 12.622339        NA 21.155096        NA 14.589418 19.456695
[218] 10.739210 14.796098        NA 12.425648 15.782967 15.110793        NA
[225]        NA 13.369133        NA 12.749807 18.770868        NA 17.661898
[232] 14.806020 12.644463        NA  6.476010        NA 13.944203  8.450241
[239]        NA        NA 11.787318 16.772767 15.571792        NA        NA
[246]        NA 14.217849 11.686289 15.520662 11.394952 14.205548        NA
[253] 17.593349        NA        NA 15.207729        NA 14.042116        NA
[260] 11.609236  9.950430 16.016002 15.438955        NA 12.433468  8.438971
[267]  9.323093 12.448831 12.512473 14.412143 11.607975        NA 16.007915
[274]        NA 13.290832 13.392632 17.388186 17.991244 11.973358 17.417919
[281] 14.793300 12.462660 12.432486 16.661573        NA  9.163985 12.155169
[288] 11.459977 13.470446 16.256209 14.476314        NA 11.375561 18.146211
[295] 15.558148 11.508424 12.433611 18.202976        NA 10.277156 10.529435
[302] 16.978648        NA        NA 13.244885 14.128402 14.199863 10.310462
[309]  9.744907  9.623580        NA 16.472275 15.258739 11.380671 13.898693
[316] 13.991704  8.604307 13.622807 13.507030        NA 11.534204        NA
[323] 11.988585  9.177616        NA        NA 14.537967  8.383236 12.000601
[330]  9.112816 14.905749 12.184384 15.740261 13.987004 18.343706 18.811134
[337] 11.598980 16.541879 15.719937 17.388073 14.645793        NA        NA
[344] 17.109487 15.758917        NA  9.350367 16.758192 11.837373  9.365771

Introducing Missingness in `use`

Finally, we define and apply the missing data model for use:

M_use_intercept = -3
M_use_female = .5
M_use_msc = .001
M_use_mas = .02
M_use_mse = .01

missingUSEmodelLogit = 
  M_use_intercept +
  M_use_female*allx[,"female"] +
  M_use_msc*allx[,"msc"] +
  M_use_mas*allx[,"mas"] +
  M_use_mse*allx[,"mse"]

missingUSEmodelProb = exp(missingUSEmodelLogit)/(1+exp(missingUSEmodelLogit))

Applying the model to introduce missing values in use:

makeMissingUse = which(runif(Nmen+Nwomen) < missingUSEmodelProb)
allx$use[makeMissingUse] = NA
allx$use

  [1]  25.98603  57.94316  31.51112  49.23683  86.03307  49.83644  27.34231
  [8]        NA  57.59374  70.01668  44.48007  49.69265  57.65336  49.41270
 [15]  59.91785        NA  34.80138  96.48215  20.25779  33.61425  47.31303
 [22]        NA  39.34618        NA  55.66349  66.11912  37.38161  75.78362
 [29]  57.12786  54.32838  63.53114  54.11850  13.17825  21.08136  49.59362
 [36]  91.22713  64.22819  58.24329  59.86556  59.54157        NA  58.06283
 [43]  43.19457  52.17066  43.69302        NA        NA  18.74274  42.69099
 [50]  94.35209  58.98487  70.41901  45.44647        NA  53.56132  56.56360
 [57]  61.43773  68.00276        NA        NA        NA  62.42024  36.09353
 [64]  62.78615  44.15615  47.06736  47.92966  27.89581        NA  61.95347
 [71]  76.98301  27.90086  64.01986  55.22360  45.70850  41.97154  78.01587
 [78]  67.51520  42.53303        NA  43.13805        NA  72.99267  48.33413
 [85]        NA  52.75999        NA        NA  67.39213  77.35434  44.20284
 [92]  39.05459  69.43056  39.46969  54.24935  50.73233        NA  59.20456
 [99]  50.44884        NA  25.00361        NA  72.91814  22.79541  23.49984
[106]  58.85973  62.75545  17.59428        NA  35.52535  51.08052        NA
[113]  60.60145  57.22317  35.13498  45.27695  85.59726  47.38416        NA
[120]  44.55703  47.76403  64.57465  41.97907  40.11014        NA  60.30138
[127]  23.65381  34.48606  43.37461  51.13824  59.24258        NA  46.97206
[134]  31.03877  84.09366  45.49572  69.83184  53.83889  57.40198        NA
[141]  34.03711  62.94533  17.25072  44.12457  68.39509  51.37247  32.22518
[148]        NA  74.70326  58.03298  69.38816        NA  56.33521  36.08264
[155]        NA  64.40656        NA  71.21015  57.08863  39.28833  64.31976
[162]  69.00922        NA  62.26705  53.32167        NA  18.23179  66.52969
[169]        NA        NA  44.39014  59.19648  51.16867  44.24391  50.90536
[176]  40.16322  47.69272        NA  54.51617  55.61958  59.72962  47.00937
[183]        NA  25.74983  42.61128  36.05553  26.56331  41.52856  64.48250
[190]  64.12590  53.47216  53.20361        NA        NA  29.71283  64.25734
[197]        NA        NA  49.31666  36.10886  20.57751        NA  29.60830
[204]  12.46264  53.49363        NA  51.26203  54.70825  59.09609        NA
[211]        NA  72.33259        NA  57.32283  66.09566  65.08258  53.04131
[218]        NA  63.60225  27.13580        NA        NA  74.08107  62.58268
[225]  42.32239  39.04382  31.48064  26.58099  38.73084  30.92043        NA
[232]  45.52906  28.75810  61.15427  34.97287  35.72148  39.27290  48.42687
[239]  52.55787  61.29553  25.66332  41.43011  40.98506  44.52377  63.22256
[246]  47.45736  77.41504  39.84078  43.43498  52.59358  68.06403  47.78566
[253]  64.46720  34.37907  73.10910  54.75353  47.39750        NA  40.70835
[260]        NA  30.49188  35.39790        NA  63.15291  32.96431        NA
[267]  43.21773        NA        NA        NA  49.42643        NA  72.97730
[274]  58.83835        NA  52.99675 106.56918        NA  39.83700  59.44301
[281]  38.65360  59.88584  12.74757  57.00097  59.87038  60.50245  58.84270
[288]  43.33702  42.07626        NA        NA  51.87161  49.24108  58.63862
[295]  57.15762  32.34015  35.70952        NA        NA  12.47632        NA
[302]  71.44753  41.48316  38.31634        NA  68.66190  61.32496  43.70253
[309]  48.27531  42.79502  49.01786  38.58815  23.12677  36.91781  79.25350
[316]  62.17726  31.54420  20.91091        NA  59.56096        NA  21.65628
[323]  67.20952  68.70452  66.74332  45.87015        NA  53.00691  95.33053
[330]  65.55181  92.21331        NA        NA  57.02761        NA        NA
[337]        NA  47.53438  22.15236  40.85540  64.25376  52.05446  74.45331
[344]  26.36124  64.01290        NA  57.23752  57.50460  27.45201  52.93909

Converting the Data into a Data Frame

Finally, we store the modified dataset in a new data frame for further analysis:

data02 = as.data.frame(allx)

Centering `cc` and Initial Data Exploration

To improve interpretability, we center the cc variable at 10 and create an interaction term between female and cc10.

Visualizing `cc`

We first generate a boxplot and a histogram to examine the distribution of cc before centering:

par(mfrow = c(1,2))
boxplot(data02$cc, main="Boxplot of College Experience (CC)")
hist(data02$cc, main = "Histogram of College Experience (CC)", xlab = "CC")

par(mfrow = c(1,1))

Centering `cc`

data02$cc10 = data02$cc - 10

Creating an Interaction Term

data02$femXcc10 = data02$female*data02$cc10

Preparing Data for Analysis

We create a subset of the dataset including only relevant variables:

newData = data02[c("perf", "use", "female", "cc10")]

Checking for Completely Missing Rows

allNA = apply(newData, 1, function(x) all(is.na(x)))
which(allNA)

integer(0)

No observations are missing all variables, so we proceed.

Building Analysis Model #1: Empty Model

This model estimates the means, variances, and covariance of perf and use without predictors.

Checking Missing Data Patterns

sum(is.na(newData$perf) & is.na(newData$use))

[1] 12

sum(!is.na(newData$perf) | !is.na(newData$use))

[1] 338

Defining the Model Syntax

model01.syntax = "

#Means:
perf ~ 1
use  ~ 1

#Variances:
perf ~~ perf
use  ~~ use

#Covariance:
perf ~~ use
"

Model Estimation Using `lavaan`

model01.fit = lavaan(model01.syntax, data=data02, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   some cases are empty and will be ignored: 61 85 125 140 155 157 210 211 
   213 272 299 346.

summary(model01.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 29 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           338         350
  Number of missing patterns                         3            

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                                10.038      12.196
  Degrees of freedom                                 1           1
  P-value                                        0.002       0.000
  Scaling correction factor                                  0.823

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -1829.062   -1829.062
  Loglikelihood unrestricted model (H1)      -1829.062   -1829.062
                                                                  
  Akaike (AIC)                                3668.125    3668.125
  Bayesian (BIC)                              3687.240    3687.240
  Sample-size adjusted Bayesian (SABIC)       3671.379    3671.379

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~~                                                               
    use               9.908    2.776    3.569    0.000    9.908    0.209

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    perf             13.837    0.173   79.878    0.000   13.837    4.838
    use              50.802    0.999   50.834    0.000   50.802    3.062

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    perf              8.179    0.677   12.077    0.000    8.179    1.000
    use             275.265   24.557   11.209    0.000  275.265    1.000

Comparing Model-Implied and Observed Covariance Matrices

fitted(model01.fit)

$cov
        perf     use
perf   8.179        
use    9.908 275.265

$mean
  perf    use 
13.837 50.802

meanVec = colMeans(data02[c("perf", "use")], na.rm = TRUE)
meanVec

    perf      use 
13.84260 50.61093

cov(data02[c("perf", "use")], use = "complete.obs")

          perf       use
perf  8.601417  10.83042
use  10.830424 287.89536

N = sum(!is.na(newData$perf) | !is.na(newData$use))
covMat = cov(data02[c("perf", "use")], use = "complete.obs")*(N-1)/N
covMat

          perf       use
perf  8.575969  10.79838
use  10.798382 287.04360

Inspecting Model Residuals

inspect(model01.fit, what="sampstat.h1")

$cov
        perf     use
perf   8.179        
use    9.908 275.265

$mean
  perf    use 
13.837 50.802

residuals(model01.fit, type = "raw")

$type
[1] "raw"

$cov
     perf use
perf    0    
use     0   0

$mean
perf  use 
   0    0

residuals(model01.fit, type = "normalized")

$type
[1] "normalized"

$cov
     perf use
perf    0    
use     0   0

$mean
perf  use 
   0    0

Checking R-Squared Values

inspect(model01.fit, what="r2")

numeric(0)

Visualizing the Model

We generate path diagrams to visualize the estimated relationships.

semPaths(model01.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, what ="path")

semPaths(model01.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, whatLabels ="est")

Verifying Log-Likelihood Calculation

In this section, we compare the log-likelihood calculated manually with the one obtained from lavaan to confirm that they match.

Manual Log-Likelihood Calculation

We first extract complete cases and compute the log-likelihood manually:

completeData = data02[, c("perf", "use")]

# remove cases that are missing on all variables
completeData = completeData[!is.na(completeData$perf) | !is.na(completeData$use),]
logLike = rep(NA, nrow(completeData))
for (obs in 1:nrow(completeData)){
  obsData = completeData[obs,]
  obsVars = names(completeData)[which(!is.na(obsData))]
  logLike[obs] = dmvnorm(
    x = obsData[obsVars], 
    mean = meanVec[obsVars], 
    sigma = covMat[obsVars, obsVars, drop=FALSE], 
    log = TRUE
  )
}

# our calculation
sum(logLike)

[1] -1829.344

Log-Likelihood from `lavaan`

sum(inspect(model01.fit, what="loglik.casewise"), na.rm = TRUE)

[1] -1829.062

Comparing Factored Regression Specifications

We now compare different factored regression specifications to confirm they yield the same log-likelihood.

Model 1b: `perf ~ use` and `use ~ 1`

model01b.syntax = "

# exogenous mean:
use ~ 1

# exogenous variance:
use ~~ use

# regression:
perf ~ 1 + use

# residual variance:
perf ~~ perf
"

model01b.fit = lavaan(model01b.syntax, data=data02, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   some cases are empty and will be ignored: 61 85 125 140 155 157 210 211 
   213 272 299 346.

summary(model01b.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 16 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           338         350
  Number of missing patterns                         3            

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                                10.038      12.196
  Degrees of freedom                                 1           1
  P-value                                        0.002       0.000
  Scaling correction factor                                  0.823

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -1829.062   -1829.062
  Loglikelihood unrestricted model (H1)      -1829.062   -1829.062
                                                                  
  Akaike (AIC)                                3668.125    3668.125
  Bayesian (BIC)                              3687.240    3687.240
  Sample-size adjusted Bayesian (SABIC)       3671.379    3671.379

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~                                                                
    use               0.036    0.010    3.648    0.000    0.036    0.209

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    use              50.802    0.999   50.834    0.000   50.802    3.062
   .perf             12.008    0.527   22.766    0.000   12.008    4.199

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    use             275.265   24.557   11.209    0.000  275.265    1.000
   .perf              7.822    0.670   11.682    0.000    7.822    0.956

Model 1c: `use ~ perf` and `perf ~ 1`

model01c.syntax = "

# exogenous mean:
perf ~ 1

# exogenous variance:
perf ~~ perf

# regression:
use ~ 1 + perf

# residual variance:
use ~~ use
"

model01c.fit = lavaan(model01c.syntax, data=data02, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   some cases are empty and will be ignored: 61 85 125 140 155 157 210 211 
   213 272 299 346.

summary(model01c.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 20 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           338         350
  Number of missing patterns                         3            

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                                10.038      12.196
  Degrees of freedom                                 1           1
  P-value                                        0.002       0.000
  Scaling correction factor                                  0.823

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -1829.062   -1829.062
  Loglikelihood unrestricted model (H1)      -1829.062   -1829.062
                                                                  
  Akaike (AIC)                                3668.125    3668.125
  Bayesian (BIC)                              3687.240    3687.240
  Sample-size adjusted Bayesian (SABIC)       3671.379    3671.379

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  use ~                                                                 
    perf              1.211    0.337    3.590    0.000    1.211    0.209

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    perf             13.837    0.173   79.878    0.000   13.837    4.838
   .use              34.040    4.679    7.275    0.000   34.040    2.052

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    perf              8.179    0.677   12.077    0.000    8.179    1.000
   .use             263.262   23.850   11.038    0.000  263.262    0.956

Model 2: Adding Predictors as Main Effects

In this section, we expand our analysis by adding predictors (female and cc10) as main effects in the regression models for perf and use.

Specifying the Model Syntax

We define a structural equation model where perf and use are regressed on female and cc10. The variances and covariances of perf and use are also estimated.

model02.syntax = "

#Means:
perf ~ 1 + female + cc10
use  ~ 1 + female + cc10

#Variances:
perf ~~ perf
use  ~~ use

#Covariance:
perf ~~ use
"

Estimating the Model

We fit the model using the lavaan package with the MLR estimator and MPLUS mimic settings.

model02.fit = lavaan(model02.syntax, data=data02, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   105 cases were deleted due to missing values in exogenous variable(s), 
   while fixed.x = TRUE.

Summarizing the Model Results

We generate a summary of the model fit, including standardized estimates and fit indices.

summary(model02.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 46 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9

                                                  Used       Total
  Number of observations                           245         350
  Number of missing patterns                         4            

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                                30.301      31.752
  Degrees of freedom                                 5           5
  P-value                                        0.000       0.000
  Scaling correction factor                                  0.954

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -1275.323   -1275.323
  Loglikelihood unrestricted model (H1)      -1275.323   -1275.323
                                                                  
  Akaike (AIC)                                2568.645    2568.645
  Bayesian (BIC)                              2600.157    2600.157
  Sample-size adjusted Bayesian (SABIC)       2571.627    2571.627

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~                                                                
    female           -1.586    0.384   -4.135    0.000   -1.586   -0.275
    cc10              0.081    0.030    2.721    0.007    0.081    0.192
  use ~                                                                 
    female           -0.824    2.598   -0.317    0.751   -0.824   -0.023
    cc10             -0.025    0.171   -0.145    0.885   -0.025   -0.009

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .perf ~~                                                               
   .use               8.676    3.326    2.608    0.009    8.676    0.190

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .perf             14.541    0.291   49.940    0.000   14.541    5.204
   .use              50.868    2.099   24.230    0.000   50.868    2.941

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .perf              6.943    0.706    9.840    0.000    6.943    0.889
   .use             298.866   32.836    9.102    0.000  298.866    0.999

Inspecting Model-Implied Mean Vector and Covariance Matrix

The fitted() function provides the expected mean vector and covariance matrix under the estimated model.

fitted(model02.fit)

$cov
          perf     use  female    cc10
perf     7.809                        
use      8.895 299.055                
female  -0.369  -0.195   0.236        
cc10     3.447  -1.120   0.057  43.418

$mean
  perf    use female   cc10 
13.528 50.365  0.620 -0.352

Examining the Saturated Model Statistics

The inspect() function allows us to see the sample statistics from the fully saturated model.

inspect(model02.fit, what="sampstat.h1")

$cov
          perf     use  female    cc10
perf     7.809                        
use      8.895 299.055                
female  -0.369  -0.195   0.236        
cc10     3.447  -1.120   0.057  43.418

$mean
  perf    use female   cc10 
13.528 50.365  0.620 -0.352

Assessing Residuals

We check the raw residuals to assess the discrepancy between model-implied and observed covariance structures.

residuals(model02.fit, type = "raw")

$type
[1] "raw"

$cov
       perf use female cc10
perf      0                
use       0   0            
female    0   0      0     
cc10      0   0      0    0

$mean
  perf    use female   cc10 
     0      0      0      0

We also examine the normalized residuals to further evaluate model fit.

residuals(model02.fit, type = "normalized")

$type
[1] "normalized"

$cov
       perf use female cc10
perf      0                
use       0   0            
female    0   0      0     
cc10      0   0      0    0

$mean
  perf    use female   cc10 
     0      0      0      0

Checking Modification Indices

The modindices() function provides information about potential model improvements by suggesting additional parameters.

modindices(model02.fit)

[1] lhs      op       rhs      mi       epc      sepc.lv  sepc.all sepc.nox
<0 rows> (or 0-length row.names)

Calculating R-Squared Values for Dependent Variables

We inspect the proportion of variance explained for the dependent variables in the model.

inspect(model02.fit, what="r2")

 perf   use 
0.111 0.001

Visualizing the Model

Finally, we create path diagrams to illustrate the relationships in the estimated model.

Path Diagram Without Estimates

semPaths(model02.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, what ="path")

Path Diagram With Estimates

semPaths(model02.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, whatLabels ="est")

Model 3: Expanding the Likelihood Function

In this section, we expand the likelihood function by incorporating more variables into the estimation process.

Specifying the Model Syntax

This model includes cc10 as an exogenous variable and estimates its mean and variance. We also specify regressions for perf and use with female and cc10 as predictors, while maintaining variance and covariance structures.

model03.syntax = "
# Exogenous Variables 
#Mean:
cc10 ~ 1

#Variance:
cc10 ~~ cc10

# Endogenous Variables 
#Regressions:
perf ~ 1 + female + cc10
use  ~ 1 + female + cc10

#Residual Variances:
perf ~~ perf
use  ~~ use

#Residual Covariance:
perf ~~ use
"

Estimating the Model

We fit the expanded model using lavaan, ensuring that there are no warnings related to missing data.

model03.fit = lavaan(model03.syntax, data=data02, estimator = "MLR", mimic = "MPLUS")

Summarizing the Model Results

We extract and examine the model summary, including standardized estimates and fit measures.

summary(model03.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 53 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        11

  Number of observations                           350
  Number of missing patterns                         8

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.157       0.148
  Degrees of freedom                                 1           1
  P-value (Chi-square)                           0.692       0.700
  Scaling correction factor                                  1.059
    Yuan-Bentler correction (Mplus variant)                       

Model Test Baseline Model:

  Test statistic                                32.478      33.840
  Degrees of freedom                                 6           6
  P-value                                        0.000       0.000
  Scaling correction factor                                  0.960

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.191       1.184
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.181

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2627.493   -2627.493
  Scaling correction factor                                  0.970
      for the MLR correction                                      
  Loglikelihood unrestricted model (H1)      -2627.415   -2627.415
  Scaling correction factor                                  0.978
      for the MLR correction                                      
                                                                  
  Akaike (AIC)                                5276.987    5276.987
  Bayesian (BIC)                              5319.424    5319.424
  Sample-size adjusted Bayesian (SABIC)       5284.528    5284.528

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.105       0.099
  P-value H_0: RMSEA <= 0.050                    0.797       0.816
  P-value H_0: RMSEA >= 0.080                    0.106       0.092
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.125
  P-value H_0: Robust RMSEA <= 0.050                         0.775
  P-value H_0: Robust RMSEA >= 0.080                         0.144

Standardized Root Mean Square Residual:

  SRMR                                           0.007       0.007

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~                                                                
    female           -1.334    0.338   -3.944    0.000   -1.334   -0.222
    cc10              0.088    0.031    2.847    0.004    0.088    0.204
  use ~                                                                 
    female           -2.036    2.108   -0.966    0.334   -2.036   -0.058
    cc10             -0.007    0.156   -0.048    0.962   -0.007   -0.003

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .perf ~~                                                               
   .use               9.385    2.761    3.399    0.001    9.385    0.208

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             -0.227    0.424   -0.535    0.593   -0.227   -0.034
   .perf             14.712    0.261   56.287    0.000   14.712    5.145
   .use              52.122    1.721   30.284    0.000   52.122    3.140

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             43.505    3.841   11.326    0.000   43.505    1.000
   .perf              7.436    0.634   11.734    0.000    7.436    0.909
   .use             274.517   24.516   11.197    0.000  274.517    0.997

Inspecting Model-Implied Mean Vector and Covariance Matrix

We check the expected mean vector and covariance matrix under the estimated model.

fitted(model03.fit)

$cov
          perf     use  female    cc10
perf     8.177                        
use      9.971 275.457                
female  -0.302  -0.461   0.226        
cc10     3.841  -0.324   0.000  43.505

$mean
  perf    use female   cc10 
13.819 50.792  0.654 -0.227

Examining the Saturated Model Statistics

We inspect the sample statistics from the fully saturated model.

inspect(model03.fit, what="sampstat.h1")

$cov
          perf     use  female    cc10
perf     8.163                        
use      9.950 275.464                
female  -0.297  -0.461   0.226        
cc10     3.755  -0.576   0.078  43.520

$mean
  perf    use female   cc10 
13.819 50.791  0.654 -0.215

Assessing Residuals

We evaluate the raw residuals to assess model fit discrepancies.

residuals(model03.fit, type = "raw")

$type
[1] "raw"

$cov
         perf    use female   cc10
perf   -0.014                     
use    -0.021  0.007              
female  0.005 -0.001  0.000       
cc10   -0.086 -0.252  0.078  0.015

$mean
  perf    use female   cc10 
 0.000  0.000  0.000  0.012

We also check the normalized residuals for further model diagnostics.

residuals(model03.fit, type = "normalized")

$type
[1] "normalized"

$cov
         perf    use female   cc10
perf   -0.021                     
use    -0.008  0.000              
female  0.066 -0.001  0.000       
cc10   -0.062 -0.037  0.385  0.004

$mean
  perf    use female   cc10 
 0.000  0.000  0.000  0.028

Checking Modification Indices

The modindices() function suggests potential model improvements by identifying additional parameters to estimate.

modindices(model03.fit)

      lhs op    rhs    mi    epc sepc.lv sepc.all sepc.nox
18 female  ~   perf 0.163  0.021   0.021    0.126    0.126
19 female  ~    use 0.163 -0.248  -0.248   -8.665   -8.665
20 female  ~   cc10 0.163  0.002   0.002    0.026    0.002
21   cc10  ~   perf 0.163 -0.267  -0.267   -0.116   -0.116
22   cc10  ~    use 0.163 -0.175  -0.175   -0.440   -0.440
23   cc10  ~ female 0.163  0.356   0.356    0.026    0.054

Calculating R-Squared Values for Dependent Variables

We compute the proportion of variance explained for the dependent variables in the model.

inspect(model03.fit, what="r2")

 perf   use 
0.091 0.003

Visualizing the Model

We generate path diagrams to illustrate the estimated relationships.

Path Diagram Without Estimates

semPaths(model03.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, what ="path")

Path Diagram With Estimates

semPaths(model03.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, whatLabels ="est")

Handling Missing Data in a Categorical Variable (`female`)

We introduce missingness in female completely at random (MCAR) to assess its impact on model estimation.

Creating Missing Values in `female`

We randomly introduce missing values in 10% of observations for female.

data03 = data02
data03$female[sample(x = 1:nrow(data03), replace = TRUE, size = .1*nrow(data03) )] = NA
data03$female

  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0 NA NA  0  0
 [26] NA  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0  0  0  0  0  0  0  0  0  0
 [51]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0  0  0  0  0  0  0
 [76]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0  0  0
[101]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0  0  0  1  1  1  1
[126]  1  1  1 NA  1  1  1  1  1  1  1  1 NA  1  1 NA  1  1  1  1  1  1  1  1  1
[151]  1  1  1 NA  1  1 NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[176]  1  1  1  1  1  1  1 NA  1  1  1  1  1  1  1  1  1  1 NA  1  1  1  1 NA  1
[201]  1  1  1  1  1  1  1  1  1  1  1 NA  1  1  1  1  1 NA  1  1  1  1  1  1  1
[226] NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 NA
[251]  1  1  1  1  1  1  1  1  1  1 NA  1 NA  1  1  1  1  1  1  1  1  1 NA  1  1
[276]  1  1  1  1  1  1  1  1  1  1  1  1  1  1 NA NA NA  1  1  1  1  1  1  1  1
[301]  1  1  1  1  1  1 NA  1  1  1  1 NA  1  1  1  1  1  1  1 NA NA  1  1  1  1
[326]  1  1  1  1  1 NA  1  1  1  1  1  1 NA  1 NA  1  1  1  1  1  1  1  1  1  1

Running the Model with Missing `female`

We attempt to estimate model03.syntax on the modified dataset, expecting a warning about missing data since female is not explicitly included in the likelihood function.

model03b.fit = lavaan(model03.syntax, data=data03, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   33 cases were deleted due to missing values in exogenous variable(s), 
   while fixed.x = TRUE.

Model 4: Adding `female` to the Likelihood Function

In this section, we extend the likelihood function to explicitly model female as an exogenous variable alongside cc10.

Specifying the Model Syntax

This model includes female as an exogenous variable, estimating its mean and variance, as well as its covariance with cc10. The endogenous variables perf and use are regressed on both female and cc10.

model04.syntax = "

# Exogenous Variables 
#Mean:
cc10 ~ 1
female ~ 1

#Variance:
cc10 ~~ cc10
female ~~ female

#Covariance:
cc10 ~~ female

# Endogenous Variables 
#Regressions:
perf ~ 1 + female + cc10
use  ~ 1 + female + cc10

#Residual Variances:
perf ~~ perf
use  ~~ use

#Residual Covariance:
perf ~~ use
"

Estimating the Model

We fit the model using lavaan, incorporating female into the likelihood function.

model04.fit = lavaan(model04.syntax, data=data03, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   some observed variances are (at least) a factor 1000 times larger than 
   others; use varTable(fit) to investigate

Summarizing the Model Results

We extract and examine the model summary, including standardized estimates and fit measures.

summary(model04.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 57 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        14

  Number of observations                           350
  Number of missing patterns                        15

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                                33.239      34.571
  Degrees of freedom                                 6           6
  P-value                                        0.000       0.000
  Scaling correction factor                                  0.961

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2843.479   -2843.479
  Loglikelihood unrestricted model (H1)      -2843.479   -2843.479
                                                                  
  Akaike (AIC)                                5714.958    5714.958
  Bayesian (BIC)                              5768.969    5768.969
  Sample-size adjusted Bayesian (SABIC)       5724.556    5724.556

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~                                                                
    female           -1.423    0.348   -4.091    0.000   -1.423   -0.238
    cc10              0.089    0.031    2.840    0.005    0.089    0.205
  use ~                                                                 
    female           -1.917    2.202   -0.871    0.384   -1.917   -0.055
    cc10             -0.010    0.156   -0.066    0.948   -0.010   -0.004

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  female ~~                                                             
    cc10              0.103    0.209    0.493    0.622    0.103    0.033
 .perf ~~                                                               
   .use               9.374    2.794    3.355    0.001    9.374    0.208

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             -0.211    0.423   -0.498    0.618   -0.211   -0.032
    female            0.643    0.027   23.967    0.000    0.643    1.344
   .perf             14.757    0.263   56.183    0.000   14.757    5.165
   .use              52.026    1.746   29.800    0.000   52.026    3.135

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             43.500    3.838   11.335    0.000   43.500    1.000
    female            0.229    0.008   29.729    0.000    0.229    1.000
   .perf              7.382    0.633   11.653    0.000    7.382    0.904
   .use             274.611   24.485   11.215    0.000  274.611    0.997

Inspecting Model-Implied Mean Vector and Covariance Matrix

We check the expected mean vector and covariance matrix under the estimated model.

fitted(model04.fit)

$cov
          perf     use  female    cc10
perf     8.162                        
use      9.943 275.462                
female  -0.317  -0.440   0.229        
cc10     3.711  -0.645   0.103  43.500

$mean
  perf    use female   cc10 
13.823 50.795  0.643 -0.211

Examining the Saturated Model Statistics

We inspect the sample statistics from the fully saturated model.

inspect(model04.fit, what="sampstat.h1")

$cov
          perf     use  female    cc10
perf     8.162                        
use      9.943 275.462                
female  -0.317  -0.440   0.229        
cc10     3.711  -0.645   0.103  43.500

$mean
  perf    use female   cc10 
13.823 50.795  0.643 -0.211

Assessing Residuals

We evaluate the raw residuals to assess model fit discrepancies.

residuals(model04.fit, type = "raw")

$type
[1] "raw"

$cov
       perf use female cc10
perf      0                
use       0   0            
female    0   0      0     
cc10      0   0      0    0

$mean
  perf    use female   cc10 
     0      0      0      0

We also check the normalized residuals for further model diagnostics.

residuals(model04.fit, type = "normalized")

$type
[1] "normalized"

$cov
       perf use female cc10
perf      0                
use       0   0            
female    0   0      0     
cc10      0   0      0    0

$mean
  perf    use female   cc10 
     0      0      0      0

Checking Modification Indices

The modindices() function suggests potential model improvements by identifying additional parameters to estimate.

modindices(model04.fit)

[1] lhs      op       rhs      mi       epc      sepc.lv  sepc.all sepc.nox
<0 rows> (or 0-length row.names)

Calculating R-Squared Values for Dependent Variables

We compute the proportion of variance explained for the dependent variables in the model.

inspect(model04.fit, what="r2")

 perf   use 
0.096 0.003

Visualizing the Model

We generate path diagrams to illustrate the estimated relationships.

Path Diagram Without Estimates

semPaths(model04.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, what ="path")

Path Diagram With Estimates

semPaths(model04.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, whatLabels ="est")

Addressing the Normality Issue with `female`

Since female is not normally distributed, we will need to use factored regression and the EM algorithm to correct for this issue in subsequent models.

Handling Non-Normal `female` Using Factored Regression and the EM Algorithm

In this section, we use factored regression and the Expectation-Maximization (EM) algorithm to handle the non-normal distribution of female. The factorization follows:

\[f(use, perf, cc10, female) = f(use|perf, cc10, female) * f(perf|cc10, female) * f(cc10|female) * f(female)\]

Initializing Model Parameters

We set initial values for the regression coefficients in the factored regression framework.

# f(female):
female_intercept = 0

# f(cc10|female) 
cc10_intercept = 0
cc10_female = 1

# f(perf|cc10, female):
perf_intercept = 0
perf_cc10 = 1
perf_female = 1

# f(use|perf, cc10, female):
use_intercept = 0
use_perf = 1
use_cc10 = 1
use_female = 1

Identifying Missing Data

We determine which observations have missing values in each variable.

femaleMissing = which(is.na(data03$female))
cc10Missing = which(is.na(data03$cc10))
perfMissing = which(is.na(data03$perf))
useMissing = which(is.na(data03$use))

Setting Up the EM Algorithm

We create a working copy of the dataset and set the number of iterations.

algorithmData = data03
maxIterations = 10
iteration = 1

EM Algorithm: Expectation and Maximization Steps

We iterate through the EM process, imputing missing data in the E-step and re-estimating parameters in the M-step.

while(iteration < maxIterations){
  # E-step: Impute missing values
  algorithmData$female[femaleMissing] = 
    round(exp(female_intercept)/(1+exp(female_intercept)), 0)
  
  algorithmData$cc10[cc10Missing] = 
    cc10_intercept + 
    cc10_female*algorithmData$female[cc10Missing]
  
  algorithmData$perf[perfMissing] =
    perf_intercept + 
    perf_cc10*algorithmData$cc10[perfMissing] + 
    perf_female*algorithmData$female[perfMissing]
  
  algorithmData$use[useMissing] =
    use_intercept + 
    use_perf*algorithmData$perf[useMissing] + 
    use_cc10*algorithmData$cc10[useMissing] +
    use_female*algorithmData$female[useMissing]
  
  # M-step: Update regression coefficients
  femaleModel = glm(female ~ 1, data = algorithmData, family = binomial(link="logit"))
  female_intercept = femaleModel$coefficients[1]
  logLikeFemale = dbinom(
    x = algorithmData$female, 
    size = 1, 
    prob = exp(female_intercept)/(1+exp(female_intercept)), 
    log = TRUE
  )
  
  cc10Model = lm(cc10 ~ female, data = algorithmData)
  cc10_intercept = cc10Model$coefficients["(Intercept)"]
  cc10_female = cc10Model$coefficients["female"]
  cc10_residVar = anova(cc10Model)$`Mean Sq`[2]
  logLikeCC10 = dnorm(
    x = algorithmData$cc10, 
    mean = cc10_intercept + cc10_female*algorithmData$female,
    sd = sqrt(cc10_residVar),
    log = TRUE
  )
  
  perfModel = lm(perf ~ cc10 + female, data = algorithmData)
  perf_intercept = perfModel$coefficients["(Intercept)"]
  perf_cc10 = perfModel$coefficients["cc10"]
  perf_female = perfModel$coefficients["female"]
  perf_residVar = anova(perfModel)$`Mean Sq`[2]
  logLikePerf = dnorm(
    x = algorithmData$perf, 
    mean = perf_intercept + perf_cc10*algorithmData$cc10 + perf_female*algorithmData$female,
    sd = sqrt(perf_residVar),
    log = TRUE
  )
  
  useModel = lm(use ~ perf + cc10 + female, data = algorithmData)
  use_intercept = useModel$coefficients["(Intercept)"]
  use_perf = useModel$coefficients["perf"]
  use_cc10 = useModel$coefficients["cc10"]
  use_female = useModel$coefficients["female"]
  use_residVar = anova(useModel)$`Mean Sq`[2]
  logLikeUse = dnorm(
    x = algorithmData$use, 
    mean = use_intercept + use_perf*algorithmData$perf + use_cc10*algorithmData$cc10 + use_female*algorithmData$female,
    sd = sqrt(use_residVar),
    log = TRUE
  )
  
  # Compute total log-likelihood
  logLikeMatrix = cbind(
    logLikeFemale,
    logLikeCC10,
    logLikePerf,
    logLikeUse
  )
  loglik = sum(rowSums(logLikeMatrix))
  
  # Print iteration number and log-likelihood
  cat("Iteration: ", iteration, "Loglikelihood: ", loglik, "\n")
  iteration = iteration + 1
}

Iteration:  1 Loglikelihood:  -4261.033 
Iteration:  2 Loglikelihood:  -7431.362 
Iteration:  3 Loglikelihood:  -4041.548 
Iteration:  4 Loglikelihood:  -3955.812 
Iteration:  5 Loglikelihood:  -3942.861 
Iteration:  6 Loglikelihood:  -3939.976 
Iteration:  7 Loglikelihood:  -3939.23 
Iteration:  8 Loglikelihood:  -3939.026 
Iteration:  9 Loglikelihood:  -3938.97

Summarizing the Final Models

After running the EM algorithm, we summarize the estimated models.

summary(femaleModel)


Call:
glm(formula = female ~ 1, family = binomial(link = "logit"), 
    data = algorithmData)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7407     0.1143   6.479 9.24e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 440.3  on 349  degrees of freedom
Residual deviance: 440.3  on 349  degrees of freedom
AIC: 442.3

Number of Fisher Scoring iterations: 4

summary(cc10Model)


Call:
lm(formula = cc10 ~ female, data = algorithmData)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.5496  -2.9193   0.0002   2.5035  18.6038 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4602     0.5201  -0.885    0.377
female        0.1682     0.6320   0.266    0.790

Residual standard error: 5.528 on 348 degrees of freedom
Multiple R-squared:  0.0002035, Adjusted R-squared:  -0.002669 
F-statistic: 0.07083 on 1 and 348 DF,  p-value: 0.7903

summary(perfModel)


Call:
lm(formula = perf ~ cc10 + female, data = algorithmData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.3056 -1.2325 -0.0002  1.2760  7.7529 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.77696    0.22849  64.673  < 2e-16 ***
cc10         0.08478    0.02352   3.604 0.000359 ***
female      -1.35005    0.27738  -4.867 1.72e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.426 on 347 degrees of freedom
Multiple R-squared:  0.09443,   Adjusted R-squared:  0.08921 
F-statistic: 18.09 on 2 and 347 DF,  p-value: 3.356e-08

summary(useModel)


Call:
lm(formula = use ~ perf + cc10 + female, data = algorithmData)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.437  -7.195   0.000   8.537  50.326 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.56814    4.94663   6.584 1.70e-10 ***
perf         1.31400    0.32168   4.085 5.48e-05 ***
cc10        -0.13905    0.14358  -0.968    0.333    
female       0.01475    1.71790   0.009    0.993    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.54 on 346 degrees of freedom
Multiple R-squared:  0.04906,   Adjusted R-squared:  0.04082 
F-statistic:  5.95 on 3 and 346 DF,  p-value: 0.0005763

Adding Interactions Using the Just-Another-Variable Approach

In this section, we introduce an interaction term between female and cc10 within the SEM framework. The interaction is treated as a separate variable, allowing lavaan to estimate its mean, variance, and covariances.

Creating the Interaction Variable

We first compute the interaction term as the product of female and cc10.

data03$femXcc10 = data03$female * data03$cc10

Specifying the Model Syntax

This model includes femXcc10 as an exogenous variable, alongside cc10 and female. The model estimates means, variances, and covariances for these variables, as well as their effects on perf and use.

model05.syntax = "

# Exogenous Variables 
#Mean:
cc10 ~ 1
female ~ 1
femXcc10 ~ 1

#Variances:
cc10 ~~ cc10
female ~~ female
femXcc10 ~~ femXcc10

#Covariances:
cc10 ~~ female
cc10 ~~ femXcc10
female ~~ femXcc10

# Endogenous Variables 
#Regressions:
perf ~ 1 + female + cc10 + femXcc10
use  ~ 1 + female + cc10 + femXcc10

#Residual Variances:
perf ~~ perf
use  ~~ use

#Residual Covariance:
perf ~~ use
"

Estimating the Model

We fit the model using lavaan, incorporating the interaction term as an exogenous predictor.

model05.fit = lavaan(model05.syntax, data=data03, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   some observed variances are (at least) a factor 1000 times larger than 
   others; use varTable(fit) to investigate

Summarizing the Model Results

We extract and examine the model summary, including standardized estimates and fit measures.

summary(model05.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 94 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        20

  Number of observations                           350
  Number of missing patterns                        15

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                               221.450     194.386
  Degrees of freedom                                10          10
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.139

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3430.447   -3430.447
  Loglikelihood unrestricted model (H1)      -3430.447   -3430.447
                                                                  
  Akaike (AIC)                                6900.893    6900.893
  Bayesian (BIC)                              6978.052    6978.052
  Sample-size adjusted Bayesian (SABIC)       6914.605    6914.605

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~                                                                
    female           -1.398    0.344   -4.065    0.000   -1.398   -0.234
    cc10              0.061    0.052    1.178    0.239    0.061    0.142
    femXcc10          0.053    0.066    0.805    0.421    0.053    0.092
  use ~                                                                 
    female           -1.856    2.201   -0.843    0.399   -1.856   -0.054
    cc10             -0.105    0.254   -0.412    0.680   -0.105   -0.042
    femXcc10          0.167    0.331    0.505    0.613    0.167    0.050

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  female ~~                                                             
    cc10              0.103    0.209    0.494    0.621    0.103    0.033
  cc10 ~~                                                               
    femXcc10         24.567    3.020    8.134    0.000   24.567    0.749
  female ~~                                                             
    femXcc10         -0.005    0.124   -0.043    0.966   -0.005   -0.002
 .perf ~~                                                               
   .use               9.298    2.784    3.340    0.001    9.298    0.207

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             -0.208    0.424   -0.490    0.624   -0.208   -0.031
    female            0.643    0.027   23.966    0.000    0.643    1.344
    femXcc10         -0.016    0.348   -0.046    0.963   -0.016   -0.003
   .perf             14.737    0.258   57.048    0.000   14.737    5.157
   .use              51.972    1.750   29.696    0.000   51.972    3.131

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             43.503    3.837   11.337    0.000   43.503    1.000
    female            0.229    0.008   29.729    0.000    0.229    1.000
    femXcc10         24.733    3.032    8.157    0.000   24.733    1.000
   .perf              7.345    0.644   11.411    0.000    7.345    0.899
   .use             274.374   24.620   11.144    0.000  274.374    0.996

Inspecting Model-Implied Mean Vector and Covariance Matrix

We check the expected mean vector and covariance matrix under the estimated model.

fitted(model05.fit)

$cov
            perf     use  female    cc10  fmXc10
perf       8.168                                
use        9.953 275.516                        
female    -0.314  -0.437   0.229                
cc10       3.824  -0.637   0.103  43.503        
femXcc10   2.822   1.576  -0.005  24.567  24.733

$mean
    perf      use   female     cc10 femXcc10 
  13.825   50.797    0.643   -0.208   -0.016

Examining the Saturated Model Statistics

We inspect the sample statistics from the fully saturated model.

inspect(model05.fit, what="sampstat.h1")

$cov
            perf     use  female    cc10  fmXc10
perf       8.168                                
use        9.953 275.516                        
female    -0.314  -0.437   0.229                
cc10       3.824  -0.637   0.103  43.503        
femXcc10   2.822   1.576  -0.005  24.567  24.733

$mean
    perf      use   female     cc10 femXcc10 
  13.825   50.797    0.643   -0.208   -0.016

Assessing Residuals

We evaluate the raw residuals to assess model fit discrepancies.

residuals(model05.fit, type = "raw")

$type
[1] "raw"

$cov
         perf use female cc10 fmXc10
perf        0                       
use         0   0                   
female      0   0      0            
cc10        0   0      0    0       
femXcc10    0   0      0    0      0

$mean
    perf      use   female     cc10 femXcc10 
       0        0        0        0        0

We also check the normalized residuals for further model diagnostics.

residuals(model05.fit, type = "normalized")

$type
[1] "normalized"

$cov
         perf use female cc10 fmXc10
perf        0                       
use         0   0                   
female      0   0      0            
cc10        0   0      0    0       
femXcc10    0   0      0    0      0

$mean
    perf      use   female     cc10 femXcc10 
       0        0        0        0        0

Checking Modification Indices

The modindices() function suggests potential model improvements by identifying additional parameters to estimate.

modindices(model05.fit)

[1] lhs      op       rhs      mi       epc      sepc.lv  sepc.all sepc.nox
<0 rows> (or 0-length row.names)

Calculating R-Squared Values for Dependent Variables

We compute the proportion of variance explained for the dependent variables in the model.

inspect(model05.fit, what="r2")

 perf   use 
0.101 0.004

Visualizing the Model

We generate path diagrams to illustrate the estimated relationships.

Path Diagram Without Estimates

semPaths(model05.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, what ="path")

Path Diagram With Estimates

semPaths(model05.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, whatLabels ="est")

Factored Regression with Interaction Terms Using the EM Algorithm

In this section, we use factored regression and the Expectation-Maximization (EM) algorithm to estimate missing values while incorporating an interaction term (femXcc10). The model follows:

[ f(use, perf, cc10, female, femXcc10) = f(use|perf, cc10, female, femXcc10) * f(perf|cc10, female, femXcc10) * f(cc10|female) * f(female) ]

Initializing Model Parameters

We set initial values for the regression coefficients in the factored regression framework.

# f(female):
female_intercept = 0

# f(cc10|female) 
cc10_intercept = 0
cc10_female = 1

# f(perf|cc10, female):
perf_intercept = 0
perf_cc10 = 1
perf_female = 1
perf_femXcc10 = 1

# f(use|perf, cc10, female, femXcc10):
use_intercept = 0
use_perf = 1
use_cc10 = 1
use_female = 1
use_femXcc10 = 1

Identifying Missing Data

We determine which observations have missing values in each variable.

femaleMissing = which(is.na(data03$female))
cc10Missing = which(is.na(data03$cc10))
perfMissing = which(is.na(data03$perf))
useMissing = which(is.na(data03$use))

Setting Up the EM Algorithm

We create a working copy of the dataset and set the number of iterations.

algorithmData = data03
maxIterations = 10
iteration = 1

EM Algorithm: Expectation and Maximization Steps

We iterate through the EM process, imputing missing data in the E-step and re-estimating parameters in the M-step.

while(iteration < maxIterations){
  # E-step: Impute missing values
  algorithmData$female[femaleMissing] = 
    round(exp(female_intercept)/(1+exp(female_intercept)), 0)
  
  algorithmData$cc10[cc10Missing] = 
    cc10_intercept + 
    cc10_female*algorithmData$female[cc10Missing]
  
  algorithmData$perf[perfMissing] =
    perf_intercept + 
    perf_cc10*algorithmData$cc10[perfMissing] + 
    perf_female*algorithmData$female[perfMissing] +
    perf_femXcc10*algorithmData$cc10[perfMissing]*algorithmData$female[perfMissing]
  
  algorithmData$use[useMissing] =
    use_intercept + 
    use_perf*algorithmData$perf[useMissing] + 
    use_cc10*algorithmData$cc10[useMissing] +
    use_female*algorithmData$female[useMissing] +
    use_femXcc10*algorithmData$cc10[useMissing]*algorithmData$female[useMissing]
  
  # M-step: Update regression coefficients
  femaleModel = glm(female ~ 1, data = algorithmData, family = binomial(link="logit"))
  female_intercept = femaleModel$coefficients[1]
  
  cc10Model = lm(cc10 ~ female, data = algorithmData)
  cc10_intercept = cc10Model$coefficients["(Intercept)"]
  cc10_female = cc10Model$coefficients["female"]
  
  perfModel = lm(perf ~ cc10 + female + cc10:female, data = algorithmData)
  perf_intercept = perfModel$coefficients["(Intercept)"]
  perf_cc10 = perfModel$coefficients["cc10"]
  perf_female = perfModel$coefficients["female"]
  perf_femXcc10 = perfModel$coefficients["cc10:female"]
  
  useModel = lm(use ~ perf + cc10 + female + cc10:female, data = algorithmData)
  use_intercept = useModel$coefficients["(Intercept)"]
  use_perf = useModel$coefficients["perf"]
  use_cc10 = useModel$coefficients["cc10"]
  use_female = useModel$coefficients["female"]
  use_femXcc10 = useModel$coefficients["cc10:female"]
  
  # Compute total log-likelihood
  loglik = sum(
    dbinom(algorithmData$female, size = 1, prob = exp(female_intercept)/(1+exp(female_intercept)), log = TRUE) +
    dnorm(algorithmData$cc10, mean = cc10_intercept + cc10_female*algorithmData$female, sd = sqrt(var(algorithmData$cc10)), log = TRUE) +
    dnorm(algorithmData$perf, mean = perf_intercept + perf_cc10*algorithmData$cc10 + perf_female*algorithmData$female + perf_femXcc10*algorithmData$cc10*algorithmData$female, sd = sqrt(var(algorithmData$perf)), log = TRUE) +
    dnorm(algorithmData$use, mean = use_intercept + use_perf*algorithmData$perf + use_cc10*algorithmData$cc10 + use_female*algorithmData$female + use_femXcc10*algorithmData$cc10*algorithmData$female, sd = sqrt(var(algorithmData$use)), log = TRUE)
  )
  
  # Print iteration number and log-likelihood
  cat("Iteration: ", iteration, "Loglikelihood: ", loglik, "\n")
  iteration = iteration + 1
}

Iteration:  1 Loglikelihood:  -4064.091 
Iteration:  2 Loglikelihood:  -3621.814 
Iteration:  3 Loglikelihood:  -3556.724 
Iteration:  4 Loglikelihood:  -3552.105 
Iteration:  5 Loglikelihood:  -3551.822 
Iteration:  6 Loglikelihood:  -3551.806 
Iteration:  7 Loglikelihood:  -3551.806 
Iteration:  8 Loglikelihood:  -3551.806 
Iteration:  9 Loglikelihood:  -3551.806

Summarizing the Final Models

After running the EM algorithm, we summarize the estimated models.

summary(femaleModel)


Call:
glm(formula = female ~ 1, family = binomial(link = "logit"), 
    data = algorithmData)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7407     0.1143   6.479 9.24e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 440.3  on 349  degrees of freedom
Residual deviance: 440.3  on 349  degrees of freedom
AIC: 442.3

Number of Fisher Scoring iterations: 4

summary(cc10Model)


Call:
lm(formula = cc10 ~ female, data = algorithmData)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.5496  -2.9193   0.0002   2.5035  18.6038 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4602     0.5201  -0.885    0.377
female        0.1682     0.6320   0.266    0.790

Residual standard error: 5.528 on 348 degrees of freedom
Multiple R-squared:  0.0002035, Adjusted R-squared:  -0.002669 
F-statistic: 0.07083 on 1 and 348 DF,  p-value: 0.7903

summary(perfModel)


Call:
lm(formula = perf ~ cc10 + female + cc10:female, data = algorithmData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7097 -1.1405 -0.0001  1.2670  7.7489 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.76087    0.22900  64.459  < 2e-16 ***
cc10         0.06264    0.03695   1.695   0.0909 .  
female      -1.32452    0.27818  -4.761 2.83e-06 ***
cc10:female  0.04068    0.04794   0.849   0.3967    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.428 on 346 degrees of freedom
Multiple R-squared:  0.09696,   Adjusted R-squared:  0.08913 
F-statistic: 12.38 on 3 and 346 DF,  p-value: 1.034e-07

summary(useModel)


Call:
lm(formula = use ~ perf + cc10 + female + cc10:female, data = algorithmData)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.797  -7.371   0.000   8.547  50.678 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.56669    4.95164   6.577 1.78e-10 ***
perf         1.31101    0.32231   4.068 5.89e-05 ***
cc10        -0.22998    0.22245  -1.034    0.302    
female       0.07482    1.72153   0.043    0.965    
cc10:female  0.15229    0.28769   0.529    0.597    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.55 on 345 degrees of freedom
Multiple R-squared:  0.05009,   Adjusted R-squared:  0.03908 
F-statistic: 4.548 on 4 and 345 DF,  p-value: 0.001363

Incorporating Auxiliary Variables in the SEM Framework

In this section, we extend our SEM model by including auxiliary variables (hsl, msc, mas, mse). These variables help improve estimation by capturing additional variance and reducing bias due to missing data.

Specifying the Model Syntax

We define a model that includes the auxiliary variables and estimates their means, variances, and covariances.

model06.syntax = "

# Exogenous Variables (now with auxiliary variables)
#Means:
cc10 ~ 1
female ~ 1

# Auxiliary variables
hsl ~ 1
msc ~ 1
mas ~ 1
mse ~ 1

#Variances:
cc10 ~~ cc10
female ~~ female

# Auxiliary variables:
hsl ~~ hsl
msc ~~ msc
mas ~~ mas
mse ~~ mse

#Covariances:
cc10 ~~ female + hsl + msc + mas + mse
female ~~ hsl + msc + mas + mse
hsl ~~ msc + mas + mse
msc ~~ mas + mse
mas ~~ mse

# Endogenous Variables 
#Regressions:
perf ~ 1 + female + cc10
use  ~ 1 + female + cc10

#Residual Variances:
perf ~~ perf
use  ~~ use

#Residual Covariance:
perf ~~ use

#Endogenous/exogenous covariances
perf ~~ hsl + msc + mas + mse
use  ~~ hsl + msc + mas + mse
"

Estimating the Model

We fit the model using lavaan, incorporating auxiliary variables into the SEM framework.

model06.fit = lavaan(model06.syntax, data=data03, estimator = "MLR", mimic = "MPLUS")

Warning: lavaan->lav_data_full():  
   some observed variances are (at least) a factor 1000 times larger than 
   others; use varTable(fit) to investigate

Summarizing the Model Results

We extract and examine the model summary, including standardized estimates and fit measures.

summary(model06.fit, standardized=TRUE, fit.measures=TRUE)

lavaan 0.6-19 ended normally after 320 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        44

  Number of observations                           350
  Number of missing patterns                        15

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Model Test Baseline Model:

  Test statistic                              1200.876    1186.405
  Degrees of freedom                                28          28
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.012

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.000       1.000
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -6976.618   -6976.618
  Loglikelihood unrestricted model (H1)      -6976.618   -6976.618
                                                                  
  Akaike (AIC)                               14041.235   14041.235
  Bayesian (BIC)                             14210.984   14210.984
  Sample-size adjusted Bayesian (SABIC)      14071.400   14071.400

Root Mean Square Error of Approximation:

  RMSEA                                          0.000          NA
  90 Percent confidence interval - lower         0.000          NA
  90 Percent confidence interval - upper         0.000          NA
  P-value H_0: RMSEA <= 0.050                       NA          NA
  P-value H_0: RMSEA >= 0.080                       NA          NA
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.000
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000       0.000

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  perf ~                                                                
    female           -1.348    0.323   -4.177    0.000   -1.348   -0.230
    cc10              0.077    0.031    2.472    0.013    0.077    0.181
  use ~                                                                 
    female           -1.631    2.130   -0.766    0.444   -1.631   -0.047
    cc10              0.090    0.154    0.588    0.557    0.090    0.036

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  female ~~                                                             
    cc10              0.110    0.207    0.533    0.594    0.110    0.035
  cc10 ~~                                                               
    hsl               1.667    0.475    3.506    0.000    1.667    0.217
    msc              18.586    6.893    2.696    0.007   18.586    0.167
    mas              10.186    4.452    2.288    0.022   10.186    0.139
    mse              16.436    4.361    3.769    0.000   16.436    0.225
  female ~~                                                             
    hsl              -0.023    0.031   -0.758    0.448   -0.023   -0.042
    msc              -1.470    0.444   -3.313    0.001   -1.470   -0.182
    mas              -1.319    0.296   -4.461    0.000   -1.319   -0.247
    mse              -1.471    0.289   -5.094    0.000   -1.471   -0.278
  hsl ~~                                                                
    msc               9.654    1.226    7.874    0.000    9.654    0.488
    mas               5.586    0.783    7.134    0.000    5.586    0.429
    mse               6.455    0.773    8.353    0.000    6.455    0.499
  msc ~~                                                                
    mas             163.789   13.358   12.262    0.000  163.789    0.867
    mse             120.603   11.970   10.075    0.000  120.603    0.643
  mas ~~                                                                
    mse              77.324    8.012    9.651    0.000   77.324    0.625
 .perf ~~                                                               
   .use              10.321    2.658    3.883    0.000   10.321    0.230
    hsl               1.231    0.203    6.067    0.000    1.231    0.393
    msc              23.991    2.990    8.025    0.000   23.991    0.528
    mas              15.561    1.912    8.139    0.000   15.561    0.519
    mse              19.275    1.775   10.857    0.000   19.275    0.648
 .use ~~                                                                
    hsl               3.185    1.144    2.785    0.005    3.185    0.163
    msc             150.109   18.284    8.210    0.000  150.109    0.531
    mas              78.455   11.116    7.058    0.000   78.455    0.421
    mse              61.943   10.620    5.833    0.000   61.943    0.335

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             -0.171    0.420   -0.408    0.683   -0.171   -0.026
    female            0.643    0.027   24.042    0.000    0.643    1.344
    hsl               4.882    0.062   78.246    0.000    4.882    4.182
    msc              49.951    0.905   55.199    0.000   49.951    2.951
    mas              31.902    0.596   53.489    0.000   31.902    2.859
    mse              73.414    0.592  123.941    0.000   73.414    6.625
   .perf             14.786    0.246   60.019    0.000   14.786    5.273
   .use              52.513    1.693   31.019    0.000   52.513    3.138

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    cc10             43.432    3.835   11.326    0.000   43.432    1.000
    female            0.229    0.008   29.677    0.000    0.229    1.000
    hsl               1.363    0.097   14.031    0.000    1.363    1.000
    msc             286.613   21.544   13.304    0.000  286.613    1.000
    mas             124.504    9.653   12.898    0.000  124.504    1.000
    mse             122.801    8.763   14.013    0.000  122.801    1.000
   .perf              7.210    0.595   12.128    0.000    7.210    0.917
   .use             279.096   24.901   11.208    0.000  279.096    0.997

Inspecting Model-Implied Mean Vector and Covariance Matrix

We check the expected mean vector and covariance matrix under the estimated model.

fitted(model06.fit)

$cov
          perf     use  female    cc10     hsl     msc     mas     mse
perf     7.862                                                        
use     11.099 280.027                                                
female  -0.300  -0.363   0.229                                        
cc10     3.202   3.742   0.110  43.432                                
hsl      1.392   3.374  -0.023   1.667   1.363                        
msc     27.407 154.185  -1.470  18.586   9.654 286.613                
mas     18.125  81.527  -1.319  10.186   5.586 163.789 124.504        
mse     22.526  65.827  -1.471  16.436   6.455 120.603  77.324 122.801

$mean
  perf    use female   cc10    hsl    msc    mas    mse 
13.906 51.449  0.643 -0.171  4.882 49.951 31.902 73.414

Examining the Saturated Model Statistics

We inspect the sample statistics from the fully saturated model.

inspect(model06.fit, what="sampstat.h1")

$cov
          perf     use  female    cc10     hsl     msc     mas     mse
perf     7.862                                                        
use     11.099 280.027                                                
female  -0.300  -0.363   0.229                                        
cc10     3.202   3.742   0.110  43.432                                
hsl      1.392   3.374  -0.023   1.667   1.363                        
msc     27.407 154.185  -1.470  18.586   9.654 286.613                
mas     18.125  81.527  -1.319  10.186   5.586 163.789 124.504        
mse     22.526  65.827  -1.471  16.436   6.455 120.603  77.324 122.801

$mean
  perf    use female   cc10    hsl    msc    mas    mse 
13.906 51.449  0.643 -0.171  4.882 49.951 31.902 73.414

Assessing Residuals

We evaluate the raw residuals to assess model fit discrepancies.

residuals(model06.fit, type = "raw")

$type
[1] "raw"

$cov
       perf use female cc10 hsl msc mas mse
perf      0                                
use       0   0                            
female    0   0      0                     
cc10      0   0      0    0                
hsl       0   0      0    0   0            
msc       0   0      0    0   0   0        
mas       0   0      0    0   0   0   0    
mse       0   0      0    0   0   0   0   0

$mean
  perf    use female   cc10    hsl    msc    mas    mse 
     0      0      0      0      0      0      0      0

We also check the normalized residuals for further model diagnostics.

residuals(model06.fit, type = "normalized")

$type
[1] "normalized"

$cov
       perf use female cc10 hsl msc mas mse
perf      0                                
use       0   0                            
female    0   0      0                     
cc10      0   0      0    0                
hsl       0   0      0    0   0            
msc       0   0      0    0   0   0        
mas       0   0      0    0   0   0   0    
mse       0   0      0    0   0   0   0   0

$mean
  perf    use female   cc10    hsl    msc    mas    mse 
     0      0      0      0      0      0      0      0

Checking Modification Indices

The modindices() function suggests potential model improvements by identifying additional parameters to estimate.

modindices(model06.fit)

[1] lhs      op       rhs      mi       epc      sepc.lv  sepc.all sepc.nox
<0 rows> (or 0-length row.names)

Calculating R-Squared Values for Dependent Variables

We compute the proportion of variance explained for the dependent variables in the model.

inspect(model06.fit, what="r2")

 perf   use 
0.083 0.003

Visualizing the Model

We generate path diagrams to illustrate the estimated relationships.

Path Diagram Without Estimates

semPaths(model06.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, what ="path")

Path Diagram With Estimates

semPaths(model06.fit, intercepts = FALSE, residuals = TRUE, style="mx", layout="tree", rotation=2,
         optimizeLatRes = TRUE, sizeMan = 8, whatLabels ="est")

Analysis Strategy Decision Strategy

This text-based flowchart outlines the decision-making process for handling missing data in analysis.

Have missing data?
- No → Conduct analysis as usual.
- Yes → Proceed to the next step.
  - Minimum goal: Build an analysis that meets the assumptions of Missing at Random (MAR). For Maximum Likelihood Estimation, use likelihoods to your benefit.
Are all variables with missing data (in your model and auxiliary variables) plausibly continuous (so that you can assume they follow multivariate normal distributions)?
- Yes → Use SEM framework (lavaan) where all variables (including auxiliary variables) are in the likelihood function.
- No → Use factored regression approach (likely not EM algorithm unless you are savvy at coding), as implemented in Bayesian methods (next lecture). Or use Multiple Imputation methods that treat non-normal data appropriately.

This structured approach ensures a systematic decision-making process for handling missing data in analysis.

Introduction

Loading Required Packages

Simulating Data for the Example

Setting the Seed

Defining Sample Sizes

Defining Mean Vectors

Defining the Correlation Matrix

Defining Standard Deviations and Constructing the Covariance Matrix

Generating Multivariate Normal Data

Adding Group Identifiers

Combining and Formatting the Dataset

Introducing Missingness in Key Variables

Introducing Missingness in cc

Introducing Missingness in perf

Introducing Missingness in use

Converting the Data into a Data Frame

Centering cc and Initial Data Exploration

Visualizing cc

Centering cc

Creating an Interaction Term

Preparing Data for Analysis

Checking for Completely Missing Rows

Building Analysis Model #1: Empty Model

Checking Missing Data Patterns

Defining the Model Syntax

Model Estimation Using lavaan

Comparing Model-Implied and Observed Covariance Matrices

Inspecting Model Residuals

Checking R-Squared Values

Visualizing the Model

Verifying Log-Likelihood Calculation

Manual Log-Likelihood Calculation

Log-Likelihood from lavaan

Comparing Factored Regression Specifications

Model 1b: perf ~ use and use ~ 1

Model 1c: use ~ perf and perf ~ 1

Model 2: Adding Predictors as Main Effects

Specifying the Model Syntax

Estimating the Model

Summarizing the Model Results

Inspecting Model-Implied Mean Vector and Covariance Matrix

Examining the Saturated Model Statistics

Assessing Residuals

Checking Modification Indices

Calculating R-Squared Values for Dependent Variables

Visualizing the Model

Path Diagram Without Estimates

Path Diagram With Estimates

Model 3: Expanding the Likelihood Function

Specifying the Model Syntax

Estimating the Model

Summarizing the Model Results

Inspecting Model-Implied Mean Vector and Covariance Matrix

Examining the Saturated Model Statistics

Assessing Residuals

Checking Modification Indices

Calculating R-Squared Values for Dependent Variables

Visualizing the Model

Path Diagram Without Estimates

Path Diagram With Estimates

Handling Missing Data in a Categorical Variable (female)

Creating Missing Values in female

Running the Model with Missing female

Model 4: Adding female to the Likelihood Function

Specifying the Model Syntax

Estimating the Model

Summarizing the Model Results

Inspecting Model-Implied Mean Vector and Covariance Matrix

Examining the Saturated Model Statistics

Assessing Residuals

Checking Modification Indices

Calculating R-Squared Values for Dependent Variables

Visualizing the Model

Path Diagram Without Estimates

Path Diagram With Estimates

Addressing the Normality Issue with female

Handling Non-Normal female Using Factored Regression and the EM Algorithm

Initializing Model Parameters

Identifying Missing Data

Setting Up the EM Algorithm

Introducing Missingness in `cc`

Introducing Missingness in `perf`

Introducing Missingness in `use`

Centering `cc` and Initial Data Exploration

Visualizing `cc`

Centering `cc`

Model Estimation Using `lavaan`

Log-Likelihood from `lavaan`

Model 1b: `perf ~ use` and `use ~ 1`

Model 1c: `use ~ perf` and `perf ~ 1`

Handling Missing Data in a Categorical Variable (`female`)

Creating Missing Values in `female`

Running the Model with Missing `female`

Model 4: Adding `female` to the Likelihood Function

Addressing the Normality Issue with `female`

Handling Non-Normal `female` Using Factored Regression and the EM Algorithm