- Principles of Bayesian methods for psychometric models
- Comparisons with maximum likelihood methods

- Notation:
- \(\mathbf{x}\) represents data
- \(\boldsymbol{\theta}\) represents model parameters
- \(p \left(\mathbf{x} \mid \boldsymbol{\theta} \right)\) is the conditional distribution of data given model parameters
- \(L \left(\boldsymbol{\theta} \mid \mathbf{x} \right) = p \left(\mathbf{x} \mid \boldsymbol{\theta} \right)\) is the likelihood function

The likelihood function in ML is equal to the conditional distribution of data given model parameters in Bayesian - Expresses that in ML, once data are observed, they are treated as known

\[L \left(\boldsymbol{\theta} \mid \mathbf{x} \right) = p \left(\mathbf{x} \mid \boldsymbol{\theta} \right)\]

- Data are random
- Parameters are fixed/known in ML
- “What are the values of the parameters that yield the highest probability of observing the values of data that were observed”
- The ML estimator is a function of the data
- With the data having a distribution, the ML estimator is now a random variable

- Standard errors give uncertainty in estimates
- Asymptotic theory gives rise to calculations of:
- Parameter estimates
- Standard errors
- The assumed shape of the sampling distributions of parameters
- The test distribution for parameters
- The confidence intervals for parameters

- Parameter point estimates, SEs, and confidence intervals are fixed, not random
- Cannot be discussed probabilistically

- Probability (or uncertainty) enters when talking about repeated sampling of the data
- Statements of uncertainty refer to the parameter estimator, not the parameter itself
- Supports deductive arguments from parameters (general) to the data (particulars)
- Does not support inductive arguments from data (particulars) to general (parameters)

- Key is that \(L \left(\boldsymbol{\theta} \mid \mathbf{x} \right) = p \left(\mathbf{x} \mid \boldsymbol{\theta} \right)\)
- Likelihood function is central to Bayesian estimation

- Treats all parameters as random
- Model parameters are assigned prior distributions
- Prior distributions are statements of uncertainty

- Posterior distribution is blend of prior and likelihood

Imagine \(J\) independent and identically distributed Bernoulli random variables \(\mathbf{x} = \left(x_1, \ldots, x_J \right)\), each with probability parameter \(\theta\).

Conditional distribution of the data given the parameter for any one variable \(j\):

\[p\left( x_j \mid \theta \right) = \theta^{x_j}\left(1-\theta\right)^{1-x_j}\] Conditional distribution of the sample \(\mathbf{x}\):

\[p\left( \mathbf{x} \mid \theta \right) = \prod_{j=1}^{J} p\left( x_j \mid \theta \right) = \prod_{j=1}^{J} \theta^{x_j}\left(1-\theta\right)^{1-x_j}\] ### Binomial

Let \(y = \sum_{j=1}^J x_j\) be the number of 1s in \(\mathbf{x}\)

The conditional distribution of \(y\) given \(\theta\) and \(J\) is a binomial:

\[p \left(y \mid \theta, J \right) = {J \choose y} \theta^y\left(1-\theta \right)^{J-y}\]

Suppose \(y = 7\) and \(J=10\). Let’s compare ML vs. Bayesian

We could derive the MLE for \(\theta\) and find that it is:

\(\renewcommand{\hat}[1]{\widehat{#1}}\) \[\hat{\theta} = \frac{y}{J}\]

Under ML, we know that the MLE for \(\theta\) is \(.7\). We can plot the likelihood function \(L(\theta \mid y) = p \left(y \mid \theta, J \right)\) an confirm this:

```
plot(x = seq(0.001,.999,.001), y = dbinom(x = 7, size = 10, prob = seq(0.001,.999,.001)), type = "l", xlab = expression(theta), ylab = expression(paste("p(y|", theta, ",J)")))
lines(x = c(.7,.7), y=c(0,dbinom(x = 7, size = 10, prob = .7)), lty = 2)
```

For a Bayesian analysis, we need to plot the posterior distribution, or \(p \left(\theta \mid y, J \right) \propto p\left(y \mid \theta, J \right) p\left(\theta\right)\).

To do so, we must pick a prior distribution for \(\theta\), or \(p\left(\theta \right)\). I argue that finding good priors is difficult and not made any easier by example analyses like this one. But, we’ll continue.

Let’s pick a distribution with the same sample space (or support; range of values for the data) that \(\theta\) has. One such distribution is the uniform distribution. However, the uniform distribution doesn’t make calculations of the posterior very easy (but…we could use one!). Instead, we will use a beta distribution, or \(\theta \sim Beta\left( \alpha, \beta \right)\). You can change `alpha`

and `beta`

below to alter the shape

```
alpha = 1
beta = 1
x = seq(.001, .999, .001)
y = dbeta(x = x, shape1 = alpha, shape2 = beta)
plot(x = x, y = y, type = "l", xlab = expression(theta), ylab = expression(paste("p(", theta, ")")))
```

With the Beta distribution as a prior, it turns out the posterior is also a Beta distribution, making it a *conjugate* prior.

The posterior is then:

\[ p \left(\theta \mid y, J \right) = Beta \left( \theta \mid y + \alpha, J - y + \beta \right)\]

```
y = 7
J = 10
priorAlpha = 1
priorBeta = 1
posteriorAlpha = y + priorAlpha
posteriorBeta = J - y + priorBeta
xCoord = seq(.001, .999, .001)
yCoord = dbeta(x = xCoord, shape1 = posteriorAlpha, shape2 = posteriorBeta)
mapEstimate = xCoord[which(yCoord==max(yCoord))]
posteriorMode = (posteriorAlpha - 1)/(posteriorAlpha + posteriorBeta - 2)
posteriorMean = (posteriorAlpha)/(posteriorAlpha + posteriorBeta)
posteriorVariance = (posteriorAlpha*posteriorBeta)/(((posteriorAlpha + posteriorBeta)^2)*(posteriorAlpha + posteriorBeta + 1))
posteriorSD = sqrt(posteriorVariance)
plot(x = xCoord, y = yCoord, type = "l", xlab = expression(theta),
ylab = expression(paste("p(", theta, ")")), main = paste("Red = MAP = ", round(x = mapEstimate, digits = 2),
"; Green = EAP = ", round(x = posteriorMean, digits = 2),
"; SD = ", round(x = posteriorSD, digits = 2)))
lines(x = c(mapEstimate, mapEstimate), y = c(0, dbeta(x = mapEstimate, shape1 = posteriorAlpha, shape2 = posteriorBeta)), lty = 2, col=2)
lines(x = c(posteriorMean, posteriorMean), y = c(0, dbeta(x = posteriorMean, shape1 = posteriorAlpha, shape2 = posteriorBeta)), lty = 2, col=3)
```

Now, lets put all functions onto the same plot to compare: Prior, Posterior, and Likelihood

```
y = 7
J = 10
priorAlpha = 6
priorBeta = 6
posteriorAlpha = y + priorAlpha
posteriorBeta = J - y + priorBeta
xCoord = seq(.001, .999, .001)
yCoordLikelihood = dbinom(x = 7, size = 10, prob = xCoord)
yCoordPosterior = dbeta(x = xCoord, shape1 = posteriorAlpha, shape2 = posteriorBeta)
yCoordPrior = dbeta(x = xCoord, shape1 = priorAlpha, shape2 = priorBeta)
mapEstimate = xCoord[which(yCoord==max(yCoord))]
posteriorMode = (posteriorAlpha - 1)/(posteriorAlpha + posteriorBeta - 2)
posteriorMean = (posteriorAlpha)/(posteriorAlpha + posteriorBeta)
posteriorVariance = (posteriorAlpha*posteriorBeta)/(((posteriorAlpha + posteriorBeta)^2)*(posteriorAlpha + posteriorBeta + 1))
posteriorSD = sqrt(posteriorVariance)
yMat = cbind(yCoordPrior, yCoordLikelihood, yCoordPosterior)
xMat = cbind(xCoord, xCoord, xCoord)
matplot(x = xMat, y = yMat, type = "l", xlab = expression(theta),
ylab = expression(paste("p(", theta, ")")), main = paste("Red = MAP = ", round(x = mapEstimate, digits = 2),
"; Green = EAP = ", round(x = posteriorMean, digits = 2),
"; SD = ", round(x = posteriorSD, digits = 2)),
lwd = 3)
legend(x = 0, y = 1, legend = c("Prior", "Likelihood", "Posterior"), col = 1:3, lty = 1:3, lwd = 1:3)
```

Once the posterior distribution has been obtained, summaries of it can be used for inference - Central tendency: - Posterior mean (Expected A Posteriori) - Posterior mode (Maximum A Posteriori) - Posterior median (less often) - Note: Not all posterior distributions are symmetric (so using more than one summary is important) - Also, some may be multimodal

- Measures of variability:
- Posterior variance/SD
- Interval Summaries:
- Credible interval (CI) – based on percentiles of the posterior distribution
- Highest posterior density (HPD) – CI with shortest distance between upper and lower bound

With our Beta distribution, each of these can be obtained using the moments or the CDF

With MCMC, we’ll use brute force to summarize the posterior distribution

Often, Bayesian models are represented in a graphical model - Common type: Directed Acyclic Graph - Graph: A picture made of - Nodes: Variables - Edges: Relations between variables (X -> Y means node X predicts node Y) - Directed: All relations are directional (some variation about edges with two arrows) - Acyclic: No cycles present in a graph

There are any number of types of DAGs, and many Bayesians use differing types.

Two that are useful when initially setting up a Bayesian Model in syntax are given in Figure 2.7: