Highlights of Chapter

Review of Frequentist Inference with Maximum Likelihood

The likelihood function in ML is equal to the conditional distribution of the data given the model parameters used in Bayesian analysis. This expresses that, in ML, once the data are observed they are treated as known, and the same expression is read as a function of the parameters:

\[L \left(\boldsymbol{\theta} \mid \mathbf{x} \right) = p \left(\mathbf{x} \mid \boldsymbol{\theta} \right)\]

ML Specifics

ML Philosophy

Bayesian Specifics

Bernoulli and Binomial Models

First, Bernoulli:

Imagine \(J\) independent and identically distributed Bernoulli random variables \(\mathbf{x} = \left(x_1, \ldots, x_J \right)\), each with probability parameter \(\theta\).

Conditional distribution of the data given the parameter for any one variable \(j\):

\[p\left( x_j \mid \theta \right) = \theta^{x_j}\left(1-\theta\right)^{1-x_j}\]

Conditional distribution of the sample \(\mathbf{x}\):

\[p\left( \mathbf{x} \mid \theta \right) = \prod_{j=1}^{J} p\left( x_j \mid \theta \right) = \prod_{j=1}^{J} \theta^{x_j}\left(1-\theta\right)^{1-x_j}\]

Binomial

Let \(y = \sum_{j=1}^J x_j\) be the number of 1s in \(\mathbf{x}\).

The conditional distribution of \(y\) given \(\theta\) and \(J\) is a binomial:

\[p \left(y \mid \theta, J \right) = {J \choose y} \theta^y\left(1-\theta \right)^{J-y}\]
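To make the link between the Bernoulli sample and the binomial count concrete, here is a small numerical check in R. The sample x below is hypothetical (any ordering of seven 1s and three 0s would do); the product of the individual Bernoulli probabilities, multiplied by the binomial coefficient, should match dbinom().

# hypothetical Bernoulli sample with seven 1s and three 0s (the ordering is arbitrary)
x = c(1, 1, 1, 0, 1, 1, 0, 1, 1, 0)
J = length(x)
y = sum(x)
theta = 0.5

# likelihood of the full sample: product of the individual Bernoulli probabilities
bernoulliLikelihood = prod(theta^x * (1 - theta)^(1 - x))

# binomial pmf for the count y; differs from the product only by the binomial coefficient
binomialLikelihood = dbinom(x = y, size = J, prob = theta)

choose(J, y) * bernoulliLikelihood
binomialLikelihood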

Now, the data

Suppose \(y = 7\) and \(J = 10\). Let’s compare the ML and Bayesian analyses.

We could derive the MLE for \(\theta\) and find that it is:

\[\hat{\theta} = \frac{y}{J}\]
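A quick sketch of that derivation (the standard argument, included here for completeness): take the log of the binomial likelihood, differentiate with respect to \(\theta\), and set the derivative to zero.

\[\log p\left(y \mid \theta, J\right) = \log {J \choose y} + y \log \theta + \left(J-y\right) \log\left(1-\theta\right)\]

\[\frac{\partial \log p\left(y \mid \theta, J\right)}{\partial \theta} = \frac{y}{\theta} - \frac{J-y}{1-\theta} = 0 \quad \Rightarrow \quad \hat{\theta} = \frac{y}{J}\]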

Analysis with ML:

Under ML, we know that the MLE for \(\theta\) is \(0.7\). We can plot the likelihood function \(L(\theta \mid y) = p \left(y \mid \theta, J \right)\) and confirm this:

# likelihood function L(theta | y) = p(y | theta, J) evaluated over a grid of theta values
thetaGrid = seq(0.001, .999, .001)
plot(x = thetaGrid, y = dbinom(x = 7, size = 10, prob = thetaGrid), type = "l",
     xlab = expression(theta), ylab = expression(paste("p(y|", theta, ",J)")))
# dashed vertical line at the MLE, y/J = 0.7
lines(x = c(.7, .7), y = c(0, dbinom(x = 7, size = 10, prob = .7)), lty = 2)
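As a quick cross-check (not part of the plot above), we could also maximize the log-likelihood numerically; optimize() should land at \(\hat{\theta} = y/J = 0.7\) up to numerical tolerance.

# numerically maximize the log-likelihood over theta in (0, 1)
logLik = function(theta) dbinom(x = 7, size = 10, prob = theta, log = TRUE)
optimize(f = logLik, interval = c(0.001, 0.999), maximum = TRUE)$maximum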

Analysis with Bayesian Methods:

For a Bayesian analysis, we need to plot the posterior distribution, or \(p \left(\theta \mid y, J \right) \propto p\left(y \mid \theta, J \right) p\left(\theta\right)\).

To do so, we must pick a prior distribution for \(\theta\), or \(p\left(\theta \right)\). I argue that finding good priors is difficult and not made any easier by example analyses like this one. But we’ll continue.

Let’s pick a distribution with the same sample space (or support; range of possible values) that \(\theta\) has: the interval from 0 to 1. One such distribution is the uniform distribution. However, the uniform distribution doesn’t make calculations of the posterior very easy (but we could use one!). Instead, we will use a beta distribution, \(\theta \sim Beta\left( \alpha, \beta \right)\); note that \(Beta\left(1, 1\right)\) is just the uniform distribution. You can change alpha and beta below to alter the shape of the prior.

# Beta(alpha, beta) prior for theta; change alpha and beta to alter the shape
alpha = 1
beta = 1
x = seq(.001, .999, .001)
y = dbeta(x = x, shape1 = alpha, shape2 = beta)
plot(x = x, y = y, type = "l", xlab = expression(theta), ylab = expression(paste("p(", theta, ")")))

With a Beta distribution as the prior, it turns out the posterior is also a Beta distribution; this makes the Beta a conjugate prior for the binomial likelihood.

The posterior is then:

\[ p \left(\theta \mid y, J \right) = Beta \left( \theta \mid y + \alpha, J - y + \beta \right)\]
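To see why conjugacy holds here, a sketch of the usual kernel argument: multiply the binomial likelihood by the \(Beta\left(\alpha, \beta\right)\) prior and keep only the terms involving \(\theta\):

\[p\left(\theta \mid y, J\right) \propto \theta^{y}\left(1-\theta\right)^{J-y} \times \theta^{\alpha-1}\left(1-\theta\right)^{\beta-1} = \theta^{y+\alpha-1}\left(1-\theta\right)^{J-y+\beta-1},\]

which is the kernel of a \(Beta\left(y+\alpha, J-y+\beta\right)\) distribution.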

y = 7
J = 10
priorAlpha = 1
priorBeta = 1

# conjugate update: successes are added to alpha, failures to beta
posteriorAlpha = y + priorAlpha
posteriorBeta = J - y + priorBeta

xCoord = seq(.001, .999, .001)
yCoord = dbeta(x = xCoord, shape1 = posteriorAlpha, shape2 = posteriorBeta)

# MAP estimate found numerically from the plotting grid
mapEstimate = xCoord[which(yCoord==max(yCoord))]
# closed-form posterior mode, mean (EAP), variance, and standard deviation
posteriorMode = (posteriorAlpha - 1)/(posteriorAlpha + posteriorBeta - 2)
posteriorMean = (posteriorAlpha)/(posteriorAlpha + posteriorBeta)
posteriorVariance = (posteriorAlpha*posteriorBeta)/(((posteriorAlpha + posteriorBeta)^2)*(posteriorAlpha + posteriorBeta + 1))
posteriorSD = sqrt(posteriorVariance)

plot(x = xCoord, y = yCoord, type = "l", xlab = expression(theta), 
     ylab = expression(paste("p(", theta, ")")), main = paste("Red = MAP = ", round(x = mapEstimate, digits = 2), 
                                                              "; Green = EAP = ", round(x = posteriorMean, digits = 2),
                                                              "; SD = ", round(x = posteriorSD, digits = 2)))
lines(x = c(mapEstimate, mapEstimate), y = c(0, dbeta(x = mapEstimate, shape1 = posteriorAlpha, shape2 = posteriorBeta)), lty = 2, col=2)
lines(x = c(posteriorMean, posteriorMean), y = c(0, dbeta(x = posteriorMean, shape1 = posteriorAlpha, shape2 = posteriorBeta)), lty = 2, col=3)
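As a sanity check (assuming the chunk above was just run with priorAlpha = priorBeta = 1), the grid-based MAP should agree with the closed-form posterior mode; with this uniform prior the MAP equals the MLE, while the EAP is pulled slightly toward the prior mean of 0.5.

# grid-based MAP vs. closed-form posterior mode; both should be 0.7 here
round(c(gridMAP = mapEstimate, closedFormMode = posteriorMode), digits = 3)
# EAP = (y + priorAlpha)/(J + priorAlpha + priorBeta) = 8/12, about 0.667
round(posteriorMean, digits = 3)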

Comparison of All Three Functions: Prior, Likelihood, and Posterior

Now, let’s put all three functions onto the same plot to compare: the prior, the likelihood, and the posterior.

y = 7
J = 10
priorAlpha = 6
priorBeta = 6

posteriorAlpha = y + priorAlpha
posteriorBeta = J - y + priorBeta

xCoord = seq(.001, .999, .001)
yCoordLikelihood = dbinom(x = 7, size = 10, prob =  xCoord)
yCoordPosterior = dbeta(x = xCoord, shape1 = posteriorAlpha, shape2 = posteriorBeta)
yCoordPrior = dbeta(x = xCoord, shape1 = priorAlpha, shape2 = priorBeta)

# MAP estimate from the posterior grid (yCoordPosterior, not the yCoord from the previous chunk)
mapEstimate = xCoord[which(yCoordPosterior==max(yCoordPosterior))]
posteriorMode = (posteriorAlpha - 1)/(posteriorAlpha + posteriorBeta - 2)
posteriorMean = (posteriorAlpha)/(posteriorAlpha + posteriorBeta)
posteriorVariance = (posteriorAlpha*posteriorBeta)/(((posteriorAlpha + posteriorBeta)^2)*(posteriorAlpha + posteriorBeta + 1))
posteriorSD = sqrt(posteriorVariance)

yMat = cbind(yCoordPrior, yCoordLikelihood, yCoordPosterior)
xMat = cbind(xCoord, xCoord, xCoord)

matplot(x = xMat, y = yMat, type = "l", xlab = expression(theta), 
     ylab = expression(paste("p(", theta, ")")), main = paste("Red = MAP = ", round(x = mapEstimate, digits = 2), 
                                                              "; Green = EAP = ", round(x = posteriorMean, digits = 2),
                                                              "; SD = ", round(x = posteriorSD, digits = 2)),
     lwd = 3)
legend(x = 0, y = 1, legend = c("Prior", "Likelihood", "Posterior"), col = 1:3, lty = 1:3, lwd = 3)
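One optional refinement (not in the original plot): the likelihood dbinom(x = 7, size = 10, prob = theta) is not a normalized density in \(\theta\), so its curve sits far below the two Beta densities. Rescaling it to integrate to 1 over \(\theta\) (which turns it into a \(Beta\left(y+1, J-y+1\right)\) density) puts all three curves on a comparable scale.

# rescale the likelihood so it integrates to 1 over theta
likelihoodArea = sum(yCoordLikelihood) * 0.001   # simple Riemann-sum approximation
yCoordScaledLikelihood = yCoordLikelihood / likelihoodArea

matplot(x = xMat, y = cbind(yCoordPrior, yCoordScaledLikelihood, yCoordPosterior),
        type = "l", lwd = 3, xlab = expression(theta), ylab = "density")
legend(x = 0, y = 1, legend = c("Prior", "Scaled Likelihood", "Posterior"), col = 1:3, lty = 1:3, lwd = 3)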