Lecture 1

- Bayesian Statistics: A Definition
- Posterior Distributions
- Bayesian Updating

…but, before we formally begin…

- Thomas Bayes was born in 1701 in London
- His father was a nonconformist minister
- Nonconformists were Protestant Christians who did not conform to the doctrines of the Church of England

- Bayes was a Presbyterian minister
- Presbyterians are a Protestant denomination that follows a democratic system of church government

- Bayes was a mathematician and philosopher
- He was a fellow of the Royal Society
- He was a friend of Richard Price, a famous moral philosopher

- Also…Bayes’ methods make for a great children’s book!

- Bayesian statistical analysis refers to the use of models where some or all of the parameters are treated as random variables
- Each parameter comes from some type of distribution

- The likelihood function of the data is then augmented with an additional term representing the prior density of each parameter
- Think of this as saying each parameter value has a certain likelihood – the height of the prior distribution at that value

- The final estimates are then considered summaries of the posterior distribution of the parameter, conditional on the data
- In practice, we use these estimates to make inferences, just as is done when using non-Bayesian approaches (e.g., maximum likelihood/least squares)

Bayesian methods get used because of the *relative* accessibility of one method of estimation (MCMC – to be discussed shortly). There are four main reasons why people use MCMC:

- Missing data
- Lack of software capable of handling large analyses
- New models/generalizations of models not available in software
- Philosophical reasons (e.g., membership in the cult of Bayes)

- The use of Bayesian statistics has been controversial, historically (but less so today)
- The use of certain prior distributions can produce results that are biased or reflect subjective judgment rather than objective science

- Most MCMC estimation methods are computationally intensive
- Until very recently, very few methods were available for those who aren’t into programming in Fortran, C, or C++

- Understanding of Bayesian methods had been very limited outside the field of mathematical statistics (but that is changing now)
- Over the past 20 years, Bayesian methods have become widespread – making new models estimable and becoming standard in some social science fields (e.g., quantitative psychology and educational measurement)

Bayesian methods rely on Bayes’ Theorem

\[P(A \mid B) = \frac{P(B\mid A)P(A)}{P(B)} \propto P(B\mid A)P(A)\] Here:

- \(P(A)\) is the __prior distribution__ (pdf) of A (i.e., WHY THINGS ARE BAYESIAN)
- \(P(B)\) is the __marginal distribution__ (pdf) of B
- \(P(B \mid A)\) is the __conditional distribution__ (pdf) of B, given A
- \(P(A \mid B)\) is the __posterior distribution__ (pdf) of A, given B

Suppose we wanted to assess the probability of rolling a one on a six-sided die: \[p_1 = P(D=1)\]

We then collect a sample of data \(\boldsymbol{X} = \{0,1,0,1,1 \}\)

- These are independent rolls of the die

The posterior distribution of the probability of a one conditional on the data is: \[P(p_1 \mid \boldsymbol{X})\]

We can determine this via Bayes theorem: \[P(p_1 \mid \boldsymbol{X}) = \frac{P(\boldsymbol{X} \mid p_1)P(p_1)}{P(\boldsymbol{X})} \propto P(\boldsymbol{X} \mid p_1)P(p_1)\]

The likelihood of the data given the parameter:

\[P(\boldsymbol{X} \mid p_1) = \prod_{i=1}^N p_1^{X_i} \left(1-p_1\right)^{(1-X_i)}\]

- Any given roll of the die \(X_i\) is a Bernoulli variable \(X_i \sim B(p_1)\)
- A “success” is defined by rolling a one

- The product in the likelihood function comes from each roll being independent
- The outcome of a roll does not depend on previous or future rolls
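
To make this concrete, here is a minimal R sketch (my own illustration; the function name `likelihood` and the coding of the rolls are assumptions, not from the lecture) that evaluates this likelihood for the observed data at a few candidate values of \(p_1\):

```r
# Observed rolls, coded 1 = rolled a one, 0 = otherwise (data from the slides)
X <- c(0, 1, 0, 1, 1)

# Bernoulli data likelihood: product over independent rolls
likelihood <- function(p1, x) prod(p1^x * (1 - p1)^(1 - x))

# Evaluate at a few candidate values of p1
sapply(c(1/6, 1/2, 3/5), likelihood, x = X)
```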

We must now pick the prior distribution of \(p_1\):

\[P(p_1)\]

- Our choice is subjective: Many distributions to choose from
- What we know is that for a “fair” die, the probability of rolling a one is \(\frac{1}{6}\)
- But…a single probability is not a distribution

- Instead, let’s consider a Beta distribution \(p_1 \sim Beta\left(\alpha, \beta\right)\)

For parameters that range between zero and one (or, with rescaling, between any two finite endpoints), the Beta distribution makes a good choice for a prior:

\[P(p_1) = \frac{\left( p_1\right)^{\alpha-1} \left(1-p_1 \right)^{\beta-1}}{B\left(\alpha, \beta\right)}, \] where:

\[B\left(\alpha, \beta\right) = \frac{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}{\Gamma\left(\alpha+\beta\right)}, \] and,

\[\Gamma\left(z \right) = \int_0^\infty t^{z-1} e^{-t}dt\]
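
As a quick sanity check (my own sketch, not part of the lecture), the density formula above can be verified against R’s built-in `dbeta()`; the values of `a`, `b`, and `p` below are arbitrary example choices:

```r
# Example hyperparameter values and a point at which to evaluate the density
a <- 1; b <- 5
p <- 0.2

# Density from the formula above: p^(a-1) * (1-p)^(b-1) / B(a, b)
(p^(a - 1) * (1 - p)^(b - 1)) / beta(a, b)

# R's built-in Beta density gives the same value
dbeta(p, a, b)
```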

The Beta distribution has a mean of \(\frac{\alpha}{\alpha+\beta}\)

- The parameters \(\alpha\) and \(\beta\) are called __hyperparameters__
- Hyperparameters are parameters of prior distributions

- We can pick values of \(\alpha\) and \(\beta\) to correspond to \(\frac{1}{6}\)
- Many choices: \(\alpha=1\) and \(\beta=5\) have the same mean as \(\alpha=100\) and \(\beta=500\)

- What is the difference?
- How strongly we feel in our beliefs…as quantified by…

The Beta distribution has a variance of \(\frac{\alpha\beta}{\left(\alpha+\beta \right)^2 \left(\alpha+\beta+1 \right)}\)

- Choosing \(\alpha=1\) and \(\beta=5\) yields a prior with mean \(\frac{1}{6}\) and variance \(\approx 0.02\)
- Choosing \(\alpha=100\) and \(\beta=500\) yields a prior with mean \(\frac{1}{6}\) and variance \(\approx 0.0002\)
- The smaller prior variance means the prior is more __informative__
- Informative priors are those that have relatively small variances; __uninformative__ priors are those that have relatively large variances
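
A small R sketch (mine; the helper names `beta_mean` and `beta_var` are assumptions) confirming that the two priors share the same mean but differ sharply in variance:

```r
# Mean and variance of a Beta(a, b) distribution, from the formulas above
beta_mean <- function(a, b) a / (a + b)
beta_var  <- function(a, b) (a * b) / ((a + b)^2 * (a + b + 1))

beta_mean(1, 5);     beta_var(1, 5)      # 1/6, ~0.0198
beta_mean(100, 500); beta_var(100, 500)  # 1/6, ~0.0002
```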

Choosing a Beta distribution for a prior for \(p_1\) is *very* convenient

- When combined with a Bernoulli (binomial) data likelihood, the posterior distribution can be derived analytically
- The posterior distribution is also a Beta distribution
- \(\alpha = a + \sum_{i=1}^N X_i\) (\(a\) is a hyperparameter of the prior distribution)
- \(\beta = b + N - \sum_{i=1}^N X_i\) (\(b\) is the other hyperparameter of the prior distribution)

- The Beta prior is said to be a __conjugate prior__: a prior distribution that leads to a posterior distribution of the same family
- Here, prior == Beta and posterior == Beta
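
In code, the conjugate update is just addition; a minimal sketch, assuming the rolls and prior hyperparameters from this example:

```r
X <- c(0, 1, 0, 1, 1)  # observed rolls
a <- 1; b <- 5         # prior hyperparameters

# Conjugate update: posterior is Beta(alpha_post, beta_post)
alpha_post <- a + sum(X)               # 1 + 3 = 4
beta_post  <- b + length(X) - sum(X)   # 5 + 5 - 3 = 7
```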

To determine the estimate of \(p_1\), we use summaries of the posterior distribution:

- With prior hyperparameters \(a=1\) and \(b=5\)
- \(\hat{p}_1 = \frac{1+3}{(1+3) + (5+2)} = \frac{4}{11} = .36\)
- SD = `0.1388659`

- With prior hyperparameters \(a=100\) and \(b=500\)
- \(\hat{p}_1 = \frac{100+3}{(100+3) + (500+2)} = \frac{103}{605} = .17\)
- SD = `0.0152679`

- The standard deviation (SD) of the posterior distribution is analogous to the standard error in frequentist statistics
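
The estimates and SDs above can be reproduced with a short R sketch (my own; the helper `posterior_summary` is an assumption) built from the posterior Beta formulas given earlier:

```r
# Posterior mean and SD of Beta(a + sum(x), b + N - sum(x))
posterior_summary <- function(a, b, x) {
  alpha <- a + sum(x)
  beta  <- b + length(x) - sum(x)
  c(mean = alpha / (alpha + beta),
    sd   = sqrt((alpha * beta) / ((alpha + beta)^2 * (alpha + beta + 1))))
}

X <- c(0, 1, 0, 1, 1)
posterior_summary(1, 5, X)      # mean = 4/11 ~ .36,  sd ~ 0.1388659
posterior_summary(100, 500, X)  # mean = 103/605 ~ .17, sd ~ 0.0152679
```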

We can use the posterior distribution as a prior!

Let’s roll a die to find out how…
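
As a preview (a sketch of my own, not the lecture’s demonstration): because the posterior is again a Beta distribution, updating one roll at a time – with each posterior serving as the next prior – gives exactly the same result as updating with all the data at once:

```r
X <- c(0, 1, 0, 1, 1)
a <- 1; b <- 5  # starting prior hyperparameters

# After each roll, the posterior becomes the prior for the next roll
for (x in X) {
  a <- a + x
  b <- b + 1 - x
}
c(a, b)  # (4, 7): identical to updating with all five rolls at once
```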

Today was a very quick introduction to Bayesian concepts:

- __prior distribution__
- __hyperparameters__
- __informative/uninformative__
- __conjugate prior__
- __data likelihood__
- __posterior distribution__

- Next we will discuss psychometric models and how they fit into Bayesian methods