# Introduction to Bayesian Concepts

## Today’s Lecture Objectives

1. Bayesian Statistics: A Definition
2. Posterior Distributions
3. Bayesian Updating

…but, before we formally begin…

## Part of My Summer…

Thomas Bayes (1701-1761)

## Bayesian from Birth

A brief video to start the semester… pic.twitter.com/T9O9UzJVWT

## The Basics of Bayesian Analyses

• Bayesian statistical analysis refers to the use of models where some or all of the parameters are treated as random components
• Each parameter comes from some type of distribution
• The likelihood function of the data is then augmented with an additional term that represents the likelihood of the prior distribution for each parameter
• Think of this as saying each parameter has a certain likelihood – the height of the prior distribution
• The final estimates are then considered summaries of the posterior distribution of the parameter, conditional on the data
• In practice, we use these estimates to make inferences, just as is done when using non-Bayesian approaches (e.g., maximum likelihood/least squares)

## Why are Bayesian Methods Used?

• Bayesian methods get used because of the relative accessibility of one method of estimation (MCMC – to be discussed shortly)

• There are four main reasons why people use MCMC:

1. Missing data
2. Lack of software capable of handling large analyses
3. New models/generalizations of models not available in software
4. Philosophical reasons (e.g., membership in the cult of Bayes)

## Perceptions and Issues with Bayesian Methods

• The use of Bayesian statistics has been controversial, historically (but less so today)
• The use of certain prior distributions can produce results that are biased or reflect subjective judgment rather than objective science
• Most MCMC estimation methods are computationally intensive
• Until very recently, very few tools were available for those who aren’t into programming in Fortran, C, or C++
• Understanding of Bayesian methods had been very limited outside the field of mathematical statistics (but that is changing now)
• Over the past 20 years, Bayesian methods have become widespread – making new models estimable and becoming standard in some social science fields (quantitative psychology and educational measurement)

## How Bayesian Statistics Work

Bayesian methods rely on Bayes’ Theorem

$P (A \mid B) = \frac{P(B\mid A)P(A)}{P(B)} \propto P(B\mid A)P(A)$

Here:

• $P(A)$ is the prior distribution (pdf) of A (i.e., WHY THINGS ARE BAYESIAN)
• $P(B)$ is the marginal distribution (pdf) of B
• $P(B \mid A)$ is the conditional distribution (pdf) of B, given A
• $P(A \mid B)$ is the posterior distribution (pdf) of A, given B
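
As a small numeric illustration of the theorem, the sketch below applies it to two discrete hypotheses about a die. The "loaded" hypothesis, its probability of one half, and the equal prior weights are assumptions invented for this illustration; they are not part of the lecture's example.

```python
from fractions import Fraction

# Hypothetical setup (illustrative values only): the die is either fair or loaded.
prior = {"fair": Fraction(1, 2), "loaded": Fraction(1, 2)}        # P(A)
p_one_given = {"fair": Fraction(1, 6), "loaded": Fraction(1, 2)}  # P(B | A), B = "rolled a one"

# Marginal P(B): sum of P(B | A) P(A) over every hypothesis A.
marginal = sum(p_one_given[h] * prior[h] for h in prior)

# Posterior P(A | B) from Bayes' theorem.
posterior = {h: p_one_given[h] * prior[h] / marginal for h in prior}

for h, p in posterior.items():
    print(h, p)
```

Dividing by the marginal only rescales the products $P(B \mid A)P(A)$ so that they sum to one, which is why the theorem is often written with the proportionality sign.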

## A Live Bayesian Example

• Suppose we wanted to assess the probability of rolling a one on a six-sided die: $p_1 = P(D=1)$

• We then collect a sample of data $\boldsymbol{X} = \{0,1,0,1,1 \}$, where $X_i = 1$ if roll $i$ is a one and $X_i = 0$ otherwise

• These are independent rolls of the die
• The posterior distribution of the probability of a one conditional on the data is: $P(p_1 \mid \boldsymbol{X})$

• We can determine this via Bayes’ theorem: $P(p_1 \mid \boldsymbol{X}) = \frac{P(\boldsymbol{X} \mid p_1)P(p_1)}{P(\boldsymbol{X})} \propto P(\boldsymbol{X} \mid p_1)P(p_1)$

## Defining the Likelihood Function $P(\boldsymbol{X} \mid p_1)$

The likelihood of the data given the parameter:

$P(\boldsymbol{X} \mid p_1) = \prod_{i=1}^N p_1^{X_i} \left(1-p_1\right)^{(1-X_i)}$

• Any given roll of the die $X_i$ is a Bernoulli variable $X_i \sim \text{Bernoulli}(p_1)$
• A “success” is defined by rolling a one
• The product in the likelihood function comes from each roll being independent
• The outcome of a roll does not depend on previous or future rolls
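
The likelihood above can be evaluated directly for the sample $\boldsymbol{X} = \{0,1,0,1,1\}$. The following is a minimal sketch; the function name `bernoulli_likelihood` and the trial values of $p_1$ are choices made here for illustration.

```python
def bernoulli_likelihood(x, p1):
    """P(X | p1) for independent 0/1 observations x."""
    likelihood = 1.0
    for x_i in x:
        # Each roll contributes p1 if it is a one, and (1 - p1) otherwise.
        likelihood *= p1 ** x_i * (1 - p1) ** (1 - x_i)
    return likelihood

X = [0, 1, 0, 1, 1]  # data from the lecture example

for p1 in (1 / 6, 0.36, 0.6):
    print(f"p1 = {p1:.3f}  likelihood = {bernoulli_likelihood(X, p1):.5f}")
```

With three ones in five rolls, the function reduces to $p_1^3(1-p_1)^2$, which is largest at the sample proportion $p_1 = 0.6$.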

## Choosing the Prior Distribution for $p_1$

We must now pick the prior distribution of $p_1$:

$P(p_1)$

• Our choice is subjective: Many distributions to choose from
• What we know is that for a “fair” die, the probability of rolling a one is $\frac{1}{6}$
• But…probability is not a distribution
• Instead, let’s consider a Beta distribution $p_1 \sim Beta\left(\alpha, \beta\right)$

## The Beta Distribution

For parameters that range between zero and one (or two finite end points), the Beta distribution makes a good choice for a prior:

$P(p_1) = \frac{\left( p_1\right)^{\alpha-1} \left(1-p_1 \right)^{\beta-1}}{B\left(\alpha, \beta\right)},$

where:

$B\left(\alpha, \beta\right) = \frac{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}{\Gamma\left(\alpha+\beta\right)},$

and,

$\Gamma\left(z \right) = \int_0^\infty t^{z-1} e^{-t}dt$
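
The density can be computed from these definitions with the standard library alone. This is a sketch, not a replacement for a statistics package: it works on the log scale with `math.lgamma`, because evaluating $\Gamma(z)$ directly would overflow for large hyperparameters such as $\alpha + \beta = 600$. The function names are hypothetical.

```python
import math

def log_beta_function(alpha, beta):
    """log B(alpha, beta), computed from log-Gamma values."""
    return math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)

def beta_pdf(p, alpha, beta):
    """Beta(alpha, beta) density at p, for 0 < p < 1."""
    log_density = ((alpha - 1) * math.log(p)
                   + (beta - 1) * math.log(1 - p)
                   - log_beta_function(alpha, beta))
    return math.exp(log_density)

# Height of two candidate priors at the fair-die value p = 1/6.
print(beta_pdf(1 / 6, 1, 5))
print(beta_pdf(1 / 6, 100, 500))
```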

## More Beta Distribution

The Beta distribution has a mean of $\frac{\alpha}{\alpha+\beta}$

• The parameters $\alpha$ and $\beta$ are called hyperparameters
• Hyperparameters are parameters of prior distributions
• We can pick values of $\alpha$ and $\beta$ to correspond to $\frac{1}{6}$
• Many choices: $\alpha=1$ and $\beta=5$ have the same mean as $\alpha=100$ and $\beta=500$
• What is the difference?
• How strongly we feel in our beliefs…as quantified by…

## More More Beta Distribution

The Beta distribution has a variance of $\frac{\alpha\beta}{\left(\alpha+\beta \right)^2 \left(\alpha+\beta+1 \right)}$

• Choosing $\alpha=1$ and $\beta=5$ yields a prior with mean $\frac{1}{6}$ and variance $0.02$
• Choosing $\alpha=100$ and $\beta=500$ yields a prior with mean $\frac{1}{6}$ and variance $0.0002$
• Informative priors are those that have relatively small variances
• Uninformative priors are those that have relatively large variances
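
The two candidate priors can be compared numerically using the mean and variance formulas. The helper names below are hypothetical; this is only a quick check of the values quoted above.

```python
def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

for alpha, beta in [(1, 5), (100, 500)]:
    print(f"Beta({alpha}, {beta}): mean = {beta_mean(alpha, beta):.4f}, "
          f"variance = {beta_variance(alpha, beta):.6f}")
```

Both priors are centered on $\frac{1}{6}$, but the second has a variance roughly 86 times smaller, so it encodes a much stronger belief that the die is fair.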

## The Posterior Distribution

Choosing a Beta distribution for a prior for $p_1$ is very convenient

• When combined with a Bernoulli (Binomial) data likelihood, the posterior distribution can be derived analytically
• The posterior distribution is also a Beta distribution
• Posterior parameter $\alpha = a + \sum_{i=1}^NX_i$ ($a$ is the first hyperparameter of the prior distribution)
• Posterior parameter $\beta = b + N - \sum_{i=1}^NX_i$ ($b$ is the second hyperparameter of the prior distribution)
• The Beta prior is said to be a conjugate prior: A prior distribution that leads to a posterior distribution of the same family
• Here, prior == Beta and posterior == Beta
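
The conjugacy result can be checked by multiplying the likelihood by the prior and dropping every factor that does not involve $p_1$. A sketch of that step, writing $a$ and $b$ for the prior hyperparameters:

$P(p_1 \mid \boldsymbol{X}) \propto \left[\prod_{i=1}^N p_1^{X_i} \left(1-p_1\right)^{(1-X_i)}\right] \left(p_1\right)^{a-1} \left(1-p_1\right)^{b-1} = \left(p_1\right)^{a + \sum_{i=1}^N X_i - 1} \left(1-p_1\right)^{b + N - \sum_{i=1}^N X_i - 1}$

The right-hand side has the same form as the numerator of a Beta density, with $a + \sum_{i=1}^N X_i$ in place of $\alpha$ and $b + N - \sum_{i=1}^N X_i$ in place of $\beta$. Because a density must integrate to one, the omitted constant has to be the matching Beta function, so the posterior is exactly a Beta distribution with those two parameters.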

## Bayesian Estimates are Summaries of the Posterior Distribution

To determine the estimate of $p_1$, we use summaries of the posterior distribution:

• With prior hyperparameters $a=1$ and $b=5$
• $\hat{p}_1 = \frac{1+3}{1+3 +5+2} = \frac{4}{11} = .36$
• SD = 0.1388659
• With prior hyperparameters $a=100$ and $b=500$
• $\hat{p}_1 = \frac{100+3}{(100+3) + (500+2)} = \frac{103}{605} = .17$
• SD = 0.0152679
• The standard deviation (SD) of the posterior distribution is analogous to the standard error in frequentist statistics
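
These summaries can be reproduced from the data and the two priors. The sketch below assumes the conjugate Beta update from the previous slide; the function names are hypothetical.

```python
import math

def beta_posterior(a, b, x):
    """Posterior Beta parameters from a Beta(a, b) prior and 0/1 data x."""
    successes = sum(x)
    return a + successes, b + len(x) - successes

def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)

X = [0, 1, 0, 1, 1]  # data from the lecture example

for a, b in [(1, 5), (100, 500)]:
    alpha, beta = beta_posterior(a, b, X)
    mean, sd = beta_mean_sd(alpha, beta)
    print(f"prior Beta({a}, {b}) -> posterior Beta({alpha}, {beta}): "
          f"mean = {mean:.4f}, SD = {sd:.7f}")
```

With only five rolls, the weak prior lets the data pull the estimate toward the sample proportion of 0.6, while the strong prior keeps it close to $\frac{1}{6}$.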

## Bayesian Updating

We can use the posterior distribution as a prior!

Let’s roll a die to find out how…
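
The live demonstration is not reproduced here. As a stand-in sketch, the code below reuses the five rolls from the earlier example and updates a Beta(1, 5) prior one roll at a time, with each posterior serving as the prior for the next roll. The `update` helper is hypothetical.

```python
def update(alpha, beta, roll_is_one):
    """One Bayesian update of a Beta(alpha, beta) belief after a single roll."""
    if roll_is_one:
        return alpha + 1, beta
    return alpha, beta + 1

X = [0, 1, 0, 1, 1]  # data reused from the earlier example
alpha, beta = 1, 5   # Beta(1, 5) prior from the earlier slides

for i, x_i in enumerate(X, start=1):
    # The posterior after roll i becomes the prior for roll i + 1.
    alpha, beta = update(alpha, beta, x_i == 1)
    print(f"after roll {i}: Beta({alpha}, {beta}), mean = {alpha / (alpha + beta):.3f}")
```

Updating roll by roll ends at Beta(4, 7), the same posterior obtained by processing all five rolls at once.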

## Wrapping Up

Today was a very quick introduction to Bayesian concepts:

• prior distribution
• hyperparameters
• informative/uninformative
• conjugate prior
• data likelihood
• posterior distribution
• Next we will discuss psychometric models and how they fit into Bayesian methods