Bayesian analysis is all about estimating the posterior distribution
Example Data: https://stats.idre.ucla.edu/spss/library/spss-libraryhow-do-i-handle-interactions-of-continuous-andcategorical-variables/
The file DietData.csv contains data from 30 respondents who participated in a study regarding the effectiveness of three types of diets.
Variables in the data set are:
The research question: Are there differences in final weights between the three diet groups, and, if so, what are the nature of the differences?
But first, let’s look at the data
Now, your turn to answer questions:
WeightLB
) is appropriate as-is for such analysis or does it need transformed?Let’s play with models for data…
# center predictors for reasonable numbers
DietData$HeightIN60 = DietData$HeightIN-60
# full analysis model suggested by data:
FullModel = lm(formula = WeightLB ~ 1, data = DietData)
# examining assumptions and leverage of fit
# plot(FullModel)
# looking at ANOVA table
# anova(FullModel)
# looking at parameter summary
summary(FullModel)
Call:
lm(formula = WeightLB ~ 1, data = DietData)
Residuals:
Min 1Q Median 3Q Max
-62.00 -36.75 -24.00 49.00 98.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 171.000 9.041 18.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 49.52 on 29 degrees of freedom
\[\text{WeightLB}_p = \beta_0 + e_p,\] Where: \(e_p \sim N(0, \sigma^2_e)\)
Questions:
Like many compiled languages, Stan expects you to declare what type of data/parameters you are defining:
int
: Integer values (no decimals)real
: Floating point numbersvector
: A one-dimensional set of real valued numbersSometimes, additional definitions are provided giving the range of the variable (or restricting the set of starting values):
real<lower=0> sigma;
See: https://mc-stan.org/docs/reference-manual/data-types.html for more information
y ~ normal(beta0, sigma); // model for observed data
sigma ~ uniform(0, 100000); // prior for sigma
See: https://mc-stan.org/docs/functions-reference/index.html for more information
cmdstanr
and rstan
differ
cmdstanr
wants you to compile first, then run the Markov chainrstan
conducts compilation (if needed) then runs the Markov chainN
and a vector named y
cmdstanr
and rstan
cmdstanr
cmdstanr
, running the chain comes from the $sample()
function that is a part of the compiled program objectrstan
verbose
option is helpful for detecting when things break# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -129. -128. 1.02 0.735 -131. -128. 1.00 17577. 22523.
2 beta0 171. 171. 9.44 9.31 155. 186. 1.00 29639. 25175.
3 sigma 51.8 51.0 7.20 6.84 41.5 64.8 1.00 29147. 24620.
lp__
is posterior log likelihood–does not necessarily need examinedess_
columns show effect sample size for chain (factoring in autocorrelation between correlations)
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -129. -128. 1.02 0.735 -131. -128. 1.00 17577. 22523.
2 beta0 171. 171. 9.44 9.31 155. 186. 1.00 29639. 25175.
3 sigma 51.8 51.0 7.20 6.84 41.5 64.8 1.00 29147. 24620.
lower upper
155.015 185.762
attr(,"credMass")
[1] 0.9
lower upper
40.2305 63.0083
attr(,"credMass")
[1] 0.9