Bayesian Inference Theory
Overview
Bayesian inference treats probability as an extension of logic under uncertainty, following E.T. Jaynes' "Probability Theory: The Logic of Science." Unlike frequentist statistics, which interprets probability as long-run frequency, Bayesian probability represents degrees of belief updated by evidence.
Core Principle: Bayes' Theorem
Bayes' Theorem is the fundamental equation for updating beliefs:
$$P(\theta | D) = \frac{P(D | \theta) \times P(\theta)}{P(D)}$$
Where:
- $P(\theta | D)$ = Posterior: Updated belief about parameter $\theta$ after observing data $D$
- $P(D | \theta)$ = Likelihood: Probability of observing data $D$ given parameter $\theta$
- $P(\theta)$ = Prior: Initial belief about $\theta$ before seeing data
- $P(D)$ = Evidence: Marginal probability of data (normalization constant)
The posterior is proportional to the likelihood times the prior:
$$P(\theta | D) \propto P(D | \theta) \times P(\theta)$$
Cox's Theorems: Probability as Logic
E.T. Jaynes showed that Cox's theorems prove that any consistent system of reasoning under uncertainty must obey the rules of probability theory. This establishes Bayesian inference as the unique consistent extension of Boolean logic to uncertain propositions.
Key insights:
- Probabilities represent states of knowledge, not physical randomness
- Prior probabilities encode existing knowledge before observing new data
- Updating via Bayes' theorem is the only consistent way to learn from evidence
Conjugate Priors
A conjugate prior for a likelihood function is one that produces a posterior distribution in the same family as the prior. This enables closed-form Bayesian updates without numerical integration.
Beta-Binomial Conjugate Family
For binary outcomes (success/failure):
Prior: Beta($\alpha$, $\beta$)
$$p(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$
Likelihood: Binomial($n$, $\theta$) with $k$ successes
$$p(k | \theta, n) \propto \theta^k (1-\theta)^{n-k}$$
Posterior: Beta($\alpha + k$, $\beta + n - k$)
$$p(\theta | k, n) = \text{Beta}(\alpha + k, \beta + n - k)$$
Interpretation:
- $\alpha$ = "prior successes + 1"
- $\beta$ = "prior failures + 1"
- $\alpha + \beta$ = "effective sample size" of prior belief (higher = stronger prior)
- After observing data, simply add observed successes to $\alpha$ and failures to $\beta$
Common Prior Choices
1. Uniform Prior: Beta(1, 1)
- Represents complete ignorance
- All probabilities $\theta \in [0, 1]$ are equally likely
- Posterior is dominated by data
2. Jeffrey's Prior: Beta(0.5, 0.5)
- Non-informative prior invariant under reparameterization
- Recommended when no prior knowledge exists
- Slightly favors extreme values (0 or 1)
3. Informative Prior: Beta($\alpha$, $\beta$) with $\alpha, \beta > 1$
- Encodes domain knowledge from past experience
- Example: Beta(80, 20) = "strong belief in 80% success rate based on 100 trials"
- Requires more data to overcome strong priors
Posterior Statistics
Posterior Mean (Expected Value)
For Beta($\alpha$, $\beta$):
$$E[\theta | D] = \frac{\alpha}{\alpha + \beta}$$
This is the expected value of the parameter under the posterior distribution.
Posterior Mode (MAP Estimate)
Maximum A Posteriori (MAP) estimate is the most probable value:
For Beta($\alpha$, $\beta$) with $\alpha > 1, \beta > 1$:
$$\text{mode}[\theta | D] = \frac{\alpha - 1}{\alpha + \beta - 2}$$
Note: For uniform prior Beta(1, 1), there is no unique mode (flat distribution).
Posterior Variance (Uncertainty)
For Beta($\alpha$, $\beta$):
$$\text{Var}[\theta | D] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
Key property: Variance decreases as $\alpha + \beta$ increases (more data = more certainty).
Credible Intervals vs Confidence Intervals
Credible Interval: Bayesian probability that parameter lies in interval
- 95% credible interval: $P(a \leq \theta \leq b | D) = 0.95$
- Interpretation: "There is a 95% probability that $\theta$ is in $[a, b]$ given the data"
- Directly measures uncertainty about parameter
Confidence Interval (frequentist): Long-run frequency interpretation
- 95% confidence interval: In repeated sampling, 95% of intervals contain true $\theta$
- Cannot say: "95% probability that $\theta$ is in this specific interval"
- Measures sampling variability, not parameter uncertainty
Why credible intervals are superior: Bayesian intervals answer the question we actually care about: "What are plausible parameter values given this data?"
Posterior Predictive Distribution
The posterior predictive integrates over all possible parameter values weighted by the posterior:
$$p(\tilde{x} | D) = \int p(\tilde{x} | \theta) , p(\theta | D) , d\theta$$
For Beta-Binomial, the posterior predictive probability of success is:
$$p(\text{success} | D) = \frac{\alpha}{\alpha + \beta} = E[\theta | D]$$
This is the expected probability of success on the next trial, accounting for parameter uncertainty.
Sequential Bayesian Updating
Bayesian inference naturally handles sequential data:
- Start with prior $P(\theta)$
- Observe data batch $D_1$, compute posterior $P(\theta | D_1)$
- Use $P(\theta | D_1)$ as the new prior
- Observe data batch $D_2$, compute posterior $P(\theta | D_1, D_2)$
- Repeat indefinitely
Key insight: The final posterior is the same regardless of data order (commutativity).
This matches the PDCA cycle in the Toyota Production System:
- Plan: Specify prior distribution from standardized work
- Do: Execute process and collect data (likelihood)
- Check: Compute posterior distribution
- Act: Update standards (new prior) if needed
Choosing Priors
Non-Informative Priors
Use when you have no prior knowledge:
- Uniform Prior: Beta(1, 1) for proportions
- Jeffrey's Prior: Beta(0.5, 0.5) for invariance
- Weakly Informative: Beta(0.1, 0.1) for minimal influence
Informative Priors
Use when you have domain knowledge:
- Historical Data: Estimate $\alpha$, $\beta$ from past experiments
- Expert Elicitation: Ask domain experts for mean and certainty
- Hierarchical Priors: Learn priors from related tasks
Prior Sensitivity Analysis
Always check how results change with different priors:
- Run inference with weak prior (e.g., Beta(1, 1))
- Run inference with strong prior (e.g., Beta(50, 50))
- Compare posteriors—if drastically different, collect more data
Conjugate Families (Summary)
| Likelihood | Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli/Binomial | Beta | Beta | Binary outcomes (success/fail) |
| Poisson | Gamma | Gamma | Count data (events per interval) |
| Normal (known variance) | Normal | Normal | Continuous data with known noise |
| Normal (unknown variance) | Normal-Inverse-Gamma | Normal-Inverse-Gamma | General continuous data |
| Multinomial | Dirichlet | Dirichlet | Categorical data (k > 2 classes) |
Bayesian vs Frequentist
| Aspect | Bayesian | Frequentist |
|---|---|---|
| Probability | Degree of belief | Long-run frequency |
| Parameters | Random variables | Fixed unknowns |
| Inference | Posterior distribution | Point estimate + SE |
| Prior knowledge | Incorporated naturally | Not allowed |
| Uncertainty | Credible intervals | Confidence intervals |
| Sequential learning | Natural | Requires recomputation |
| Small data | Works well | Often unreliable |
Practical Guidelines
When to use Bayesian inference:
- Small datasets where every observation matters
- Sequential decision-making (A/B testing, clinical trials)
- Incorporating prior knowledge or expert opinion
- Need to quantify uncertainty in predictions
- Model comparison via Bayes factors
Advantages over frequentist:
- Direct probability statements about parameters
- Natural handling of sequential data
- Automatic regularization through priors
- Principled framework for model selection
Disadvantages:
- Computationally intensive for complex models (MCMC required)
- Prior choice can influence results (requires sensitivity analysis)
- Less familiar to many practitioners
Aprender Implementation
Aprender implements conjugate priors with the following design:
use aprender::bayesian::BetaBinomial;
// Prior specification
let mut model = BetaBinomial::uniform(); // Beta(1, 1)
// Bayesian update
model.update(successes, trials);
// Posterior statistics
let mean = model.posterior_mean();
let mode = model.posterior_mode().unwrap();
let variance = model.posterior_variance();
// Credible interval
let (lower, upper) = model.credible_interval(0.95).unwrap();
// Predictive distribution
let prob = model.posterior_predictive();
See the Beta-Binomial case study for complete examples.
Further Reading
-
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
- The foundational text on Bayesian probability as logic
-
Gelman, A., et al. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
- Comprehensive practical guide to Bayesian methods
-
McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC Press.
- Intuitive introduction with focus on causal inference
-
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
- Modern treatment connecting Bayesian methods to ML
References
-
Cox, R. T. (1946). "Probability, Frequency and Reasonable Expectation." American Journal of Physics, 14(1), 1-13.
-
Jeffreys, H. (1946). "An Invariant Form for the Prior Probability in Estimation Problems." Proceedings of the Royal Society of London A, 186(1007), 453-461.
-
Laplace, P.-S. (1814). Essai philosophique sur les probabilités. Translated as A Philosophical Essay on Probabilities (1902).