Bayesian Inference Theory

Overview

Bayesian inference treats probability as an extension of logic under uncertainty, following E.T. Jaynes' "Probability Theory: The Logic of Science." Unlike frequentist statistics, which interprets probability as long-run frequency, Bayesian probability represents degrees of belief updated by evidence.

Core Principle: Bayes' Theorem

Bayes' Theorem is the fundamental equation for updating beliefs:

$$P(\theta | D) = \frac{P(D | \theta) \times P(\theta)}{P(D)}$$

Where:

$P(\theta | D)$ = Posterior: Updated belief about parameter $\theta$ after observing data $D$
$P(D | \theta)$ = Likelihood: Probability of observing data $D$ given parameter $\theta$
$P(\theta)$ = Prior: Initial belief about $\theta$ before seeing data
$P(D)$ = Evidence: Marginal probability of data (normalization constant)

The posterior is proportional to the likelihood times the prior:

$$P(\theta | D) \propto P(D | \theta) \times P(\theta)$$
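
For a finite set of candidate parameter values, the normalization by $P(D)$ is just a sum. The following is a minimal sketch of that case in plain Rust; the function name `bayes_update` is hypothetical and is not part of any library API.

```rust
/// Posterior over a finite set of hypotheses via Bayes' theorem.
/// `prior[i]` is P(theta_i) and `likelihood[i]` is P(D | theta_i).
fn bayes_update(prior: &[f64], likelihood: &[f64]) -> Vec<f64> {
    assert_eq!(prior.len(), likelihood.len());
    // Unnormalized posterior: likelihood times prior for each hypothesis.
    let joint: Vec<f64> = prior
        .iter()
        .zip(likelihood)
        .map(|(p, l)| p * l)
        .collect();
    // Evidence P(D): the sum over hypotheses, which normalizes the posterior.
    let evidence: f64 = joint.iter().sum();
    joint.iter().map(|j| j / evidence).collect()
}
```

As an illustration, take two hypotheses about a coin, $\theta = 0.5$ and $\theta = 0.8$, with equal prior weight. After one observed head, `bayes_update(&[0.5, 0.5], &[0.5, 0.8])` shifts the weight toward $\theta = 0.8$ in the ratio 0.4 to 0.25.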

Cox's Theorems: Probability as Logic

Cox's theorems (Cox, 1946), as developed by E.T. Jaynes, show that any system of reasoning under uncertainty that satisfies a small set of consistency requirements must obey the rules of probability theory. On this view, Bayesian inference is the unique consistent extension of Boolean logic to uncertain propositions.

Key insights:

Probabilities represent states of knowledge, not physical randomness
Prior probabilities encode existing knowledge before observing new data
Updating via Bayes' theorem is the only consistent way to learn from evidence

Conjugate Priors

A conjugate prior for a likelihood function is one that produces a posterior distribution in the same family as the prior. This enables closed-form Bayesian updates without numerical integration.

Beta-Binomial Conjugate Family

For binary outcomes (success/failure):

Prior: Beta($\alpha$, $\beta$)

$$p(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$

Likelihood: Binomial($n$, $\theta$) with $k$ successes

$$p(k | \theta, n) \propto \theta^k (1-\theta)^{n-k}$$

Posterior: Beta($\alpha + k$, $\beta + n - k$)

$$p(\theta | k, n) = \text{Beta}(\alpha + k, \beta + n - k)$$
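
The posterior form follows in one step from multiplying the likelihood by the prior and dropping every factor that does not depend on $\theta$. Using the same symbols as above:

$$p(\theta | k, n) \propto \theta^k (1-\theta)^{n-k} \times \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{(\alpha + k) - 1} (1-\theta)^{(\beta + n - k) - 1}$$

The right-hand side is the kernel of a Beta($\alpha + k$, $\beta + n - k$) density, so the normalizing constant must be $1 / B(\alpha + k, \beta + n - k)$.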

Interpretation:

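$\alpha$ = "prior successes + 1" (counting the uniform prior Beta(1, 1) as zero prior observations)
$\beta$ = "prior failures + 1" (same convention)
$\alpha + \beta$ = "effective sample size" of prior belief, up to that offset of 2 (higher = stronger prior)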
After observing data, simply add observed successes to $\alpha$ and failures to $\beta$

Common Prior Choices

1. Uniform Prior: Beta(1, 1)

Represents complete ignorance
All probabilities $\theta \in [0, 1]$ are equally likely
Posterior is dominated by data

2. Jeffreys Prior: Beta(0.5, 0.5)

Non-informative prior invariant under reparameterization
Recommended when no prior knowledge exists
Slightly favors extreme values (0 or 1)

3. Informative Prior: Beta($\alpha$, $\beta$) with $\alpha, \beta > 1$

Encodes domain knowledge from past experience
Example: Beta(80, 20) = "strong belief in 80% success rate based on 100 trials"
Requires more data to overcome strong priors
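
As a worked illustration of the last point (the data are hypothetical, chosen only for the arithmetic), suppose 10 new trials yield 2 successes. Using the mean $\alpha / (\alpha + \beta)$ of a Beta distribution, the strong Beta(80, 20) prior and the uniform Beta(1, 1) prior give:

$$\text{Beta}(80 + 2, 20 + 8): \quad E[\theta | D] = \frac{82}{110} \approx 0.745$$

$$\text{Beta}(1 + 2, 1 + 8): \quad E[\theta | D] = \frac{3}{12} = 0.25$$

Ten observations barely move the strong prior away from 0.8, while under the uniform prior the estimate sits close to the observed rate of 0.2.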

Posterior Statistics

Posterior Mean (Expected Value)

For Beta($\alpha$, $\beta$):

$$E[\theta | D] = \frac{\alpha}{\alpha + \beta}$$

This is the expected value of the parameter under the posterior distribution.

Posterior Mode (MAP Estimate)

Maximum A Posteriori (MAP) estimate is the most probable value:

For Beta($\alpha$, $\beta$) with $\alpha > 1, \beta > 1$:

$$\text{mode}[\theta | D] = \frac{\alpha - 1}{\alpha + \beta - 2}$$

Note: For uniform prior Beta(1, 1), there is no unique mode (flat distribution).

Posterior Variance (Uncertainty)

For Beta($\alpha$, $\beta$):

$$\text{Var}[\theta | D] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

Key property: For a fixed mean, variance decreases as $\alpha + \beta$ increases (more data = more certainty).
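
The three formulas above translate directly into code. This is a minimal sketch in plain Rust; the function names are hypothetical and are not Aprender's API.

```rust
/// Mean of Beta(alpha, beta).
fn beta_mean(alpha: f64, beta: f64) -> f64 {
    alpha / (alpha + beta)
}

/// Mode of Beta(alpha, beta). Returns `None` unless alpha > 1 and beta > 1,
/// which is the case the formula above covers.
fn beta_mode(alpha: f64, beta: f64) -> Option<f64> {
    if alpha > 1.0 && beta > 1.0 {
        Some((alpha - 1.0) / (alpha + beta - 2.0))
    } else {
        None
    }
}

/// Variance of Beta(alpha, beta).
fn beta_variance(alpha: f64, beta: f64) -> f64 {
    let total = alpha + beta;
    alpha * beta / (total * total * (total + 1.0))
}
```

For the posterior Beta(8, 4) (a uniform prior updated with 7 successes in 10 trials), these give a mean of 8/12, a mode of 7/10, and a variance of 32/1872, roughly 0.017. For Beta(1, 1), `beta_mode` returns `None`, matching the note about the flat distribution.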

Credible Intervals vs Confidence Intervals

Credible Interval: Bayesian probability that parameter lies in interval

95% credible interval: $P(a \leq \theta \leq b | D) = 0.95$
Interpretation: "There is a 95% probability that $\theta$ is in $[a, b]$ given the data"
Directly measures uncertainty about parameter

Confidence Interval (frequentist): Long-run frequency interpretation

95% confidence interval: In repeated sampling, 95% of intervals contain true $\theta$
Cannot say: "95% probability that $\theta$ is in this specific interval"
Measures sampling variability, not parameter uncertainty

Why credible intervals are superior: Bayesian intervals answer the question we actually care about: "What are plausible parameter values given this data?"
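
One way to see what a credible interval is computationally: accumulate posterior mass over a fine grid of $\theta$ values and read off the points where the cumulative mass crosses the two tail probabilities. The sketch below does this for an equal-tailed interval of a Beta posterior. It is a grid approximation written for illustration, it assumes $\alpha \geq 1$ and $\beta \geq 1$ so the density stays finite on $[0, 1]$, and the function name is hypothetical. A production implementation would more likely invert the regularized incomplete beta function.

```rust
/// Equal-tailed credible interval for Beta(alpha, beta), approximated on a grid.
/// Assumes alpha >= 1 and beta >= 1 so the density is finite everywhere.
fn credible_interval(alpha: f64, beta: f64, level: f64) -> (f64, f64) {
    let cells: usize = 100_000;
    let width = 1.0 / cells as f64;
    // Unnormalized Beta density at the midpoint of each grid cell.
    let weights: Vec<f64> = (0..cells)
        .map(|i| {
            let theta = (i as f64 + 0.5) * width;
            theta.powf(alpha - 1.0) * (1.0 - theta).powf(beta - 1.0)
        })
        .collect();
    let total: f64 = weights.iter().sum();
    // Probability mass left outside the interval on each side.
    let tail = (1.0 - level) / 2.0;
    let (mut lower, mut upper) = (0.0, 1.0);
    let mut found_lower = false;
    let mut cumulative = 0.0;
    for (i, w) in weights.iter().enumerate() {
        cumulative += w / total;
        let right_edge = (i as f64 + 1.0) * width;
        if !found_lower && cumulative >= tail {
            lower = right_edge;
            found_lower = true;
        }
        if cumulative >= 1.0 - tail {
            upper = right_edge;
            break;
        }
    }
    (lower, upper)
}
```

For the uniform posterior Beta(1, 1), `credible_interval(1.0, 1.0, 0.95)` returns approximately (0.025, 0.975), as expected for a flat distribution.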

Posterior Predictive Distribution

The posterior predictive integrates over all possible parameter values weighted by the posterior:

$$p(\tilde{x} | D) = \int p(\tilde{x} | \theta) \, p(\theta | D) \, d\theta$$

For Beta-Binomial, the posterior predictive probability of success is:

$$p(\text{success} | D) = \frac{\alpha}{\alpha + \beta} = E[\theta | D]$$

This is the expected probability of success on the next trial, accounting for parameter uncertainty.
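
The step from the integral to the posterior mean is short. For a single next trial, $p(\text{success} | \theta) = \theta$, so with $\alpha$ and $\beta$ denoting the posterior parameters:

$$p(\text{success} | D) = \int_0^1 \theta \, p(\theta | D) \, d\theta = E[\theta | D] = \frac{\alpha}{\alpha + \beta}$$

With a uniform Beta(1, 1) prior and $k$ successes in $n$ trials, the posterior is Beta($k + 1$, $n - k + 1$), and the formula reduces to Laplace's rule of succession:

$$p(\text{success} | D) = \frac{k + 1}{n + 2}$$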

Sequential Bayesian Updating

Bayesian inference naturally handles sequential data:

Start with prior $P(\theta)$
Observe data batch $D_1$, compute posterior $P(\theta | D_1)$
Use $P(\theta | D_1)$ as the new prior
Observe data batch $D_2$, compute posterior $P(\theta | D_1, D_2)$
Repeat indefinitely

Key insight: The final posterior is the same regardless of data order (commutativity), provided the observations are conditionally independent given $\theta$.
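
For the Beta-Binomial family, sequential updating is simple enough to sketch in a few lines of plain Rust. The `BetaBelief` type and its `update` method are hypothetical names for this illustration and are not Aprender's API.

```rust
/// Beta(alpha, beta) belief about a success probability.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BetaBelief {
    alpha: f64,
    beta: f64,
}

impl BetaBelief {
    /// Conjugate update: add successes to alpha and failures to beta.
    /// The returned posterior can serve as the prior for the next batch.
    fn update(self, successes: u32, trials: u32) -> BetaBelief {
        assert!(successes <= trials);
        BetaBelief {
            alpha: self.alpha + f64::from(successes),
            beta: self.beta + f64::from(trials - successes),
        }
    }
}
```

Starting from `BetaBelief { alpha: 1.0, beta: 1.0 }`, the chains `.update(3, 5).update(9, 10)` and `.update(9, 10).update(3, 5)` and the pooled `.update(12, 15)` all arrive at Beta(13, 4), because each path only adds the same counts in a different order.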

This matches the PDCA cycle in the Toyota Production System:

Plan: Specify prior distribution from standardized work
Do: Execute process and collect data (likelihood)
Check: Compute posterior distribution
Act: Update standards (new prior) if needed

Choosing Priors

Non-Informative Priors

Use when you have no prior knowledge:

Uniform Prior: Beta(1, 1) for proportions
Jeffreys Prior: Beta(0.5, 0.5) for invariance
Weakly Informative: Beta(0.1, 0.1) for minimal pseudo-count influence (note that this density is strongly U-shaped, placing most prior mass near 0 and 1)

Informative Priors

Use when you have domain knowledge:

Historical Data: Estimate $\alpha$, $\beta$ from past experiments
Expert Elicitation: Ask domain experts for mean and certainty
Hierarchical Priors: Learn priors from related tasks

Prior Sensitivity Analysis

Always check how results change with different priors:

Run inference with weak prior (e.g., Beta(1, 1))
Run inference with strong prior (e.g., Beta(50, 50))
Compare the posteriors: if they differ drastically, the prior still dominates the data, so collect more data
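
The comparison can be reduced to a single number, for example the gap between the posterior means under the two priors. The sketch below assumes the Beta-Binomial setting and the two example priors named above; the function names are hypothetical, and the gap in means is only one of several reasonable summaries.

```rust
/// Posterior mean of theta under a Beta(alpha, beta) prior after
/// observing `successes` in `trials` Bernoulli trials.
fn posterior_mean(alpha: f64, beta: f64, successes: u32, trials: u32) -> f64 {
    let a = alpha + f64::from(successes);
    let b = beta + f64::from(trials - successes);
    a / (a + b)
}

/// Absolute gap between the posterior means under a weak Beta(1, 1)
/// prior and a strong Beta(50, 50) prior, for the same data.
fn prior_sensitivity(successes: u32, trials: u32) -> f64 {
    let weak = posterior_mean(1.0, 1.0, successes, trials);
    let strong = posterior_mean(50.0, 50.0, successes, trials);
    (weak - strong).abs()
}
```

With 8 successes in 10 trials the two posterior means are 9/12 and 58/110, a gap of about 0.22. With 800 successes in 1000 trials they are 801/1002 and 850/1100, a gap of about 0.027, so at that sample size the choice of prior matters far less.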

Conjugate Families (Summary)

Likelihood	Prior	Posterior	Use Case
Bernoulli/Binomial	Beta	Beta	Binary outcomes (success/fail)
Poisson	Gamma	Gamma	Count data (events per interval)
Normal (known variance)	Normal	Normal	Continuous data with known noise
Normal (unknown variance)	Normal-Inverse-Gamma	Normal-Inverse-Gamma	General continuous data
Multinomial	Dirichlet	Dirichlet	Categorical data (k > 2 classes)
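
The other rows follow the same pattern as the Beta-Binomial case: the update adds sufficient statistics of the data to the prior parameters. As one more illustration, here is a minimal sketch of the Gamma-Poisson row in plain Rust. It assumes the shape-rate parameterization of the Gamma distribution, and the function name is hypothetical.

```rust
/// Conjugate update for a Gamma(shape, rate) belief about a Poisson rate.
/// Adds the total observed count to the shape and the number of observed
/// intervals to the rate. Assumes the shape-rate parameterization.
fn gamma_poisson_update(shape: f64, rate: f64, counts: &[u32]) -> (f64, f64) {
    let total: u32 = counts.iter().sum();
    (shape + f64::from(total), rate + counts.len() as f64)
}
```

Starting from Gamma(2, 1) and observing counts of 3, 5, and 4 events in three intervals gives Gamma(14, 4), whose mean shape/rate is 3.5 events per interval.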

Bayesian vs Frequentist

Aspect	Bayesian	Frequentist
Probability	Degree of belief	Long-run frequency
Parameters	Random variables	Fixed unknowns
Inference	Posterior distribution	Point estimate + SE
Prior knowledge	Incorporated naturally	No formal prior distribution
Uncertainty	Credible intervals	Confidence intervals
Sequential learning	Natural	Requires recomputation
Small data	Works well (results depend more on the prior)	Often unreliable (asymptotic approximations may not hold)

Practical Guidelines

When to use Bayesian inference:

Small datasets where every observation matters
Sequential decision-making (A/B testing, clinical trials)
Incorporating prior knowledge or expert opinion
Need to quantify uncertainty in predictions
Model comparison via Bayes factors

Advantages over frequentist:

Direct probability statements about parameters
Natural handling of sequential data
Automatic regularization through priors
Principled framework for model selection

Disadvantages:

Computationally intensive for complex models (often requires MCMC or other approximate inference)
Prior choice can influence results (requires sensitivity analysis)
Less familiar to many practitioners

Aprender Implementation

Aprender implements conjugate priors with the following design:

use aprender::bayesian::BetaBinomial;

// Prior specification
let mut model = BetaBinomial::uniform();  // Beta(1, 1)

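// Bayesian update with example data (illustrative values): 7 successes in 10 trials
let (successes, trials) = (7, 10);
model.update(successes, trials);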

// Posterior statistics
let mean = model.posterior_mean();
let mode = model.posterior_mode().unwrap();
let variance = model.posterior_variance();

// Credible interval
let (lower, upper) = model.credible_interval(0.95).unwrap();

// Predictive distribution
let prob = model.posterior_predictive();

See the Beta-Binomial case study for complete examples.

References

Cox, R. T. (1946). "Probability, Frequency and Reasonable Expectation." American Journal of Physics, 14(1), 1-13.
Jeffreys, H. (1946). "An Invariant Form for the Prior Probability in Estimation Problems." Proceedings of the Royal Society of London A, 186(1007), 453-461.
Laplace, P.-S. (1814). Essai philosophique sur les probabilités. Translated as A Philosophical Essay on Probabilities (1902).

EXTREME TDD - The Aprender Guide to Zero-Defect Machine Learning