# Bayesian Linear Regression
Bayesian Linear Regression extends ordinary least squares (OLS) regression by treating coefficients as random variables with a prior distribution, enabling uncertainty quantification and natural regularization.
## Theory

### Model
$$ y = X\beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I) $$
Where:
- $y \in \mathbb{R}^n$: target vector
- $X \in \mathbb{R}^{n \times p}$: feature matrix
- $\beta \in \mathbb{R}^p$: coefficient vector
- $\sigma^2$: noise variance
### Conjugate Prior (Normal-Inverse-Gamma)

$$
\begin{aligned}
\beta &\sim \mathcal{N}(\beta_0, \Sigma_0) \\
\sigma^2 &\sim \text{Inv-Gamma}(a_0, b_0)
\end{aligned}
$$

Here $a_0$ and $b_0$ are the shape and scale of the inverse-gamma prior on the noise variance (distinct symbols are used to avoid clashing with the coefficient vector $\beta$).
### Analytical Posterior

With conjugate priors, the posterior has a closed form:

$$
\begin{aligned}
\beta \mid y, X &\sim \mathcal{N}(\beta_n, \Sigma_n), \quad \text{where} \\
\Sigma_n &= \left(\Sigma_0^{-1} + \sigma^{-2} X^T X\right)^{-1} \\
\beta_n &= \Sigma_n \left(\Sigma_0^{-1} \beta_0 + \sigma^{-2} X^T y\right)
\end{aligned}
$$
### Key Properties
- Posterior mean: $\beta_n$ balances prior belief ($\beta_0$) and data evidence ($X^T y$)
- Posterior covariance: $\Sigma_n$ quantifies uncertainty
- Weak prior: As $\Sigma_0 \to \infty$, $\beta_n \to (X^T X)^{-1} X^T y$ (OLS)
- Strong prior: As $\Sigma_0 \to 0$, $\beta_n \to \beta_0$ (ignore data)
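To make the update concrete, here is a minimal from-scratch sketch of the univariate case (all quantities scalar, noise variance treated as known). It is illustrative only and not the aprender API, but it shows how the posterior mean moves from the prior mean toward the OLS estimate as the prior variance grows.

```rust
/// Univariate conjugate update: beta has prior N(beta0, v0) and the noise
/// variance sigma2 is treated as known. Returns (posterior mean, posterior variance).
fn posterior_1d(x: &[f64], y: &[f64], beta0: f64, v0: f64, sigma2: f64) -> (f64, f64) {
    let xtx: f64 = x.iter().map(|xi| xi * xi).sum();                  // X^T X
    let xty: f64 = x.iter().zip(y).map(|(xi, yi)| xi * yi).sum();     // X^T y
    let v_n = 1.0 / (1.0 / v0 + xtx / sigma2);                         // Sigma_n
    let beta_n = v_n * (beta0 / v0 + xty / sigma2);                    // beta_n
    (beta_n, v_n)
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0, 5.0];
    let y = [2.1, 3.9, 6.2, 8.1, 9.8];
    let ols = x.iter().zip(&y).map(|(a, b)| a * b).sum::<f64>()
        / x.iter().map(|a| a * a).sum::<f64>();

    // Weak prior (large variance): posterior mean approaches the OLS estimate.
    let (weak, _) = posterior_1d(&x, &y, 0.0, 1e6, 0.05);
    // Strong prior (tiny variance): posterior mean stays near the prior mean (0 here).
    let (strong, _) = posterior_1d(&x, &y, 0.0, 1e-6, 0.05);

    println!("OLS: {ols:.4}, weak prior: {weak:.4}, strong prior: {strong:.4}");
}
```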
## Example: Univariate Regression with Weak Prior
```rust
use aprender::bayesian::BayesianLinearRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Training data: y ≈ 2x + noise
    let x = Matrix::from_vec(10, 1, vec![
        1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0,
    ]).unwrap();
    let y = Vector::from_vec(vec![
        2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.1, 18.2, 20.0,
    ]);

    // Create model with weak prior
    let mut model = BayesianLinearRegression::new(1);

    // Fit: compute analytical posterior
    model.fit(&x, &y).unwrap();

    // Posterior estimates
    let beta = model.posterior_mean().unwrap();
    let sigma2 = model.noise_variance().unwrap();
    println!("β (slope): {:.4}", beta[0]);  // ≈ 2.0094
    println!("σ² (noise): {:.4}", sigma2);  // ≈ 0.0251

    // Make predictions
    let x_test = Matrix::from_vec(3, 1, vec![11.0, 12.0, 13.0]).unwrap();
    let predictions = model.predict(&x_test).unwrap();
    println!("Prediction at x=11: {:.2}", predictions[0]); // ≈ 22.10
    println!("Prediction at x=12: {:.2}", predictions[1]); // ≈ 24.11
    println!("Prediction at x=13: {:.2}", predictions[2]); // ≈ 26.12
}
```
Output:

```text
β (slope): 2.0094
σ² (noise): 0.0251
Prediction at x=11: 22.10
Prediction at x=12: 24.11
Prediction at x=13: 26.12
```
With a weak prior, the posterior mean is nearly identical to the OLS estimate.
## Example: Informative Prior (Ridge-like Regularization)
```rust
use aprender::bayesian::BayesianLinearRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Small dataset (prone to overfitting)
    let x = Matrix::from_vec(5, 1, vec![1.0, 2.0, 3.0, 4.0, 5.0]).unwrap();
    let y = Vector::from_vec(vec![2.5, 4.1, 5.8, 8.2, 9.9]);

    // Weak prior model
    let mut weak_model = BayesianLinearRegression::new(1);
    weak_model.fit(&x, &y).unwrap();

    // Informative prior: β ~ N(1.5, 1.0)
    let mut strong_model = BayesianLinearRegression::with_prior(
        1,
        vec![1.5], // Prior mean: expect slope around 1.5
        1.0,       // Prior precision (variance = 1.0)
        3.0,       // Noise shape
        2.0,       // Noise scale
    ).unwrap();
    strong_model.fit(&x, &y).unwrap();

    let beta_weak = weak_model.posterior_mean().unwrap();
    let beta_strong = strong_model.posterior_mean().unwrap();
    println!("Weak prior: β = {:.4}", beta_weak[0]);
    println!("Informative prior: β = {:.4}", beta_strong[0]);
}
```
Output:

```text
Weak prior: β = 2.0073
Informative prior: β = 2.0065
```
The informative prior pulls the coefficient toward the prior mean (1.5), acting as L2 (ridge) regularization; because the five observations are highly consistent, the likelihood dominates and the shrinkage here is slight.
## Example: Multivariate Regression
```rust
use aprender::bayesian::BayesianLinearRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Two features: y ≈ 2x₁ + 3x₂ + noise
    let x = Matrix::from_vec(8, 2, vec![
        1.0, 1.0, // row 0
        2.0, 1.0, // row 1
        3.0, 2.0, // row 2
        4.0, 2.0, // row 3
        5.0, 3.0, // row 4
        6.0, 3.0, // row 5
        7.0, 4.0, // row 6
        8.0, 4.0, // row 7
    ]).unwrap();
    let y = Vector::from_vec(vec![
        5.1, 7.2, 11.9, 14.1, 19.2, 21.0, 25.8, 27.9,
    ]);

    // Fit multivariate model
    let mut model = BayesianLinearRegression::new(2);
    model.fit(&x, &y).unwrap();

    let beta = model.posterior_mean().unwrap();
    let sigma2 = model.noise_variance().unwrap();
    println!("β₁: {:.4}", beta[0]); // ≈ 1.9785
    println!("β₂: {:.4}", beta[1]); // ≈ 3.0343
    println!("σ²: {:.4}", sigma2);  // ≈ 0.0262

    // Predictions
    let x_test = Matrix::from_vec(3, 2, vec![
        9.0, 5.0,  // Expected: 2*9 + 3*5 = 33
        10.0, 5.0, // Expected: 2*10 + 3*5 = 35
        10.0, 6.0, // Expected: 2*10 + 3*6 = 38
    ]).unwrap();
    let predictions = model.predict(&x_test).unwrap();
    for i in 0..3 {
        println!("Prediction {}: {:.2}", i, predictions[i]);
    }
}
```
Output:

```text
β₁: 1.9785
β₂: 3.0343
σ²: 0.0262
Prediction 0: 32.98
Prediction 1: 34.96
Prediction 2: 37.99
```
## Comparison: Bayesian vs. OLS
| Aspect | Bayesian Linear Regression | OLS Regression |
|---|---|---|
| Output | Posterior distribution over β | Point estimate β̂ |
| Uncertainty | Full posterior covariance Σₙ | Standard errors (requires additional computation) |
| Regularization | Natural via prior (e.g., ridge) | Requires explicit penalty term |
| Interpretation | Probability statements: P(β ∈ [a, b] \| data) | Frequentist confidence intervals |
| Computation | Analytical (conjugate case) | Analytical (normal equations) |
| Small Data | Regularizes via prior | May overfit |
## Implementation Details

### Simplified Approach (Aprender v0.6)
Aprender uses a simplified diagonal prior:
- $\Sigma_0 = \frac{1}{\lambda} I$ (scalar precision $\lambda$)
- Reduces the cost of the prior term from $O(p^3)$ to $O(p)$
- Still requires $O(p^3)$ for $(X^T X)^{-1}$ via Cholesky decomposition
### Algorithm

1. Compute sufficient statistics: $X^T X$ (Gram matrix) and $X^T y$
2. Estimate the noise variance: $\hat{\sigma}^2 = \frac{1}{n-p} ||y - X\beta_{OLS}||^2$
3. Compute the posterior precision: $\Sigma_n^{-1} = \lambda I + \frac{1}{\hat{\sigma}^2} X^T X$
4. Solve for the posterior mean: $\beta_n = \Sigma_n \left(\lambda \beta_0 + \frac{1}{\hat{\sigma}^2} X^T y\right)$
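The following is a compact sketch of these four steps for a single feature (p = 1) with a zero prior mean; it mirrors the algorithm above but is not the library's internal code.

```rust
/// Simplified posterior update for one feature with prior precision `lambda`
/// and prior mean 0, following the four steps above. Returns (beta_n, sigma2_hat).
fn fit_1d(x: &[f64], y: &[f64], lambda: f64) -> (f64, f64) {
    let n = x.len() as f64;

    // Step 1: sufficient statistics X^T X and X^T y.
    let xtx: f64 = x.iter().map(|xi| xi * xi).sum();
    let xty: f64 = x.iter().zip(y).map(|(xi, yi)| xi * yi).sum();

    // Step 2: noise variance from OLS residuals (n - p degrees of freedom, p = 1).
    let beta_ols = xty / xtx;
    let rss: f64 = x.iter().zip(y).map(|(xi, yi)| (yi - beta_ols * xi).powi(2)).sum();
    let sigma2_hat = rss / (n - 1.0);

    // Step 3: posterior precision  lambda + X^T X / sigma2_hat.
    let precision_n = lambda + xtx / sigma2_hat;

    // Step 4: posterior mean  Sigma_n * (X^T y / sigma2_hat)  (prior mean is 0).
    let beta_n = (xty / sigma2_hat) / precision_n;

    (beta_n, sigma2_hat)
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0];
    let y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.1, 18.2, 20.0];
    let (beta_n, sigma2) = fit_1d(&x, &y, 1e-6); // near-flat prior
    println!("beta_n = {beta_n:.4}, sigma2_hat = {sigma2:.4}");
}
```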
### Numerical Stability
- Uses Cholesky decomposition to solve linear systems
- Numerically stable for well-conditioned $X^T X$
- Prior precision $\lambda > 0$ ensures positive definiteness
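The role of $\lambda > 0$ is easy to see with a deliberately collinear design: $X^T X$ alone is singular and has no Cholesky factor, while $\lambda I + X^T X$ does. A small self-contained sketch (hand-rolled 2 × 2 Cholesky, not library code):

```rust
/// Cholesky factorization of a symmetric 2x2 matrix [[a, b], [b, c]].
/// Returns None when the matrix is not positive definite.
fn cholesky_2x2(a: f64, b: f64, c: f64) -> Option<[f64; 3]> {
    if a <= 0.0 {
        return None;
    }
    let l11 = a.sqrt();
    let l21 = b / l11;
    let rem = c - l21 * l21;
    if rem <= 0.0 {
        return None;
    }
    Some([l11, l21, rem.sqrt()])
}

fn main() {
    // Two perfectly collinear features (second column = 2 * first column):
    // X^T X = [[5, 10], [10, 20]] is singular, so Cholesky fails.
    let (a, b, c) = (5.0, 10.0, 20.0);
    println!("without prior: {:?}", cholesky_2x2(a, b, c)); // None

    // Adding the prior precision lambda to the diagonal restores positive
    // definiteness, so the posterior precision can always be factorized.
    let lambda = 1.0;
    println!("with prior:    {:?}", cholesky_2x2(a + lambda, b, c + lambda)); // Some(...)
}
```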
## Bayesian Interpretation of Ridge Regression

Ridge regression minimizes:

$$ L(\beta) = ||y - X\beta||^2 + \alpha ||\beta||^2 $$
This is equivalent to MAP estimation with:
- Prior: $\beta \sim \mathcal{N}\left(0, \frac{\sigma^2}{\alpha} I\right)$ (prior precision $\lambda = \alpha / \sigma^2$)
- Likelihood: $y \sim \mathcal{N}(X\beta, \sigma^2 I)$
Bayesian regression extends this by computing the full posterior, not just the mode.
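A quick numerical check of this equivalence in the univariate case; this sketch assumes a known noise variance and a zero prior mean, with the penalty mapped to prior precision $\lambda = \alpha / \sigma^2$ as above:

```rust
fn main() {
    // Univariate check that the ridge minimizer equals the MAP (posterior mean)
    // estimate when the prior precision is alpha / sigma2 and the prior mean is 0.
    let x = [1.0, 2.0, 3.0, 4.0, 5.0];
    let y = [2.5, 4.1, 5.8, 8.2, 9.9];
    let (alpha, sigma2) = (2.0, 0.05); // hypothetical penalty and noise variance

    let xtx: f64 = x.iter().map(|xi| xi * xi).sum();
    let xty: f64 = x.iter().zip(&y).map(|(xi, yi)| xi * yi).sum();

    // Ridge: argmin ||y - X beta||^2 + alpha * beta^2  =>  X^T y / (X^T X + alpha)
    let beta_ridge = xty / (xtx + alpha);

    // MAP: posterior mean with prior precision lambda = alpha / sigma2
    let lambda = alpha / sigma2;
    let beta_map = (xty / sigma2) / (lambda + xtx / sigma2);

    println!("ridge: {beta_ridge:.6}, MAP: {beta_map:.6}"); // identical up to rounding
}
```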
## When to Use
Use Bayesian Linear Regression when:
- You want uncertainty quantification (prediction intervals)
- You have small datasets (prior regularizes)
- You have domain knowledge (informative prior)
- You need probabilistic predictions for downstream tasks
Use OLS when:
- You only need point estimates
- You have large datasets (prior has little effect)
- You want computational speed (slightly faster than Bayesian)
## Further Reading

- Kevin Murphy, *Machine Learning: A Probabilistic Perspective*, Chapter 7
- Christopher Bishop, *Pattern Recognition and Machine Learning*, Chapter 3
- Andrew Gelman et al., *Bayesian Data Analysis*, Chapter 14
## See Also
- Normal-Inverse-Gamma Inference - Conjugate prior details
- Ridge Regression - Frequentist regularization (coming soon)
- Bayesian Model Comparison - Marginal likelihood (coming soon)