# Bayesian Logistic Regression
Bayesian Logistic Regression extends maximum likelihood logistic regression by treating coefficients as random variables with a prior distribution, enabling uncertainty quantification for classification tasks.
## Theory

### Model
$$ y \sim \text{Bernoulli}(\sigma(X\beta)), \quad \sigma(z) = \frac{1}{1 + e^{-z}} $$
Where:
- $y \in \{0, 1\}^n$: binary labels
- $X \in \mathbb{R}^{n \times p}$: feature matrix
- $\beta \in \mathbb{R}^p$: coefficient vector
- $\sigma$: sigmoid (logistic) function
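For reference, here is a numerically stable sigmoid in plain Rust (a standalone sketch; the function name and the stability trick are illustrative, not aprender's internal code):

```rust
/// Numerically stable sigmoid: never calls exp() on a large positive
/// argument, avoiding overflow for large |z|.
fn sigmoid(z: f64) -> f64 {
    if z >= 0.0 {
        1.0 / (1.0 + (-z).exp())
    } else {
        let e = z.exp();
        e / (1.0 + e)
    }
}

fn main() {
    // sigma(0) = 0.5, and sigma(-z) = 1 - sigma(z).
    println!("{:.4} {:.4} {:.4}", sigmoid(-1.0), sigmoid(0.0), sigmoid(1.0));
}
```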
### Prior (Gaussian)
$$ \beta \sim \mathcal{N}(0, \lambda^{-1} I) $$
Where $\lambda$ is the precision (inverse variance). Higher $\lambda$ → stronger regularization.
### Posterior Approximation (Laplace)
The Bernoulli likelihood is not conjugate to the Gaussian prior, so the posterior $p(\beta | y, X)$ has no closed form. The Laplace approximation fits a Gaussian at the posterior mode (MAP):
$$ p(\beta | y, X) \approx \mathcal{N}(\beta \mid \beta_{\text{MAP}}, H^{-1}) $$
Where:
- $\beta_{\text{MAP}}$: maximum a posteriori estimate
- $H$: Hessian of the negative log-posterior at $\beta_{\text{MAP}}$
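This Gaussian form follows from a second-order Taylor expansion of the log-posterior around the mode, where the gradient vanishes:
$$ \log p(\beta | y, X) \approx \log p(\beta_{\text{MAP}} | y, X) - \frac{1}{2} (\beta - \beta_{\text{MAP}})^T H (\beta - \beta_{\text{MAP}}) $$
Exponentiating the right-hand side gives an unnormalized Gaussian with mean $\beta_{\text{MAP}}$ and covariance $H^{-1}$.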
### MAP Estimation
Find $\beta_{\text{MAP}}$ by maximizing the log-posterior:
$$ \begin{aligned} \log p(\beta | y, X) &= \log p(y | X, \beta) + \log p(\beta) + \text{const} \\ &= \sum_{i=1}^n \left[ y_i \log \sigma(x_i^T \beta) + (1 - y_i) \log(1 - \sigma(x_i^T \beta)) \right] - \frac{\lambda}{2} \|\beta\|^2 \end{aligned} $$
Use gradient ascent:
$$ \nabla_\beta \log p(\beta | y, X) = X^T (y - p) - \lambda \beta $$
where $p_i = \sigma(x_i^T \beta)$.
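This formula translates directly into an ascent step. The sketch below uses plain slices with a row-major feature layout instead of aprender's `Matrix`/`Vector` types; the helper names and layout are illustrative assumptions:

```rust
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// One gradient-ascent step: beta <- beta + eta * (X^T (y - p) - lambda * beta).
/// `x` is an n-by-p matrix stored row-major.
fn map_gradient_step(x: &[f64], y: &[f64], beta: &mut [f64], lambda: f64, eta: f64) {
    let (n, p) = (y.len(), beta.len());
    let mut grad = vec![0.0; p];
    for i in 0..n {
        // Linear predictor x_i^T beta and predicted probability p_i.
        let z: f64 = (0..p).map(|j| x[i * p + j] * beta[j]).sum();
        let pi = sigmoid(z);
        for j in 0..p {
            grad[j] += x[i * p + j] * (y[i] - pi); // likelihood term X^T (y - p)
        }
    }
    for j in 0..p {
        grad[j] -= lambda * beta[j]; // Gaussian prior pulls beta toward zero
        beta[j] += eta * grad[j];
    }
}

fn main() {
    let x = [-1.0, 0.5, 2.0]; // 3 samples, 1 feature
    let y = [0.0, 1.0, 1.0];
    let mut beta = vec![0.0];
    for _ in 0..200 {
        map_gradient_step(&x, &y, &mut beta, 0.1, 0.1);
    }
    println!("beta after 200 steps: {:.4}", beta[0]);
}
```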
### Hessian (for Uncertainty)
The Hessian at $\beta_{\text{MAP}}$ is:
$$ H = X^T W X + \lambda I $$
where $W = \text{diag}(p_i (1 - p_i))$ is the diagonal weight matrix; $X^T W X$ is the observed Fisher information.
The posterior covariance is $\Sigma = H^{-1}$.
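Both formulas are a few loops in code. A literal translation (again with plain row-major slices; illustrative, not the library's implementation):

```rust
/// H = X^T W X + lambda * I, returned as a row-major p-by-p matrix,
/// where W = diag(p_i (1 - p_i)) and probs[i] = sigmoid(x_i^T beta_MAP).
fn hessian(x: &[f64], probs: &[f64], lambda: f64, p: usize) -> Vec<f64> {
    let n = probs.len();
    let mut h = vec![0.0; p * p];
    for i in 0..n {
        let w = probs[i] * (1.0 - probs[i]); // diagonal weight for sample i
        for j in 0..p {
            for k in 0..p {
                h[j * p + k] += w * x[i * p + j] * x[i * p + k];
            }
        }
    }
    for j in 0..p {
        h[j * p + j] += lambda; // prior precision on the diagonal
    }
    h
}

fn main() {
    // One feature (p = 1): H is a scalar and 1/H is the posterior variance.
    let x = [-1.0, 0.5, 2.0];
    let probs = [0.3, 0.6, 0.9]; // hypothetical fitted probabilities
    let h = hessian(&x, &probs, 0.1, 1);
    println!("H = {:.4}, posterior variance = {:.4}", h[0], 1.0 / h[0]);
}
```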
## Example: Binary Classification with Weak Prior
```rust
use aprender::bayesian::BayesianLogisticRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Training data: y = 1 if x > 0, else 0
    let x = Matrix::from_vec(8, 1, vec![
        -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0,
    ]).unwrap();
    let y = Vector::from_vec(vec![
        0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0,
    ]);

    // Create model with weak prior (precision = 0.1)
    let mut model = BayesianLogisticRegression::new(0.1);

    // Fit: compute MAP estimate and Hessian
    model.fit(&x, &y).unwrap();

    // MAP estimate
    let beta = model.coefficients_map().unwrap();
    println!("β (coefficient): {:.4}", beta[0]); // ≈ 1.4765

    // Make predictions
    let x_test = Matrix::from_vec(3, 1, vec![-1.0, 0.0, 1.0]).unwrap();
    let probas = model.predict_proba(&x_test).unwrap();
    println!("P(y=1 | x=-1.0): {:.4}", probas[0]); // ≈ 0.1860
    println!("P(y=1 | x= 0.0): {:.4}", probas[1]); // ≈ 0.5000
    println!("P(y=1 | x= 1.0): {:.4}", probas[2]); // ≈ 0.8140
}
```
Output:

```text
β (coefficient): 1.4765
P(y=1 | x=-1.0): 0.1860
P(y=1 | x= 0.0): 0.5000
P(y=1 | x= 1.0): 0.8140
```
## Example: Uncertainty Quantification
The Laplace approximation provides credible intervals for predicted probabilities:
```rust
use aprender::bayesian::BayesianLogisticRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Small dataset (higher uncertainty)
    let x = Matrix::from_vec(6, 1, vec![
        -1.5, -1.0, -0.5, 0.5, 1.0, 1.5,
    ]).unwrap();
    let y = Vector::from_vec(vec![0.0, 0.0, 0.0, 1.0, 1.0, 1.0]);

    let mut model = BayesianLogisticRegression::new(0.1);
    model.fit(&x, &y).unwrap();

    // Predict with 95% credible intervals
    let x_test = Matrix::from_vec(2, 1, vec![-2.0, 2.0]).unwrap();
    let probas = model.predict_proba(&x_test).unwrap();
    let (lower, upper) = model.predict_proba_interval(&x_test, 0.95).unwrap();

    for i in 0..2 {
        println!(
            "x={:.1}: P(y=1)={:.4}, 95% CI=[{:.4}, {:.4}]",
            x_test.get(i, 0), probas[i], lower[i], upper[i]
        );
    }
}
```
Output:

```text
x=-2.0: P(y=1)=0.0433, 95% CI=[0.0007, 0.7546]
x= 2.0: P(y=1)=0.9567, 95% CI=[0.2454, 0.9993]
```
The credible intervals are wide due to the small dataset, reflecting high posterior uncertainty.
## Example: Prior Regularization
The prior precision $\lambda$ acts as L2 regularization (ridge penalty):
```rust
use aprender::bayesian::BayesianLogisticRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Tiny dataset (4 samples)
    let x = Matrix::from_vec(4, 1, vec![-1.0, -0.3, 0.3, 1.0]).unwrap();
    let y = Vector::from_vec(vec![0.0, 0.0, 1.0, 1.0]);

    // Weak prior (low regularization)
    let mut weak_model = BayesianLogisticRegression::new(0.1);
    weak_model.fit(&x, &y).unwrap();

    // Strong prior (high regularization)
    let mut strong_model = BayesianLogisticRegression::new(2.0);
    strong_model.fit(&x, &y).unwrap();

    let beta_weak = weak_model.coefficients_map().unwrap();
    let beta_strong = strong_model.coefficients_map().unwrap();
    println!("Weak prior (λ=0.1): β = {:.4}", beta_weak[0]);
    println!("Strong prior (λ=2.0): β = {:.4}", beta_strong[0]);
}
```
Output:

```text
Weak prior (λ=0.1): β = 1.4927
Strong prior (λ=2.0): β = 0.1519
```
The strong prior shrinks the coefficient toward zero, preventing overfitting on the tiny dataset.
## Comparison: Bayesian vs. MLE Logistic Regression
| Aspect | Bayesian (Laplace) | Maximum Likelihood |
|--------|--------------------|--------------------|
| Output | Posterior distribution over β | Point estimate β̂ |
| Uncertainty | Credible intervals via $H^{-1}$ | Standard errors (asymptotic) |
| Regularization | Natural via prior (λ) | Requires explicit penalty |
| Interpretation | Posterior probability $p(\beta \mid \text{data})$ | Frequentist confidence intervals |
| Computation | Gradient ascent + Hessian | IRLS (Newton's method) |
| Small Data | Regularizes via prior | May overfit |
## Implementation Details

### Laplace Approximation Algorithm
- Initialize: $\beta \leftarrow 0$
- Gradient ascent (find MAP), repeated until convergence:
  - Compute predictions: $p_i = \sigma(x_i^T \beta)$
  - Compute gradient: $\nabla = X^T (y - p) - \lambda \beta$
  - Update: $\beta \leftarrow \beta + \eta \nabla$ (learning rate $\eta$)
- Compute Hessian at convergence:
  - $W = \text{diag}(p_i (1 - p_i))$
  - $H = X^T W X + \lambda I$
- Store: $\beta_{\text{MAP}}$ and $H$
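Putting the steps together, here is a compact single-feature sketch of the whole procedure (illustrative only: it hard-codes a learning rate, uses the $10^{-4}$ tolerance and gradient averaging mentioned under Numerical Stability below, and omits the multi-feature handling of the library's `fit()`):

```rust
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Laplace fit for one feature: returns (beta_MAP, H).
fn laplace_fit_1d(x: &[f64], y: &[f64], lambda: f64) -> (f64, f64) {
    let n = x.len() as f64;
    let (eta, tol) = (0.5, 1e-4);
    let mut beta = 0.0; // Initialize beta <- 0.

    // Gradient ascent on the log-posterior.
    loop {
        let mut grad = 0.0;
        for (&xi, &yi) in x.iter().zip(y) {
            grad += xi * (yi - sigmoid(xi * beta)); // X^T (y - p)
        }
        grad -= lambda * beta; // prior term, as in the MAP section above
        let step = eta * grad / n; // averaging by n stabilizes the step size
        beta += step;
        if step.abs() < tol {
            break; // converged: parameter update below tolerance
        }
    }

    // Hessian H = sum_i w_i x_i^2 + lambda, with w_i = p_i (1 - p_i).
    let mut h = lambda;
    for &xi in x {
        let p = sigmoid(xi * beta);
        h += p * (1.0 - p) * xi * xi;
    }
    (beta, h)
}

fn main() {
    let x = [-2.0, -1.0, 1.0, 2.0];
    let y = [0.0, 0.0, 1.0, 1.0];
    let (beta, h) = laplace_fit_1d(&x, &y, 0.1);
    println!("beta_MAP = {:.4}, posterior variance = {:.4}", beta, 1.0 / h);
}
```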
### Credible Intervals for Predictions
For a test point $x_*$:
- Compute the linear predictor variance: $\text{Var}(x_*^T \beta) = x_*^T H^{-1} x_*$
- Compute the z-score for the desired level (e.g., 1.96 for 95%)
- Compute the interval for $z_* = x_*^T \beta$:
  - $z_{\text{lower}} = z_* - 1.96 \sqrt{\text{Var}(z_*)}$
  - $z_{\text{upper}} = z_* + 1.96 \sqrt{\text{Var}(z_*)}$
- Apply the sigmoid to get probability bounds:
  - $p_{\text{lower}} = \sigma(z_{\text{lower}})$
  - $p_{\text{upper}} = \sigma(z_{\text{upper}})$
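These steps translate directly to code. The sketch below assumes the posterior covariance $\Sigma = H^{-1}$ has already been formed (in practice one solves $H v = x_*$ by Cholesky instead, as noted under Numerical Stability); the function name and layout are illustrative:

```rust
/// 95% credible interval for P(y=1 | x_star), given beta_MAP and the
/// posterior covariance sigma = H^{-1} (row-major p-by-p).
fn predict_interval(x_star: &[f64], beta_map: &[f64], sigma: &[f64]) -> (f64, f64) {
    let p = beta_map.len();
    // Linear predictor z_* = x_*^T beta_MAP.
    let z: f64 = (0..p).map(|j| x_star[j] * beta_map[j]).sum();
    // Variance of the linear predictor: x_*^T Sigma x_*.
    let mut var = 0.0;
    for j in 0..p {
        for k in 0..p {
            var += x_star[j] * sigma[j * p + k] * x_star[k];
        }
    }
    let half_width = 1.96 * var.sqrt(); // z-score for the 95% level
    let s = |z: f64| 1.0 / (1.0 + (-z).exp()); // sigmoid
    (s(z - half_width), s(z + half_width))
}

fn main() {
    // Hypothetical one-feature fit: beta_MAP = 1.5, posterior variance 0.4.
    let (lo, hi) = predict_interval(&[2.0], &[1.5], &[0.4]);
    println!("95% CI for P(y=1 | x=2.0): [{:.4}, {:.4}]", lo, hi);
}
```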
### Numerical Stability
- Cholesky decomposition to solve $H v = x_*$ (avoids explicit inversion)
- Gradient averaging by number of samples for stability
- Convergence check on parameter updates (tolerance $10^{-4}$)
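For reference, a Cholesky solve for a symmetric positive definite $H$ looks like this (a textbook sketch without pivoting; not aprender's implementation):

```rust
/// Solve H v = b for symmetric positive definite H (row-major p-by-p)
/// via the factorization H = L L^T, avoiding an explicit inverse.
fn cholesky_solve(h: &[f64], b: &[f64], p: usize) -> Vec<f64> {
    // Factor H = L L^T with L lower-triangular.
    let mut l = vec![0.0; p * p];
    for i in 0..p {
        for j in 0..=i {
            let mut s = h[i * p + j];
            for k in 0..j {
                s -= l[i * p + k] * l[j * p + k];
            }
            l[i * p + j] = if i == j { s.sqrt() } else { s / l[j * p + j] };
        }
    }
    // Forward substitution: L w = b.
    let mut w = vec![0.0; p];
    for i in 0..p {
        let mut s = b[i];
        for k in 0..i {
            s -= l[i * p + k] * w[k];
        }
        w[i] = s / l[i * p + i];
    }
    // Back substitution: L^T v = w.
    let mut v = vec![0.0; p];
    for i in (0..p).rev() {
        let mut s = w[i];
        for k in (i + 1)..p {
            s -= l[k * p + i] * v[k];
        }
        v[i] = s / l[i * p + i];
    }
    v
}

fn main() {
    // 2x2 SPD example: H = [[2, 1], [1, 3]], b = [1, 2]; exact answer [0.2, 0.6].
    let v = cholesky_solve(&[2.0, 1.0, 1.0, 3.0], &[1.0, 2.0], 2);
    println!("v = [{:.4}, {:.4}]", v[0], v[1]);
}
```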
## Bayesian Interpretation of Ridge Regularization
Logistic regression with L2 penalty minimizes:
$$ L(\beta) = -\sum_{i=1}^n \left[ y_i \log \sigma(x_i^T \beta) + (1 - y_i) \log(1 - \sigma(x_i^T \beta)) \right] + \frac{\lambda}{2} \|\beta\|^2 $$
This is equivalent to MAP estimation with Gaussian prior $\beta \sim \mathcal{N}(0, \lambda^{-1} I)$.
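To see the equivalence, take the negative logarithm of Bayes' rule; the Gaussian prior contributes exactly the ridge penalty:
$$ -\log p(\beta | y, X) = -\log p(y | X, \beta) - \log p(\beta) + \text{const} = L(\beta) + \text{const} $$
since $-\log p(\beta) = \frac{\lambda}{2} \|\beta\|^2 + \text{const}$ for $\beta \sim \mathcal{N}(0, \lambda^{-1} I)$. Minimizing $L(\beta)$ therefore maximizes the posterior.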
Bayesian logistic regression extends this by computing the full posterior, not just the mode.
## When to Use
Use Bayesian Logistic Regression when:
- You want uncertainty quantification for predictions
- You have small datasets (prior regularizes)
- You need probabilistic predictions with confidence
- You want interpretable regularization via priors
Use MLE Logistic Regression when:
- You only need point estimates and class labels
- You have large datasets (prior has little effect)
- You want computational speed (no Hessian computation)
## Limitations
Laplace Approximation:
- Assumes posterior is Gaussian (may be poor for highly skewed posteriors)
- Captures only the local Gaussian (second-order) shape of the posterior, ignoring skewness and higher moments
- Requires MAP convergence (may fail for ill-conditioned problems)
For Better Posterior Estimates:
- Use MCMC (Phase 2) for full posterior samples
- Use Variational Inference (Phase 2) for scalability
- Use Expectation Propagation for non-Gaussian posteriors
## Further Reading
- Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 8
- Christopher Bishop, Pattern Recognition and Machine Learning, Chapter 4
- Radford Neal, Bayesian Learning for Neural Networks (Laplace approximation)
## See Also
- Bayesian Linear Regression - Conjugate case with analytical posterior
- Logistic Regression - Maximum likelihood baseline (coming soon)
- MCMC Methods - Full posterior sampling (Phase 2)