Bayesian classification#
Mini Summary#
- Bayesian Ideas in Discrete Settings
- Beta-Binomial conjugacy
- Conjugate pairs; uniqueness up to proportionality
- Bayesian Classification (Logistic Regression)
- Non-conjugacy necessitates approximation
- Rejection Sampling
- Monte Carlo sampling: a classic method to approximate the posterior
- Next time: probabilistic graphical models
1. Bayesian View on Discrete Data:#
Generative vs. Discriminative Models:
- Generative models capture the joint distribution \( p(x, y) \).
- Discriminative models condition on the input, capturing \( p(y | x) \).
- Example: Naïve Bayes (generative) vs. Logistic Regression (discriminative).
2. Beta-Binomial Conjugacy:#
- Likelihood: \( p(k|n, q) = \binom{n}{k} q^k (1 - q)^{n-k} \).
- Prior: \( p(q) = \text{Beta}(q; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} q^{\alpha-1} (1 - q)^{\beta-1} \).
- Posterior: \( p(q|k, n) = \text{Beta}(q; k + \alpha, n - k + \beta) \).
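Because the posterior is again a Beta distribution, the update amounts to adding counts to the hyperparameters. A minimal sketch using SciPy (the counts \( k, n \) and hyperparameters \( \alpha, \beta \) below are illustrative):

```python
from scipy import stats

# Illustrative data: k successes out of n Bernoulli(q) trials.
k, n = 7, 10
alpha, beta = 2.0, 2.0  # Beta prior hyperparameters (assumed values)

# Conjugate update: Beta(alpha, beta) prior + Binomial likelihood
# gives a Beta(k + alpha, n - k + beta) posterior.
posterior = stats.beta(k + alpha, n - k + beta)

print("posterior mean:", posterior.mean())  # (k + alpha) / (n + alpha + beta)
print("95% credible interval:", posterior.interval(0.95))
```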
3. Uniqueness up to Normalization:#
- If an unnormalized distribution is proportional to a recognized distribution, it's essentially that distribution once normalized.
- \( f(\theta) \propto g(\theta) \) implies \( f(\theta) = C \times g(\theta) \) for some normalization constant \( C \).
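This is exactly how the Beta posterior above is recognized: keeping only the factors that depend on \( q \),

\[
p(q | k, n) \;\propto\; \underbrace{q^{k}(1-q)^{n-k}}_{\text{likelihood}} \, \underbrace{q^{\alpha-1}(1-q)^{\beta-1}}_{\text{prior}} \;=\; q^{(k+\alpha)-1}(1-q)^{(n-k+\beta)-1},
\]

which is the kernel of \( \text{Beta}(k + \alpha, n - k + \beta) \), so the normalization constant must be the corresponding Beta normalizer \( \frac{\Gamma(n + \alpha + \beta)}{\Gamma(k + \alpha)\,\Gamma(n - k + \beta)} \).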
4. Laplace’s Sunrise Problem:#
- Predict the probability that the sun will rise tomorrow, given that it has risen on every one of the \( k \) days observed so far.
- Bayesian approach with Beta-Binomial and a uniform \( \text{Beta}(1, 1) \) prior: \( p(q|k) = \text{Beta}(q; k + 1, 1) \).
- Expected probability (Laplace's rule of succession): \( E_p[q|k] = \frac{k + 1}{k + 2} \).
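A one-line check of the rule of succession (the value of \( k \) is illustrative):

```python
from scipy import stats

k = 365 * 30                      # illustrative: roughly 30 years of observed sunrises
posterior = stats.beta(k + 1, 1)  # Beta(k + 1, 1) posterior under a uniform prior

# The posterior mean matches Laplace's rule of succession, (k + 1) / (k + 2).
print(posterior.mean(), (k + 1) / (k + 2))
```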
5. Bayesian Logistic Regression:#
- Discriminative classifier: \( p(y=1|x, w) = \frac{1}{1 + e^{-w^T x}} \).
- Gaussian prior: \( p(w) = \text{Normal}(w; 0, \sigma^2 I) \).
- Inference: the Gaussian prior is not conjugate to the logistic likelihood, so the posterior has no closed form; approximate it with methods such as MCMC, the Laplace approximation, or variational inference.
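Whatever approximation is chosen, the quantity it works with is the unnormalized log posterior. A minimal sketch, assuming a design matrix X, binary labels y, and prior variance sigma2 as placeholders:

```python
import numpy as np

def log_joint(w, X, y, sigma2=1.0):
    """Unnormalized log posterior log p(y | X, w) + log p(w) for Bayesian
    logistic regression with a Normal(0, sigma2 * I) prior on the weights."""
    logits = X @ w
    # Bernoulli log-likelihood in a numerically stable form:
    # y * log sigma(z) + (1 - y) * log(1 - sigma(z)) = y * z - log(1 + e^z)
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.dot(w, w) / sigma2  # up to an additive constant
    return log_lik + log_prior
```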
6. Laplace Approximation:#
Approximate a complex distribution with a Gaussian centered at its mode.
Procedure:
1. Find the mode \( \theta_{\text{MAP}} \) of the posterior (the MAP estimate).
2. Measure the curvature at the mode via the Hessian of the negative log posterior, \( H = -\nabla^2 \log p(\theta|y) \big|_{\theta = \theta_{\text{MAP}}} \).
3. Form the Gaussian approximation \( p(\theta|y) \approx \text{Normal}(\theta; \theta_{\text{MAP}}, H^{-1}) \).
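A minimal 1-D sketch, using the Beta-Binomial posterior from earlier as the target so the result can be compared with the exact density (the counts \( k, n \) are illustrative and the Hessian is taken by finite differences):

```python
import numpy as np
from scipy import optimize, stats

# Target: the Beta(k + 1, n - k + 1) posterior for q, via its log density.
k, n = 7, 10
neg_log_post = lambda q: -stats.beta(k + 1, n - k + 1).logpdf(q)

# 1. Find the mode (MAP estimate) by minimizing the negative log posterior.
res = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded")
q_map = res.x

# 2. Curvature at the mode: second derivative of -log p, by central differences.
eps = 1e-5
hess = (neg_log_post(q_map + eps) - 2 * neg_log_post(q_map) + neg_log_post(q_map - eps)) / eps**2

# 3. Gaussian approximation Normal(q_map, 1 / hess).
laplace = stats.norm(q_map, np.sqrt(1.0 / hess))
print("MAP:", q_map, "approximate posterior sd:", laplace.std())
```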
Rejection Sampling#
Additional resource
- Objective: The goal of rejection sampling is to generate samples from a target distribution, here the posterior \( p(\theta|y) \). Directly sampling from this distribution may be difficult, so we use an auxiliary or proposal distribution, denoted \( g(\theta) \), from which we can easily sample.
- Un-normalised density: Often in Bayesian statistics the exact posterior \( p(\theta|y) \) is hard to compute because of the denominator (the evidence or marginal likelihood). Instead we work with an un-normalised version, \( q(\theta|y) \), which is proportional to the true posterior but does not necessarily integrate to 1.
- Condition: The inequality \( q(\theta|y) \leq M' \times g(\theta) \) for all \( \theta \) means that when we scale the proposal distribution \( g(\theta) \) by a factor \( M' \), it is always above or equal to the un-normalised posterior \( q(\theta|y) \). This ensures that the scaled proposal distribution "encases" or "envelops" the target distribution.
- Proposal distribution details: In this example, \( g(\theta) = 0.5 \) is a constant density and \( \theta \) is sampled from a uniform distribution on \([-1, 1]\), denoted \( U(-1,1) \). For any \( \theta \) in \([-1, 1]\), the value of \( g(\theta) \) is 0.5.
Given this setup, the process of rejection sampling would be:
a. Sample a value \( \theta \) from \( U(-1,1) \).
b. Evaluate the un-normalised posterior \( q(\theta|y) \) at this \( \theta \).
c. Sample a value \( u \) from a uniform distribution between 0 and \( M' \times g(\theta) = 0.5 \times M' \).
d. If \( u \leq q(\theta|y) \), accept \( \theta \) as a sample from the posterior; otherwise, reject it and return to step (a).
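The four steps above can be sketched directly in code. The un-normalised target below is a made-up stand-in for \( q(\theta|y) \) (a Gaussian bump on \([-1, 1]\)), and \( M' = 2 \) is chosen so that \( q(\theta|y) \leq M' \times 0.5 \) everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative un-normalised target on [-1, 1] (stands in for q(theta | y)).
q = lambda theta: np.exp(-5.0 * (theta - 0.3) ** 2)

g = 0.5   # constant proposal density of U(-1, 1)
M = 2.0   # chosen so that q(theta) <= M * g for all theta (max of q is 1)

def rejection_sample(n_samples):
    samples = []
    while len(samples) < n_samples:
        theta = rng.uniform(-1.0, 1.0)   # a. draw theta from the proposal U(-1, 1)
        u = rng.uniform(0.0, M * g)      # c. draw u from U(0, 0.5 * M')
        if u <= q(theta):                # d. accept if u falls below q(theta | y)
            samples.append(theta)
    return np.array(samples)

draws = rejection_sample(5000)
print("accepted sample mean:", draws.mean(), "sd:", draws.std())
```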