Relationship#
MLE and EM#
1. Maximum Likelihood Estimation (MLE):#
Given a set of independent and identically distributed observations \( \mathbf{X} = \{ x_1, x_2, \ldots, x_n \} \) and a parameterized probability distribution \( P(x | \theta) \), where \( \theta \) represents the parameters of the distribution, the likelihood function \( \mathcal{L}(\theta | \mathbf{X}) \) is given by:
\[ \mathcal{L}(\theta | \mathbf{X}) = \prod_{i=1}^{n} P(x_i | \theta) \]
The MLE estimate \( \hat{\theta} \) maximizes this likelihood:
\[ \hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta | \mathbf{X}) \]
In practice, it is often more convenient to maximize the log-likelihood:
\[ \log \mathcal{L}(\theta | \mathbf{X}) = \sum_{i=1}^{n} \log P(x_i | \theta) \]
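As a minimal numerical sketch (assuming a univariate Gaussian model and synthetic data, both chosen purely for illustration), the log-likelihood can be maximized with a generic optimizer and checked against the closed-form Gaussian MLE:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic i.i.d. data from an assumed Gaussian model (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params, data):
    """Negative Gaussian log-likelihood; minimizing it gives the MLE."""
    mu, log_sigma = params          # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)    # numerical MLE
print(x.mean(), x.std())    # closed-form Gaussian MLE (sample mean, ddof=0 std)
```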
2. Expectation-Maximization (EM) Algorithm:#
Assume we have observed data \( \mathbf{X} \) and missing or latent data \( \mathbf{Z} \). The complete-data likelihood is given by \( P(\mathbf{X}, \mathbf{Z} | \theta) \).
The EM algorithm aims to maximize the marginal likelihood of the observed data \( P(\mathbf{X} | \theta) \), which is often difficult to optimize directly because it requires summing or integrating over the latent variables \( \mathbf{Z} \).
The EM algorithm consists of two steps:
E-step: Calculate the expected value of the complete-data log-likelihood with respect to the posterior distribution of the latent data given the current parameter estimate:
\[ Q(\theta | \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^{(t)}} \left[ \log P(\mathbf{X}, \mathbf{Z} | \theta) \right] \]
Where \( \theta^{(t)} \) is the current estimate of the parameters.
M-step: Maximize the function \( Q \) with respect to \( \theta \):
\[ \theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)}) \]
This step is itself a maximum likelihood estimation, carried out on the expected complete-data log-likelihood computed in the E-step.
The process iterates between these two steps until convergence.
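A compact sketch of these two steps for a two-component univariate Gaussian mixture fit to synthetic data (the component labels play the role of \( \mathbf{Z} \); all values here are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Synthetic data from two Gaussians; the component labels are the latent Z
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

# Initial guesses: mixing weight of component 1, component means and std devs
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibility (posterior probability) of component 1 per point
    p0 = (1 - pi) * norm.pdf(x, mu[0], sigma[0])
    p1 = pi * norm.pdf(x, mu[1], sigma[1])
    r = p1 / (p0 + p1)

    # M-step: weighted MLE updates using the responsibilities
    pi = r.mean()
    mu = np.array([np.average(x, weights=1 - r), np.average(x, weights=r)])
    sigma = np.array([
        np.sqrt(np.average((x - mu[0]) ** 2, weights=1 - r)),
        np.sqrt(np.average((x - mu[1]) ** 2, weights=r)),
    ])

print(pi, mu, sigma)   # should approach 0.7, [-2, 3], [1, 1.5]
```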
Relationship:#
The EM algorithm provides a way to simplify the maximization of the likelihood \( P(\mathbf{X} | \theta) \) by introducing the latent variables \( \mathbf{Z} \) and working with the complete-data likelihood \( P(\mathbf{X}, \mathbf{Z} | \theta) \). The M-step of the EM algorithm, where the maximization occurs, is a form of MLE for the expected complete data.
Maximum A Posteriori (MAP) estimation is closely related to Bayesian statistics and involves the posterior distribution.
MAP and Posterior#
1. Bayesian Framework:#
In Bayesian statistics, we update our beliefs about a parameter \( \theta \) based on observed data \( \mathbf{X} \) using Bayes’ theorem:
\[ P(\theta | \mathbf{X}) = \frac{P(\mathbf{X} | \theta) \, P(\theta)}{P(\mathbf{X})} \]
Where:
\( P(\theta | \mathbf{X}) \) is the posterior distribution of \( \theta \) given the data \( \mathbf{X} \).
\( P(\mathbf{X} | \theta) \) is the likelihood of the data given the parameter \( \theta \).
\( P(\theta) \) is the prior distribution of \( \theta \), representing our beliefs about \( \theta \) before seeing the data.
\( P(\mathbf{X}) \) is the marginal likelihood of the data, which acts as a normalizing constant.
2. Maximum A Posteriori (MAP) Estimation:#
MAP estimation aims to find the value of \( \theta \) that maximizes the posterior distribution. Mathematically, the MAP estimate \( \hat{\theta}_{MAP} \) is:
\[ \hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta | \mathbf{X}) \]
Expanding using Bayes’ theorem:
\[ \hat{\theta}_{MAP} = \arg\max_{\theta} \frac{P(\mathbf{X} | \theta) \, P(\theta)}{P(\mathbf{X})} = \arg\max_{\theta} P(\mathbf{X} | \theta) \, P(\theta) \]
Notice that we don’t include \( P(\mathbf{X}) \) in the maximization since it’s a constant with respect to \( \theta \).
Relationship:#
MAP estimation can be viewed as a regularized version of Maximum Likelihood Estimation (MLE). While MLE only considers the likelihood \( P(\mathbf{X} | \theta) \), MAP also takes into account the prior \( P(\theta) \). The prior can be thought of as introducing regularization or bias based on prior beliefs about \( \theta \).
To summarize:
MLE aims to find the parameter values that maximize the likelihood of the observed data.
MAP aims to find the parameter values that maximize the posterior distribution, which considers both the likelihood of the observed data and the prior beliefs about the parameters.
In cases where the prior is uniform (i.e., constant for all \( \theta \)), the MAP estimate reduces to the MLE.
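A small worked example of this contrast, assuming a Bernoulli (coin-flip) likelihood and a Beta prior chosen purely for illustration; both estimates have closed forms:

```python
import numpy as np

rng = np.random.default_rng(2)
# 20 flips of a coin whose true bias is 0.7
flips = rng.binomial(1, 0.7, size=20)
heads, n = flips.sum(), flips.size

# MLE: maximizes the Bernoulli likelihood
theta_mle = heads / n

# MAP with a Beta(alpha, beta) prior: the posterior is
# Beta(heads + alpha, n - heads + beta), whose mode is the MAP estimate
alpha, beta = 2.0, 2.0      # illustrative prior pseudo-counts
theta_map = (heads + alpha - 1) / (n + alpha + beta - 2)

print(theta_mle, theta_map)
# With a uniform prior (alpha = beta = 1), theta_map equals theta_mle.
```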
MAP and ridge regression#
The relationship between Maximum A Posteriori (MAP) estimation and Ridge Regression is best understood by looking at Ridge Regression in a Bayesian context.
1. Ridge Regression:#
Ridge Regression is a linear regression method that introduces an L2 regularization term. Given a dataset with predictors \( \mathbf{X} \) and responses \( \mathbf{y} \), Ridge Regression estimates \( \boldsymbol{\beta} \) by minimizing:
\[ \hat{\boldsymbol{\beta}}_{ridge} = \arg\min_{\boldsymbol{\beta}} \; ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2_2 + \lambda ||\boldsymbol{\beta}||^2_2 \]
Where:
\( ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2_2 \) is the squared error term, similar to ordinary least squares.
\( \lambda \) is a regularization parameter.
\( ||\boldsymbol{\beta}||^2_2 \) is the L2 norm of the coefficients, which penalizes large coefficient values.
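A minimal sketch of this objective on synthetic data (no intercept, and an illustrative value of \( \lambda \)), using the closed-form minimizer \( (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} \):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # regularization strength (illustrative)

# Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Value of the ridge objective at the solution
objective = np.sum((y - X @ beta_ridge) ** 2) + lam * np.sum(beta_ridge ** 2)
print(beta_ridge, objective)
```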
2. Bayesian Interpretation:#
Let’s interpret Ridge Regression in a Bayesian framework:
Assume a linear model:
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Where \( \boldsymbol{\epsilon} \) is normally distributed noise with mean 0 and variance \( \sigma^2 \).
The likelihood of the data given the parameters is:
\[ P(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) = \mathcal{N}(\mathbf{y} \,|\, \mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}) \]
Now, let’s introduce a prior on \( \boldsymbol{\beta} \). For Ridge Regression, this prior is Gaussian with mean 0 and variance \( \tau^2 \):
\[ P(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{\beta} \,|\, \mathbf{0}, \tau^2 \mathbf{I}) \]
The posterior distribution of \( \boldsymbol{\beta} \) given the data is:
\[ P(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto P(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \, P(\boldsymbol{\beta}) \]
Relationship:#
Performing a MAP estimate on this posterior is equivalent to finding the \( \boldsymbol{\beta} \) that maximizes:
\[ P(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \, P(\boldsymbol{\beta}) \]
Taking the negative logarithm of this expression and discarding constants (which do not affect the location of the optimum) yields an objective function of the same form as the Ridge Regression cost function. The regularization parameter \( \lambda \) corresponds to the ratio \( \frac{\sigma^2}{\tau^2} \).
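Spelling out that step under the Gaussian assumptions above:
\[ -\log \left[ P(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \, P(\boldsymbol{\beta}) \right] = \frac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2_2 + \frac{1}{2\tau^2} ||\boldsymbol{\beta}||^2_2 + \text{const} \]
Multiplying through by \( 2\sigma^2 \), which does not change the minimizer, gives \( ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2_2 + \frac{\sigma^2}{\tau^2} ||\boldsymbol{\beta}||^2_2 \), i.e. the Ridge objective with \( \lambda = \frac{\sigma^2}{\tau^2} \).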
Summary:#
Ridge Regression can be viewed as a MAP estimation where:
The likelihood of the data is Gaussian with mean \( \mathbf{X}\boldsymbol{\beta} \) and variance \( \sigma^2 \).
The prior on the regression coefficients \( \boldsymbol{\beta} \) is Gaussian with mean 0 and variance \( \tau^2 \).
The L2 penalty in Ridge Regression corresponds to the assumption of a Gaussian prior on the regression coefficients. The strength of the regularization, \( \lambda \), is tied to the relative variances of the likelihood and the prior.
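A quick numerical check of this correspondence on synthetic data (the values of \( \sigma \) and \( \tau \) are assumed for illustration): minimizing the negative log-posterior directly should recover the closed-form ridge solution with \( \lambda = \sigma^2 / \tau^2 \).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=n)

sigma, tau = 0.5, 1.0    # assumed noise and prior standard deviations

def neg_log_posterior(beta):
    """Gaussian likelihood plus Gaussian prior, up to additive constants."""
    return (np.sum((y - X @ beta) ** 2) / (2 * sigma**2)
            + np.sum(beta**2) / (2 * tau**2))

beta_map = minimize(neg_log_posterior, x0=np.zeros(p)).x

# Ridge with lambda = sigma^2 / tau^2 should match the MAP estimate
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.max(np.abs(beta_map - beta_ridge)))   # close to zero
```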
There are several regularization techniques and methods in statistical machine learning that have Bayesian interpretations. Let’s discuss some of the notable ones:
Lasso Regression and Laplace Prior:
Lasso Regression introduces an L1 regularization term. The objective function is:
\[ \hat{\boldsymbol{\beta}}_{lasso} = \arg\min_{\boldsymbol{\beta}} \; ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2_2 + \lambda ||\boldsymbol{\beta}||_1 \]
From a Bayesian perspective, Lasso Regression can be interpreted as a MAP estimation with a Laplace prior on the coefficients. This prior leads to sparsity in the estimated coefficients.
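A brief sketch with scikit-learn's `Lasso` on synthetic data (feature sizes and penalty strength are illustrative; scikit-learn scales the squared-error term by \( 1/(2n) \), which only rescales \( \lambda \)), showing the sparsity induced by the L1 penalty:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]     # only 3 of 10 features matter
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# alpha plays the role of lambda in the Lasso objective
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients are driven exactly to zero
```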
Elastic Net Regression and a Combination of Gaussian and Laplace Priors:
Elastic Net combines L1 and L2 regularization:
\[ \hat{\boldsymbol{\beta}}_{EN} = \arg\min_{\boldsymbol{\beta}} \; ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2_2 + \lambda_1 ||\boldsymbol{\beta}||_1 + \lambda_2 ||\boldsymbol{\beta}||^2_2 \]
From a Bayesian standpoint, this can be seen as a MAP estimation with a combination of Gaussian (for L2) and Laplace (for L1) priors on the coefficients.
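Analogously, a short sketch with scikit-learn's `ElasticNet` (penalty settings are illustrative; `l1_ratio` controls the mix of L1 and L2):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
beta_true = np.concatenate([[2.0, -1.5, 1.0], np.zeros(7)])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

# alpha sets the overall penalty strength; l1_ratio the share of L1 vs. L2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```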
Gaussian Processes and Kernel Functions:
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It’s a non-parametric method used for regression and classification.
The choice of kernel function in a Gaussian process can be viewed as specifying a prior over functions.
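A small sketch of what "a prior over functions" means in practice (kernel hyperparameters are illustrative): draw sample functions from a GP prior with a squared-exponential (RBF) kernel.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: prior covariance between function values."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / length_scale**2)

x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

# Draw three sample functions from the GP prior N(0, K)
rng = np.random.default_rng(7)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)   # (3, 100): three functions evaluated at 100 inputs
```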
Variational Inference:
This is an alternative to traditional sampling methods for approximating posterior distributions.
It involves finding a simpler distribution (like a Gaussian) that is close to the true posterior. The “closeness” is measured using the Kullback-Leibler (KL) divergence.
The approach can be viewed as an optimization problem where the objective is to minimize the KL divergence between the true posterior and the approximating distribution.
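A toy sketch of this optimization view (the 1-D mixture target and all settings are purely illustrative): fit a Gaussian \( q(z) \) to an unnormalized target by minimizing a Monte Carlo estimate of the reverse KL divergence.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def log_p_unnorm(z):
    """Unnormalized log density of a toy 'true posterior' (a 1-D mixture)."""
    return np.log(0.3 * norm.pdf(z, -2, 0.5) + 0.7 * norm.pdf(z, 1, 1.0))

def neg_elbo(params, n_samples=2000):
    """Monte Carlo estimate of KL(q || p) up to p's normalizing constant."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    eps = np.random.default_rng(0).standard_normal(n_samples)  # fixed noise
    z = mu + sigma * eps                                        # reparameterization
    return np.mean(norm.logpdf(z, mu, sigma) - log_p_unnorm(z))

res = minimize(neg_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
mu_q, sigma_q = res.x[0], np.exp(res.x[1])
print(mu_q, sigma_q)   # the Gaussian closest to the mixture in reverse KL
```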
Dropout in Neural Networks and Bayesian Approximation:
Dropout is a regularization technique for neural networks where units are randomly “dropped out” during training.
From a Bayesian perspective, dropout in neural networks can be seen as a form of approximate Bayesian inference. It provides a form of model averaging, where each sub-network (with dropped-out units) can be viewed as a “sample” from the posterior distribution of networks.
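A numpy-only sketch of the idea on a tiny network with random weights (everything here is illustrative): keeping dropout active at prediction time and averaging many stochastic forward passes yields a predictive mean and an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(8)
# A tiny one-hidden-layer network with fixed random weights (illustration only)
W1, b1 = rng.normal(size=(1, 50)), np.zeros(50)
W2, b2 = rng.normal(size=(50, 1)), np.zeros(1)

def forward(x, drop_rate=0.5):
    """One stochastic forward pass with dropout applied to the hidden layer."""
    h = np.maximum(0, x @ W1 + b1)              # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate      # randomly drop hidden units
    h = h * mask / (1 - drop_rate)              # inverted-dropout scaling
    return h @ W2 + b2

x = np.array([[0.5]])
preds = np.array([forward(x) for _ in range(100)])   # 100 "posterior samples"
print(preds.mean(), preds.std())   # predictive mean and uncertainty estimate
```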
Hierarchical Models and Hyperpriors:
In Bayesian statistics, hierarchical models involve placing priors on parameters, and then further placing priors on those priors, called hyperpriors.
This structure captures different levels of variation in the data and can be seen in models like mixed-effects models.
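A generative sketch of such a hierarchy (all hyperparameter choices are illustrative): a hyperprior on the between-group spread, group means drawn from it, and observations drawn within each group.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hyperprior: between-group standard deviation
tau = rng.gamma(shape=2.0, scale=1.0)

# Prior: five group means drawn around a common mean with spread tau
group_means = rng.normal(loc=0.0, scale=tau, size=5)

# Likelihood: 20 observations per group around its own mean
data = {g: rng.normal(loc=m, scale=1.0, size=20)
        for g, m in enumerate(group_means)}

print(tau, group_means)
print({g: round(float(d.mean()), 2) for g, d in data.items()})   # per-group sample means
```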