Deriving Maximum Likelihood Estimation from Information Theory

In fields such as statistics and machine learning, Maximum Likelihood Estimation (MLE) is one of the most commonly used methods for parameter estimation. There are many reasons why we favor MLE: beyond its intuitive appeal, it also has excellent mathematical properties such as consistency, invariance, and asymptotic normality. This article offers another perspective on understanding MLE: deriving it from information theory.

Assume that the observed data $x_1, \dots, x_n$ are independent draws from a distribution $X \sim p_\theta$, where $p_\theta$ is the probability density function (pdf) of the distribution and $\theta$ is the parameter to be estimated. From the data, we can form its empirical distribution, whose empirical cumulative distribution function is defined as:

\[\text{Pr}(X \leq x) = \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i \leq x)\]

where $\mathbb{1}$ is the indicator function. The corresponding empirical probability density function¹ is:

\[\hat{p}(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)\]

where $\delta$ is the Dirac delta function.
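
To make these definitions concrete, here is a minimal Python sketch (the function names are mine, purely illustrative): the empirical CDF counts the fraction of observations at or below $x$, and, in the delta-function view, expectations under $\hat{p}$ collapse to sample averages.

```python
import numpy as np

def empirical_cdf(data, x):
    """F_hat(x): fraction of observations with x_i <= x."""
    return np.mean(np.asarray(data) <= x)

def empirical_expectation(g, data):
    """E_{X ~ p_hat}[g(X)] = (1/n) * sum_i g(x_i),
    which is what integrating g against the Dirac deltas yields."""
    return np.mean(g(np.asarray(data)))

sample = np.array([0.2, 1.5, 1.5, 3.0])
print(empirical_cdf(sample, 1.5))             # 0.75: three of four points <= 1.5
print(empirical_expectation(np.log, sample))  # sample average of log(x_i)
```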

We want the estimated distribution $p_\theta$ to align with the observed data as closely as possible; in other words, $p_\theta$ should be as similar as possible to the empirical distribution $\hat{p}$. But how can we measure the similarity between two distributions? Information theory provides a tool: the Kullback–Leibler (KL) divergence, a measure of how one distribution diverges from another (see my Note on Mutual Information for its definition and interpretation). It is not a true distance, since it is asymmetric, but it serves as a natural notion of closeness here. To make $p_\theta$ as close as possible to $\hat{p}$, we minimize the KL divergence between these two distributions:

\[\begin{align*} D_{KL}(\hat{p} \mid\mid p_\theta) & = \mathbb{E}_{X \sim \hat{p}} \left[ \log \frac{\hat{p}(X)}{p_\theta(X)} \right] \\ & = \mathbb{E}_{X \sim \hat{p}} [\log \hat{p}(X)] - \mathbb{E}_{X \sim \hat{p}} [\log p_\theta(X)] \end{align*}\]

Note that $\mathbb{E}_{X \sim \hat{p}} [\log \hat{p}(X)]$ is independent of $\theta$, so we have:

\[\begin{align*} \min_\theta D_{KL}(\hat{p} \mid\mid p_\theta) & \Leftrightarrow \max_\theta \mathbb{E}_{X \sim \hat{p}} [\log p_\theta(X)] \\ & \Leftrightarrow \max_\theta \frac{1}{n} \sum_{i=1}^n \int \log p_\theta(x) \, \delta(x - x_i) \, dx \\ & \Leftrightarrow \max_\theta \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i) \end{align*}\]

That is, we maximize the log-likelihood function. Thus, starting from minimizing the KL divergence between the empirical distribution and the estimated distribution, we have derived Maximum Likelihood Estimation.
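
As a sanity check on this equivalence, here is a minimal Python sketch (the Gaussian model, the synthetic data, and all names are my own assumptions, not part of the derivation) that fits a normal distribution by numerically minimizing the negative average log-likelihood, i.e., the $\theta$-dependent part of the KL divergence above, and compares the result with the closed-form Gaussian MLE:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic sample

def neg_avg_loglik(params, x):
    """Negative of (1/n) * sum_i log p_theta(x_i) for a Gaussian;
    minimizing it is equivalent to minimizing D_KL(p_hat || p_theta)."""
    mu, log_sigma = params            # parameterize log(sigma) so sigma > 0
    return -np.mean(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_avg_loglik, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Closed-form Gaussian MLE: sample mean and biased sample std (ddof=0).
print(mu_hat, data.mean())
print(sigma_hat, data.std())
```

The two pairs of numbers should agree to optimizer precision, illustrating that minimizing the KL divergence to $\hat{p}$ and maximizing the log-likelihood select the same parameters.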

  1. This is a generalized function; it is not obtained by differentiating the empirical cumulative distribution function in the ordinary sense, but it satisfies the properties of a probability density function. 



