1.2 Probability Theory

Jan 2021

Wufei Ma
Purdue University

Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.

The rules of probability. The sum rule is \[ p(X) = \sum_Y p(X, Y) \] and the product rule is \[ p(X, Y) = p(Y \mid X)p(X) \]

Bayes' theorem. From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we obtain Bayes' theorem: \[ p(Y \mid X) = \frac{p(X \mid Y)p(Y)}{p(X)} \]

Two random variables $X$ and $Y$ are said to be independent if $p(X, Y) = p(X)p(Y)$.
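As a quick sanity check, the sum rule, product rule, Bayes' theorem, and the independence criterion can all be verified numerically on a small joint distribution over two binary variables (the table below is an arbitrary example of my own, not from the text):

```python
import numpy as np

# Hypothetical joint distribution: p_xy[i, j] = p(X = i, Y = j).
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

# Sum rule: marginals are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)          # p(X)
p_y = p_xy.sum(axis=0)          # p(Y)

# Product rule: p(X, Y) = p(Y | X) p(X).
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X).
p_x_given_y = p_xy / p_y[None, :]
assert np.allclose(p_y_given_x,
                   p_x_given_y * p_y[None, :] / p_x[:, None])

# X and Y are independent iff p(X, Y) = p(X) p(Y); not true here.
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False
```

Note that Bayes' theorem holds for any joint table; independence is the special extra condition that the joint factorizes into the outer product of its marginals.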

1.2.1 Probability densities

If the probability of a real-valued variable $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\delta x$ for $\delta x \to 0$, then $p(x)$ is called the probability density over $x$. The probability that $x$ will lie in an interval $(a, b)$ is given by \[ p(x \in (a, b)) = \int_a^b p(x) dx \]


A formal justification of the sum and product rules for continuous variables requires a branch of mathematics called measure theory. Note that if $x$ is a discrete variable, then $p(x)$ is also called a probability mass function.

Under a nonlinear change of variable, a probability density function transforms differently from a simple function, due to the Jacobian factor. Consider a change of variable $x = g(y)$, then a function $f(x)$ becomes $\tilde{f}(y) = f(g(y))$. Now consider a probability density $p_x(x)$ in the variable $x$ that corresponds to a density $p_y(y)$ in the new variable $y$. Observations falling in the range $(x, x+\delta x)$ will be transformed into the range $(y, y+\delta y)$ where $p_x(x) \delta x \simeq p_y(y)\delta y$ and hence \[ p_y(y) = p_x(x) \left\lvert \frac{dx}{dy} \right\rvert = p_x(g(y)) \lvert g'(y) \rvert \] The probability that $x$ lies in the interval $(-\infty, z)$ is given by the cumulative distribution function \[ P(z) = \int_{-\infty}^z p(x) dx \]
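The transformation rule can be checked numerically. The example below (my own choice, not from the text) takes $p_x$ to be a standard Gaussian density and the monotone change of variable $x = g(y) = y^3 + y$; the Jacobian factor is exactly what keeps the transformed density normalized:

```python
import numpy as np

# p_x: standard Gaussian density; g: a monotone change of variable.
p_x = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
g = lambda y: y**3 + y
g_prime = lambda y: 3 * y**2 + 1

# Transformed density p_y(y) = p_x(g(y)) |g'(y)|.
p_y = lambda y: p_x(g(y)) * np.abs(g_prime(y))

# The Jacobian factor ensures p_y still integrates to one.
y = np.linspace(-5, 5, 200001)
dy = y[1] - y[0]
total = np.sum(p_y(y)) * dy
print(round(total, 6))   # ~1.0
```

Dropping the $|g'(y)|$ factor in `p_y` would make the integral come out wrong, which is exactly the difference between how densities and simple functions transform.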

1.2.2 Expectations and covariances

The average value of $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ and will be denoted by $\mathbb{E}[f]$: \[ \mathbb{E}[f] = \begin{cases} \sum_x p(x)f(x) & \text{(discrete)} \\ \int p(x)f(x)dx & \text{(continuous)} \end{cases} \] Given $N$ points drawn from $p(x)$, the expectation can be approximated by the sample average $\frac{1}{N}\sum_{n=1}^N f(x_n)$, which becomes exact in the limit $N \to \infty$. The conditional expectation with respect to a conditional distribution is given by \[ \mathbb{E}[f \mid y] = \sum_x p(x \mid y)f(x) \]

The variance is defined by \[ \text{var}[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2] \] Expanding out the square, we have \[\begin{align*} \text{var}[f] & = \mathbb{E}[f(x)^2 - 2f(x)\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2] \\ & = \mathbb{E}[f(x)^2] - 2\mathbb{E}[f(x)]^2 + \mathbb{E}[f(x)]^2 \\ & = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \end{align*}\]

For two random variables $x$ and $y$, the covariance is defined by \[ \text{cov}[x, y] = \mathbb{E}_{x, y}[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}] = \mathbb{E}_{x, y}[xy] - \mathbb{E}[x]\mathbb{E}[y] \] which expresses the extent to which $x$ and $y$ vary together. If $x$ and $y$ are independent, then the covariance vanishes.
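Both identities, $\text{var}[f] = \mathbb{E}[f^2] - \mathbb{E}[f]^2$ and $\text{cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$, can be checked on a small discrete joint distribution (the values below are a toy example of my own):

```python
import numpy as np

# Toy joint distribution over (x, y) pairs with probabilities p.
xy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 1]], dtype=float)
p = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
x, y = xy[:, 0], xy[:, 1]

e_x, e_y = np.sum(p * x), np.sum(p * y)

# var[x] = E[x^2] - E[x]^2
var_x = np.sum(p * (x - e_x)**2)
assert np.isclose(var_x, np.sum(p * x**2) - e_x**2)

# cov[x, y] = E[xy] - E[x] E[y]
cov_xy = np.sum(p * (x - e_x) * (y - e_y))
assert np.isclose(cov_xy, np.sum(p * x * y) - e_x * e_y)

print(var_x, cov_xy)
```

Here the covariance is small but nonzero, so $x$ and $y$ vary together slightly; independence would force it to vanish exactly.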

If we consider the covariance of the components of a vector $\mathbf{x}$ with each other, then we use the notation $\text{cov}[\mathbf{x}] \equiv \text{cov}[\mathbf{x}, \mathbf{x}]$.

1.2.3 Bayesian probabilities

So far we have viewed probabilities in terms of the frequencies of random, repeatable events. We shall refer to this as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.

We can apply Bayes' theorem when making inferences about quantities such as the parameters $\mathbf{w}$ in the polynomial curve fitting example. We capture our assumptions about $\mathbf{w}$, before observing the data, in the form of a prior probability distribution $p(\mathbf{w})$. The effect of the observed data $\mathcal{D}$ is expressed as $p(\mathcal{D} \mid \mathbf{w})$. Bayes' theorem, given by \[ p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})p(\mathbf{w})}{p(\mathcal{D})} \] allows us to evaluate the uncertainty in $\mathbf{w}$ in the form of the posterior probability $p(\mathbf{w} \mid \mathcal{D})$. The quantity $p(\mathcal{D} \mid \mathbf{w})$, viewed as a function of $\mathbf{w}$, is called the likelihood function.

Given the definition of likelihood, we can state Bayes' theorem in words \[ \text{posterior} \propto \text{likelihood} \times \text{prior} \] The denominator is the normalization constant and we have \[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})p(\mathbf{w})d\mathbf{w} \]

In a frequentist setting, $\mathbf{w}$ is considered to be a fixed parameter, as determined by an 'estimator', and error bars are obtained by considering the distribution of possible data sets $\mathcal{D}$. From the Bayesian viewpoint there is only a single data set $\mathcal{D}$, and the uncertainty in $\mathbf{w}$ is expressed through a probability distribution over $\mathbf{w}$.

A widely used frequentist estimator is maximum likelihood, in which $\mathbf{w} = \arg\max_\mathbf{w} p(\mathcal{D} \mid \mathbf{w})$. In the machine learning literature, the negative log of the likelihood function is called an error function.

1.2.4 The Gaussian distribution

For the case of a single real-valued variable $x$, the Gaussian distribution is defined by \[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\} \] governed by two parameters: the mean $\mu$ and the variance $\sigma^2$. The square root of the variance, $\sigma$, is called the standard deviation and the reciprocal of the variance, $\beta = 1/\sigma^2$, is called the precision.


We can show that the Gaussian is normalized: \[ \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, dx = 1 \] The first- and second-order moments are \[ \mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x\, dx = \mu, \;\;\; \mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2 \] so that $\text{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2$. The maximum of a distribution is known as its mode. For a Gaussian, the mode coincides with the mean.
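These moment identities can be verified by quadrature on a fine grid; the parameter values below are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical parameters; any mu, sigma > 0 would do.
mu, sigma = 1.5, 0.8

# Fine grid covering +/- 10 standard deviations around the mean.
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 400001)
dx = x[1] - x[0]
pdf = np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

norm = np.sum(pdf) * dx            # normalization: should be 1
mean = np.sum(pdf * x) * dx        # first moment: should be mu
second = np.sum(pdf * x**2) * dx   # second moment: should be mu^2 + sigma^2

print(norm, mean, second - mean**2)   # ~1, ~mu, ~sigma^2
```

The last printed quantity is $\mathbb{E}[x^2] - \mathbb{E}[x]^2$, i.e. the variance, recovering $\sigma^2$.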

The Gaussian distribution defined over a $D$-dimensional vector $\mathbf{x}$ is given by \[ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{\lvert \boldsymbol{\Sigma} \rvert^{1/2}} \exp \left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \] where $\boldsymbol{\mu}$ is the $D$-dimensional mean and $\boldsymbol{\Sigma}$ is the $D \times D$ covariance matrix.

Suppose we are given a data set of $N$ observations $\mathbf{x} = (x_1, \dots, x_N)^\top$, drawn i.i.d. from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ we would like to determine. The probability of the data set is \[ p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^N \mathcal{N}(x_n \mid \mu, \sigma^2) \] For the moment, we shall determine values for the unknown parameters $\mu$ and $\sigma^2$ by maximizing the likelihood function. The log likelihood can be written as \[ \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 - \frac{N}{2} \ln (\sigma^2) - \frac{N}{2}\ln(2\pi) \] Maximizing the log likelihood with respect to $\mu$, we obtain \[ \sum_{n=1}^N (x_n - \mu) = 0, \;\;\; \mu_\text{ML} = \frac{1}{N}\sum_{n=1}^N x_n \] which is the sample mean. Maximizing the log likelihood with respect to $\sigma$, we obtain \[ \frac{1}{\sigma^3} \sum_{n=1}^N(x_n - \mu_\text{ML})^2 - \frac{N}{\sigma} = 0, \;\;\; \sigma^2_\text{ML} = \frac{1}{N} \sum_{n=1}^N(x_n - \mu_\text{ML})^2 \] which is the sample variance.
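The closed-form estimators are one line each in code. Below is a minimal sketch on synthetic data (the true parameters and seed are my own choices):

```python
import numpy as np

# Synthetic data from a Gaussian with assumed mu = 2.0, sigma = 1.5.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10000)

# Maximum likelihood estimators: sample mean and (biased) sample variance.
mu_ml = data.mean()
var_ml = np.mean((data - mu_ml)**2)   # note the 1/N factor, not 1/(N-1)

print(mu_ml, var_ml)   # close to 2.0 and 1.5**2 = 2.25
```

With $N = 10000$ the estimates land close to the true values; the systematic shortfall of `var_ml` only becomes visible for small $N$, as discussed next.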

We shall show that the maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of bias and is related to the problem of over-fitting in the context of polynomial curve fitting. Consider the expectations of these quantities with respect to the data set values: \[ \begin{align*} \mathbb{E}[\mu_\text{ML}] & = \frac{1}{N} \sum_{n=1}^N \mathbb{E}[x_n] = \mu \\ \mathbb{E}[\sigma^2_\text{ML}] & = \mathbb{E} \left[ \frac{1}{N} \sum_{n=1}^N (x_n^2 - 2x_n\mu_\text{ML} + \mu_\text{ML}^2) \right] \\ & = \mathbb{E}[x_n^2] - \frac{2}{N} \mathbb{E}\left[ \mu_\text{ML} \sum_{n=1}^N x_n \right] + \mathbb{E}[\mu_\text{ML}^2] \\ & = \mathbb{E}[x_n^2] - \mathbb{E}[\mu_\text{ML}^2] && \text{(since $\textstyle\sum_n x_n = N\mu_\text{ML}$)} \\ & = \mu^2 + \sigma^2 - (\mu^2 + \text{var}[\mu_\text{ML}]) \\ & = \sigma^2 - \text{var}\left[\frac{1}{N}\sum_{n=1}^N x_n\right] \\ & = \sigma^2 - \frac{1}{N^2} \sum_{n=1}^N \text{var}[x_n] && \text{(since $x_n$ are i.i.d.)} \\ & = \sigma^2 - \frac{1}{N} \sigma^2 = \frac{N-1}{N} \sigma^2 \end{align*} \] The variance is systematically under-estimated because it is measured relative to the sample mean and not relative to the true mean. And we can see the following is an unbiased estimate for the variance parameter \[ \tilde{\sigma}^2 = \frac{N}{N-1} \sigma_\text{ML}^2 = \frac{1}{N-1} \sum_{n=1}^N (x_n - \mu_\text{ML})^2 \]
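The bias factor $(N-1)/N$ can be observed empirically by averaging $\sigma^2_\text{ML}$ over many synthetic data sets of small size (sizes, seed, and the true variance below are my own choices):

```python
import numpy as np

# Many small datasets of size N from a Gaussian with variance sigma2.
rng = np.random.default_rng(1)
N, sigma2, trials = 5, 4.0, 200000

data = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_ml = data.mean(axis=1, keepdims=True)
var_ml = np.mean((data - mu_ml)**2, axis=1)   # sigma^2_ML per dataset

print(var_ml.mean())                  # close to (N-1)/N * sigma2 = 3.2
print(var_ml.mean() * N / (N - 1))    # bias-corrected, close to 4.0
```

With $N = 5$ the shortfall is large (a factor of $4/5$), matching the derivation above; rescaling by $N/(N-1)$ recovers the true variance on average.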

1.2.5 Curve fitting re-visited

The goal in the curve fitting problem is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$. We can express our uncertainty over the value of the target variable using a probability distribution \[ p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1}) \] where $\beta$ is the precision. We now use the training data $\{ \mathbf{x}, \mathbf{t}\}$ to determine the values of the unknown parameters $\mathbf{w}$ and $\beta$ by maximum likelihood. If the data are assumed to be drawn i.i.d., then the likelihood function is given by \[ p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^N \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}) \] The log likelihood is in the form \[ \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^N (t_n - y(x_n, \mathbf{w}))^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) \] We therefore see that maximizing the likelihood (under the assumption of a Gaussian noise distribution) is equivalent, so far as determining $\mathbf{w}$ is concerned, to minimizing the sum-of-squares error function. Maximizing the log likelihood with respect to $\beta$ gives \[ -\frac{1}{2} \sum_{n=1}^N (t_n - y(x_n, \mathbf{w}_\text{ML}))^2 + \frac{N}{2\beta} = 0, \;\;\; \frac{1}{\beta_\text{ML}} = \frac{1}{N} \sum_{n=1}^N \{y(x_n, \mathbf{w}_\text{ML}) - t_n\}^2 \] Our predictions for new values of $x$ are expressed in terms of the predictive distribution over $t$: \[ p(t \mid x, \mathbf{w}_\text{ML}, \beta_\text{ML}) = \mathcal{N}(t \mid y(x, \mathbf{w}_\text{ML}), \beta_\text{ML}^{-1}) \]
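A minimal sketch of the ML solution on synthetic $\sin(2\pi x)$ data (the data sizes, polynomial order $M$, and noise level are my own choices): $\mathbf{w}_\text{ML}$ is the least-squares solution, and $1/\beta_\text{ML}$ is the mean squared residual.

```python
import numpy as np

# Synthetic data: t = sin(2*pi*x) + Gaussian noise of precision beta_true.
rng = np.random.default_rng(2)
N, M, beta_true = 200, 5, 25.0
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta_true**-0.5, N)

# Design matrix with rows (1, x_n, ..., x_n^M).
Phi = np.vander(x, M + 1, increasing=True)

# Maximizing the likelihood w.r.t. w reduces to least squares.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML is the mean squared residual of the fitted curve.
beta_ml_inv = np.mean((Phi @ w_ml - t)**2)
print(beta_ml_inv)   # roughly 1/beta_true = 0.04
```

The recovered noise variance sits close to the true $1/\beta = 0.04$, illustrating that the Gaussian noise assumption turns both parameter estimation steps into closed-form computations.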

Now we take a step towards a more Bayesian approach and introduce a prior distribution over $\mathbf{w}$. \[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} \mathbf{w}^\top\mathbf{w} \right\} \] where $M+1$ is the total number of elements in $\mathbf{w}$ for an $M$-th order polynomial. Using Bayes' theorem, we have \[ p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)p(\mathbf{w} \mid \alpha) \] We can determine $\mathbf{w}$ by finding its most probable value given the data, that is, by maximizing the posterior distribution. This is known as maximum a posteriori, or simply MAP. Taking the log of the posterior, we find \[ \begin{align*} \ln p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) & = \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \ln p(\mathbf{w} \mid \alpha) + \text{const} \\ & = -\frac{\beta}{2} \sum_{n=1}^N (t_n - y(x_n, \mathbf{w}))^2 - \frac{\alpha}{2} \mathbf{w}^\top\mathbf{w} + \text{const} \end{align*} \] where the constant collects all terms independent of $\mathbf{w}$. Finding the maximum of the posterior (with respect to $\mathbf{w}$) is equivalent to \[ \min_\mathbf{w} \frac{\beta}{2} \sum_{n=1}^N (t_n - y(x_n, \mathbf{w}))^2 + \frac{\alpha}{2} \mathbf{w}^\top\mathbf{w} \] Thus maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function with a regularization parameter $\lambda = \alpha/\beta$.
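Since the MAP objective is a regularized least-squares problem, its stationarity condition gives the closed form $\mathbf{w}_\text{MAP} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$ with $\lambda = \alpha/\beta$, i.e. ridge regression. A sketch on synthetic data (the values of $\alpha$, $\beta$, $N$, and $M$ are my own choices):

```python
import numpy as np

# Synthetic sin(2*pi*x) data with Gaussian noise of precision beta.
rng = np.random.default_rng(3)
N, M, alpha, beta = 30, 9, 5e-3, 25.0
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta**-0.5, N)

Phi = np.vander(x, M + 1, increasing=True)   # rows (1, x_n, ..., x_n^M)
lam = alpha / beta                           # regularization parameter

# Ridge / MAP closed form: (lam*I + Phi^T Phi) w = Phi^T t.
w_map = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

# Check: the gradient of beta/2 * RSS + alpha/2 * w^T w vanishes at w_MAP.
grad = beta * Phi.T @ (Phi @ w_map - t) + alpha * w_map
assert np.allclose(grad, 0, atol=1e-6)
```

The vanishing-gradient check confirms that this single linear solve is exactly the minimizer of the regularized sum-of-squares objective above.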

1.2.6 Bayesian curve fitting

So far we have still made a point estimate of $\mathbf{w}$, which does not yet amount to a Bayesian treatment. In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires integrating over all values of $\mathbf{w}$. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

Given the training data $\mathbf{x}$ and $\mathbf{t}$ and a new data point $x$, our goal is to estimate the value of $t$ by evaluating the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$. We assume the parameters $\alpha$ and $\beta$ are fixed and known in advance. \[ p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}) d\mathbf{w} \] We shall see in Section 3.3 that the posterior $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})$ and the integration can be evaluated analytically, and the predictive distribution is given by \[ p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}(t \mid m(x), s^2(x)) \] where \[ \begin{align*} m(x) & = \beta\phi(x)^\top \mathbf{S} \sum_{n=1}^N \phi(x_n)t_n \\ s^2(x) & = \beta^{-1} + \phi(x)^\top \mathbf{S}\phi(x) \\ \mathbf{S}^{-1} & = \alpha\mathbf{I} + \beta \sum_{n=1}^N \phi(x_n)\phi(x_n)^\top \\ \phi(x) & = (x^0, x^1, \dots, x^M)^\top \end{align*} \] The first term in $s^2(x)$ represents the noise on the target variable and the second term arises from the uncertainty in $\mathbf{w}$.
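The predictive mean and variance can be computed directly from these formulas. The sketch below uses synthetic $\sin(2\pi x)$ data with assumed-known $\alpha$ and $\beta$ (all numeric values are my own choices):

```python
import numpy as np

# Synthetic training data with noise precision beta; alpha, beta fixed.
rng = np.random.default_rng(4)
N, M, alpha, beta = 100, 9, 5e-3, 25.0
x_train = rng.uniform(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, beta**-0.5, N)

def phi(x):
    return np.array([x**m for m in range(M + 1)])   # (x^0, ..., x^M)

Phi = np.stack([phi(xn) for xn in x_train])         # N x (M+1)

# S^{-1} = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predictive(x):
    px = phi(x)
    m = beta * px @ S @ Phi.T @ t_train             # predictive mean m(x)
    s2 = 1.0 / beta + px @ S @ px                   # predictive variance s^2(x)
    return m, s2

m0, s20 = predictive(0.5)
print(m0, s20)   # m(0.5) should be near sin(pi) = 0; s2 exceeds 1/beta
```

Note how the two variance terms behave: $\phi(x)^\top\mathbf{S}\phi(x)$ shrinks as more data accumulate, while the $\beta^{-1}$ noise floor remains, so the predictive variance never falls below $1/\beta$.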

Copyright © 2017-21 Wufei Ma