The Gaussian Distribution

If you have spent some time in the Machine Learning world, you might have noticed that the Gaussian or Normal distribution appears with great frequency. In this great Probabilistic Machine Learning Course{:target="_blank"}, Professor Philipp Hennig spends an entire lecture just on the Gaussian distribution and answers the question raised in the title. This post summarizes some of the Gaussian's nice properties and illustrates them with some visualizations.

TLDR: Gaussians are extremely useful because of their very convenient mathematical properties.

Ideally, this set of notes will serve as a small basis for more complex and interesting topics in the realm of Bayesian Machine Learning and Deep Learning.

{::options parse_block_html="true" /}

<details>

<summary>

Python Import Statements

</summary>

import os
import glob
import math
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.gridspec import GridSpec

</details>

{::options parse_block_html="false" /}

1. The Univariate Gaussian Distribution

1.1. Definition

\[

\begin{equation} \mathcal{N}(x;\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \end{equation}

\]

Often you will see the notation x ∼ 𝒩(μ, σ²), which means that the random variable x is distributed according to the normal distribution parameterized by μ and σ². In code, we can write the univariate Gaussian as follows:

def univar_normal(x, mu, var):
    """
    computes the pdf of the univariate normal distribution
    Inputs:
        x: data
        mu: mean of normal
        var: variance of normal
    """
    return ((1. / np.sqrt(2 * np.pi * var)) *
            np.exp(-(x - mu)**2 / (2 * var)))

Some plots might give an intuition about how μ changes the location, and σ² the shape, of the normal distribution.

{::options parse_block_html="true" /}

<details>

<summary>

Visualization Code

</summary>

x = np.linspace(-6, 6, num=100)
pdf1 = univar_normal(x, 0, 1)
pdf2 = univar_normal(x, -1, 0.5)
pdf3 = univar_normal(x, 2, 3)

fig = plt.figure()
plt.plot(x, pdf1, label='mu=0, var=1')
plt.plot(x, pdf2, label='mu=-1, var=0.5')
plt.plot(x, pdf3, label='mu=2, var=3')
plt.xlabel('x')
plt.ylabel('pdf')
plt.title('Univariate Gaussian with different parameters')
plt.legend()

</details>

{::options parse_block_html="false" /}

Looking at the figure, we can observe that μ decides where on the x-axis the mean and mode of the distribution occur. Furthermore, lower variance values make the distribution more peaked, so values vary less around the mean, while larger variance values spread the distribution out.
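As a quick sanity check, we can also draw samples and confirm that their empirical mean and variance match the chosen parameters (a small sketch added here for illustration, with arbitrary parameter values):

rng = np.random.default_rng(0)
mu, var = 2.0, 3.0
# numpy's normal() expects the standard deviation, hence the square root of the variance
samples = rng.normal(loc=mu, scale=np.sqrt(var), size=100_000)
print(samples.mean())  # approximately 2.0
print(samples.var())   # approximately 3.0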

2. The Multivariate Gaussian Distribution

2.1. Definition

The Multivariate Gaussian Distribution is defined as follows:

\[

\begin{equation*} \mathcal{N}(\mathbf{x};\mathbf{\mu}, \Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right) \end{equation*}

\]

which we read as the random vector x being distributed according to a normal distribution with parameters μ and Σ, where x, μ ∈ ℝⁿ and Σ ∈ ℝⁿˣⁿ. Furthermore, Σ has to be symmetric positive definite.

A matrix Σ is called symmetric positive (semi-)definite if Σ is symmetric (Σ = Σᵀ) and aᵀΣa ≥ 0 for all a ∈ ℝⁿ (with strict inequality for all a ≠ 0 in the positive definite case).

So for any vector a ∈ ℝⁿ the above quadratic form must yield a non-negative number. Equivalently, we can say that Σ's eigenvalues are non-negative; for the density above to exist, Σ must in fact be positive definite, i.e. have strictly positive eigenvalues, so that Σ⁻¹ exists.
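As a small illustrative check (a sketch with an arbitrarily chosen matrix, not from the original notes), we can verify symmetry and inspect the eigenvalues with numpy:

cov = np.array([
    [2.0, 0.5],
    [0.5, 1.0]
])
print(np.allclose(cov, cov.T))  # True: the matrix is symmetric
print(np.linalg.eigvalsh(cov))  # all eigenvalues positive, so cov is positive definite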

Additionally, we note that, as a probability distribution, ∫𝒩(x; μ, Σ) dx = 1 and 𝒩(x; μ, Σ) > 0 for all x ∈ ℝⁿ. Σ is called the covariance matrix, while Σ⁻¹ is known as the precision matrix.

Implemented in python, a multivariate Gaussian looks as follows:

def multivar_normal(x, mu, cov):
    """
    computes the pdf of the multivariate normal distribution
    Args:
        x: data point, vector of shape dx1
        mu: mean vector of shape dx1
        cov: covariance matrix of shape dxd
    """
    d = mu.shape[0]
    x_m = x - mu
    return (1. / (np.sqrt((2 * np.pi)**d * np.linalg.det(cov))) *
            np.exp(-0.5*(x_m.T @ np.linalg.inv(cov) @ x_m)))

The following function creates a meshgrid that allows us to plot the pdf of a multivariate Gaussian over two dimensions.

{::options parse_block_html="true" /}

<details>

<summary>

Creating a mesh grid to visualize a 2D Gaussian

</summary>

def create_mesh_grid_gaussian(mu, cov):
    """
    Create mesh grid over the X and Y dimensions, with Z being the pdf of the multivariate normal
    Args:
        mu: mean of a 2d gaussian
        cov: covariance matrix of 2d gaussian
    """
    n = 100
    mu_x, mu_y = mu[0][0], mu[1][0]
    sigma_x, sigma_y = cov[0,0], cov[1,1]  # diagonal entries (variances), used here to set the plot range
    x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n)
    y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n)
    X, Y = np.meshgrid(x,y)

    pdf = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            pdf[i,j] = multivar_normal(
                np.matrix([[X[i,j]], [Y[i,j]]]),
                mu, cov)

    return X, Y, pdf

</details>

{::options parse_block_html="false" /}

2.2. The Covariance Matrix

In the multivariate case, the mean and the covariance also determine the location and shape of the distribution, respectively. However, it might be a little less intuitive how the covariance matrix influences the shape when considering two or more dimensions. It is hence worth looking at some examples that hopefully give some intuition.

We can visualize a two-dimensional Gaussian with either 3D plots or contour plots. Considering the 3D plots on the left, the respective contour plot is a bird's-eye view from directly above. In a contour plot, any two points on a given contour line have the same density value. The dashed horizontal and vertical lines on the right plot show the x and y axes.

The simplest case is a unit Gaussian, which implies a mean at the origin and the identity matrix as the covariance so that both components x and y have a variance of 1.

Next, we can scale the individual components, which will "stretch" the distribution depending on the values and create an ellipsoid shape. Notice, however, that the distribution is still aligned with the axes.

If we move away from a diagonal matrix and insert non-zero off-diagonal entries, we imply that the respective components are not independent, meaning that if one changes, the other tends to change as well. This causes the distribution to change its orientation, so it is no longer aligned with the axes.
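The following snippet is a small sketch (not part of the original code) that uses create_mesh_grid_gaussian from above to produce contour plots for the three cases just described: identity covariance, a scaled diagonal covariance, and a covariance with non-zero off-diagonal entries.

mu = np.array([[0], [0]])
covs = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # unit Gaussian: circular, axis-aligned contours
    np.array([[3.0, 0.0], [0.0, 0.5]]),  # scaled diagonal: stretched, but still axis-aligned
    np.array([[1.0, 0.8], [0.8, 1.0]]),  # non-zero off-diagonal entries: rotated ellipses
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, cov in zip(axes, covs):
    X, Y, pdf = create_mesh_grid_gaussian(mu, cov)
    ax.contour(X, Y, pdf, levels=10)
    ax.axhline(0, linestyle='--', color='gray', alpha=0.5)
    ax.axvline(0, linestyle='--', color='gray', alpha=0.5)
plt.show()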

3. Properties of Multivariate Gaussians

3.1. Connection to Exponential Families

One nice feature of Gaussians is that they are part of the Exponential Family. Recall that a distribution is in the exponential family if it can be manipulated into the following form: \(p(x) = h(x)\exp(\eta(\theta)^T T(x)-A(\eta))\).

The general recipe for showing that a distribution belongs to the Exponential family is the following:

  1. Identify the parameters of the given distribution
  2. Identify the support of the random variable
  3. Take the log probability
  4. Identify h(x), the sufficient statistic T(x), the natural parameters η(θ), and the log partition function A(η), where x is the random variable and θ represents the original distribution parameters.
  5. Rewrite the distribution in terms of the natural parameters.

Following these steps:

  1. The parameters are μ and Σ
  2. The support is all of ℝⁿ, so x ∈ ℝⁿ
  3. The term \(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\) captures an interaction between the sufficient statistic and the parameter; however, we need a scalar product between parameters and sufficient statistics. Therefore, we apply the trace trick \(\mathbf{x}^T\Sigma^{-1}\mathbf{x} = \mathrm{Tr}(\Sigma^{-1}\mathbf{x}\mathbf{x}^T) = \mathrm{vec}(\Sigma^{-1})^T\mathrm{vec}(\mathbf{x}\mathbf{x}^T)\):

"vec" indicates the stacking of columns to obtain a vector, which means that we now have a scalar product between the parameters and sufficient statistics.

  4. Then, we can identify the following: \[

    \begin{align*} h(x) &= (2\pi)^{-\frac{D}{2}} \\ \eta_1^T &= \mathbf{\mu}^T\Sigma^{-1}\\ \eta_2 &= -\frac{1}{2}\Sigma^{-1}\\ T_1(x) &= \mathbf{x} \\ T_2(x) &= \mathbf{x}\mathbf{x}^T\\ A(\eta)&=-C(\mathbf{\mu}, \Sigma)=\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu}+\frac{1}{2}\log|\Sigma| \end{align*}

    \]

  5. Now, we need to solve for the parameters μ and Σ, so that we can rewrite the distribution in terms of its natural parameters.

\[

\begin{align*} \eta_1^T=\mathbf{\mu}^T\Sigma^{-1} &\Rightarrow \eta_1^T\Sigma=\mathbf{\mu}^T \Rightarrow -\frac{1}{2}\eta_1^T\eta_2^{-1}=\mathbf{\mu}^T \Rightarrow \mathbf{\mu}=-\frac{1}{2}\eta_2^{-T}\eta_1 \\ \eta_2=-\frac{1}{2}\Sigma^{-1} &\Rightarrow \Sigma=-\frac{1}{2}\eta_2^{-1} \Rightarrow \Sigma^{-1}=-2\eta_2 \end{align*}

\]

Now, we can use these equations and substitute into log partition function.

\[

\begin{align*} A(\mathbf{\eta})&=-C(\mathbf{\mu}, \Sigma)=\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu}+\frac{1}{2}\log|\Sigma| \\ &= \frac{1}{2}\eta_1^T(-\frac{1}{2}\eta_2^{-T}\eta_1)+\frac{1}{2}\log\det(-\frac{1}{2}\eta_2^{-1})\\ &= -\frac{1}{4}\eta_1^T\eta_2^{-T}\eta_1-\frac{1}{2}\log\det(-2\eta_2) \end{align*}

\]

Then, the rewritten exponential form is: \[

\begin{align*} p(\mathbf{x}|\mathbf{\eta})=(2\pi)^{-\frac{D}{2}} \exp(\eta_1^T\mathbf{x}+Tr(\eta_2(\mathbf{x}\mathbf{x}^T))-A(\mathbf{\eta})) \end{align*}

\]
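As a quick numeric sanity check (a sketch added here, not part of the original derivation), we can confirm that this exponential-family form evaluates to the same value as the multivar_normal function defined earlier:

def expfam_normal(x, mu, cov):
    """evaluates the exponential-family form of the multivariate normal pdf"""
    D = mu.shape[0]
    prec = np.linalg.inv(cov)
    eta1 = prec @ mu                # natural parameter eta_1 (as a column vector)
    eta2 = -0.5 * prec              # natural parameter eta_2
    A_eta = 0.5 * mu.T @ prec @ mu + 0.5 * np.log(np.linalg.det(cov))  # log partition function
    return ((2 * np.pi)**(-D / 2) *
            np.exp(eta1.T @ x + np.trace(eta2 @ (x @ x.T)) - A_eta))

x = np.array([[0.3], [-1.2]])
mu = np.array([[0.0], [0.5]])
cov = np.array([[1.0, 0.4], [0.4, 2.0]])
print(expfam_normal(x, mu, cov).item())   # both forms give the same density value
print(multivar_normal(x, mu, cov).item())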

The exponential family has several useful properties that the Wikipedia{:target="_blank"} article summarizes nicely. One central property of exponential families is that all members have conjugate priors, which is especially desirable for Bayesian methods, as conjugate priors keep the posterior distribution tractable. Furthermore, it is straightforward to derive the moments of the distribution, as these are just derivatives of the log partition function A(η).
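To make the last point concrete, here is a small univariate sketch (added for illustration): for the 1D Gaussian, the natural parameters are η₁ = μ/σ² and η₂ = −1/(2σ²) with log partition A(η) = −η₁²/(4η₂) − ½log(−2η₂), and finite differences of A recover the first two moments E[x] = μ and E[x²] = μ² + σ².

def log_partition(eta1, eta2):
    # log partition function A(eta) of the univariate Gaussian in natural parameters
    return -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)

mu, var = 1.5, 2.0
eta1, eta2 = mu / var, -1 / (2 * var)
eps = 1e-6

# dA/d(eta1) approximates E[x] = mu
dA_deta1 = (log_partition(eta1 + eps, eta2) - log_partition(eta1 - eps, eta2)) / (2 * eps)
# dA/d(eta2) approximates E[x^2] = mu^2 + var
dA_deta2 = (log_partition(eta1, eta2 + eps) - log_partition(eta1, eta2 - eps)) / (2 * eps)

print(dA_deta1)  # approximately 1.5
print(dA_deta2)  # approximately 1.5**2 + 2.0 = 4.25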

3.2. Products of Gaussians are Gaussians

This property implies that when multiplying two or more Gaussians, the result is again a Gaussian, albeit unnormalized. Thus, if we consider

\(\mathcal{N}(\mathbf{x}\,|\,\mathbf{\mu}_a,\Sigma_a)\) and \(\mathcal{N}(\mathbf{x}\,|\,\mathbf{\mu}_b,\Sigma_b)\) (note that both distributions have x as their random variable), then their product is the following:

\[

\begin{align*} \mathcal{N}(\mathbf{x};\mathbf{\mu}_a, \Sigma_a)\,\mathcal{N}(\mathbf{x};\mathbf{\mu}_b, \Sigma_b) &= z\,\mathcal{N}(\mathbf{x};\mathbf{\mu}_c, \Sigma_c) \\ \Sigma_c &= (\Sigma_a^{-1}+\Sigma_b^{-1})^{-1} \\ \mathbf{\mu}_c &= \Sigma_c(\Sigma_a^{-1}\mathbf{\mu}_a + \Sigma_b^{-1}\mathbf{\mu}_b) \\ z &= \mathcal{N}(\mathbf{\mu}_a;\mathbf{\mu}_b, \Sigma_a+\Sigma_b) \end{align*}

\]
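Before visualizing, a quick numeric check (a small sketch using multivar_normal from above, not part of the original post) confirms this identity at an arbitrary point:

x = np.array([[0.7], [-0.3]])
mu_a, mu_b = np.array([[0.0], [0.0]]), np.array([[1.0], [0.5]])
cov_a = np.array([[1.0, 0.7], [0.7, 1.0]])
cov_b = np.array([[1.0, -0.5], [-0.5, 1.0]])

prec_a, prec_b = np.linalg.inv(cov_a), np.linalg.inv(cov_b)
cov_c = np.linalg.inv(prec_a + prec_b)
mu_c = cov_c @ (prec_a @ mu_a + prec_b @ mu_b)
z = multivar_normal(mu_a, mu_b, cov_a + cov_b)   # normalization constant of the product

lhs = multivar_normal(x, mu_a, cov_a) * multivar_normal(x, mu_b, cov_b)
rhs = z * multivar_normal(x, mu_c, cov_c)
print(np.allclose(lhs, rhs))   # True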

For visualization purposes, we can consider two different Gaussian joint distributions, each of the following form:

\[

\begin{equation*} \begin{bmatrix} x \\ y\end{bmatrix} \sim \mathcal{N}(\mathbf{\mu}, \Sigma)=\mathcal{N}(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{yy}\end{bmatrix}) \end{equation*}

\]

The following code snippet shows how we can compute the product of two 2D-Gaussians and what this visually looks like.

{::options parse_block_html="true" /}

<details>

<summary>

Product of two gaussians

</summary>

def prod_2D_gaussian(mu_list, cov_list):
    """
    Computes the product of two 2D Gaussians
    Args:
        mu_list: list that contains mu for each respective Gaussian
        cov_list: list that contains cov for each respective Gaussian
    Return:
        X, Y, pdf: mesh grid and pdf of the (renormalized) product Gaussian
    """
    mu1, mu2 = mu_list
    cov1, cov2 = cov_list

    precs1 = np.linalg.inv(cov1)
    precs2 = np.linalg.inv(cov2)

    cov3=np.linalg.inv(precs1+precs2)
    mu3=cov3 @ (precs1 @ mu1 + precs2 @ mu2)
    # normalization constant z = N(mu1; mu2, cov1 + cov2) of the product; not needed for plotting its shape
    z3 = np.power(np.linalg.det(2*np.pi*(cov1+cov2)), (-0.5)) * np.exp(-0.5*(mu1-mu2).T @ np.linalg.inv(cov1+cov2) @ (mu1-mu2))

    return create_mesh_grid_gaussian(mu3, cov3)

def plot_2D_densities(pdfs, Xs, Ys):
    """
    plots the densities of multiple gaussians in one figure
    Args:
        pdfs: list of pdfs of gaussians
        Xs: list of X grid values
        Ys: list of Y grid values
    """
    color = ["orange", "blue", "black"]
    fig = plt.plot()
    for idx, pdf in enumerate(pdfs):
        plt.contour(Xs[idx], Ys[idx], pdf, levels=3, colors=color[idx], alpha=0.5)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()


cov1 = np.array([
    [1, 0.7],
    [0.7, 1]
])

mu1 = np.array([[0], [0]])

cov2 = np.array([
    [1, -0.5],
    [-0.5, 1]
])

mu2 = np.array([[0], [0]])

mu_list = [mu1, mu2]
cov_list = [cov1, cov2]
x1, y1, pdf1 = create_mesh_grid_gaussian(mu1, cov1)
x2, y2, pdf2 = create_mesh_grid_gaussian(mu2, cov2)
x3, y3, pdf3 = prod_2D_gaussian(mu_list, cov_list)

pdfs = [pdf1, pdf2, pdf3]
Xs = [x1, x2, x3]
Ys = [y1, y2, y3]

plot_2D_densities(pdfs, Xs, Ys)

</details>

{::options parse_block_html="false" /}

3.3. Affine Transformations of Gaussians are Gaussians

An affine transformation is a geometric transformation that preserves lines, parallelism and ratios of distances. Consider x ∼ 𝒩(μ, Σ); then Ax + b ∼ 𝒩(Aμ + b, AΣAᵀ) for any μ ∈ ℝⁿ, Σ ∈ ℝⁿˣⁿ and any A ∈ ℝᵐˣⁿ, b ∈ ℝᵐ. This means that we can project the joint probability of two or more variables to a lower-dimensional space and the distribution in that space will still be a Gaussian. A special case of an affine transformation that occurs very frequently when working with Gaussians is marginalization.
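A quick empirical sketch (with arbitrarily chosen A, b, μ and Σ, added here for illustration) confirms this: the sample mean and covariance of the transformed samples match Aμ + b and AΣAᵀ.

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[1.0, 2.0]])   # projects the 2D Gaussian down to 1D
b = np.array([0.5])

samples = rng.multivariate_normal(mu, cov, size=200_000)
transformed = samples @ A.T + b

print(transformed.mean(axis=0))   # approximately A @ mu + b = [2.5]
print(np.cov(transformed.T))      # approximately A @ cov @ A.T = 7.2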

3.4. Marginals of Gaussians are Gaussians

Recall the definition of the joint distribution in a previous section. When marginalizing, we are integrating out all other variables except the variable whose marginal distribution we are interested in. So if we consider our joint distribution z = [x, y]ᵀ from before and let A = (1, 0), b = 0, then applying the previous equation for affine transformations, we obtain the marginal

\[

\begin{align*} A\mathbf{z}+b\sim\mathcal{N}(A\mathbf{\mu}+b, A\Sigma A^T)=\mathcal{N}(\mu_x, \Sigma_{xx}) \end{align*}

\]

Essentially, for the case of multivariate Gaussians, obtaining the marginal distribution of one particular variable comes down to simply selecting the variable of interest and "dropping all others". So if we had a k-dimensional space, we could define an affine transformation in the form of a one-hot encoded row vector, where only the entry of the dimension we are interested in is 1 and all other entries are 0. Then, applying this transformation to the joint multivariate distribution, we would obtain the marginal distribution for this particular variable. For other distributions, marginalization most likely requires difficult integrals that are often intractable and can only be approximated.
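A short sketch of this idea (with an arbitrary 3D example, not taken from the original notes): a one-hot row vector applied to μ and Σ directly selects the marginal mean and variance of a single dimension.

mu = np.array([1.0, -2.0, 0.5])
cov = np.array([
    [2.0, 0.3, 0.1],
    [0.3, 1.0, 0.4],
    [0.1, 0.4, 3.0]
])

A = np.array([[0.0, 1.0, 0.0]])   # one-hot row vector selecting the second dimension

marg_mu = A @ mu                  # equals mu[1] = -2.0
marg_var = A @ cov @ A.T          # equals cov[1, 1] = 1.0
print(marg_mu, marg_var)          # the marginal of the second variable is N(-2.0, 1.0)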

Based on code found here{:target="_blank"}, the figure below shows the two marginal distributions for a given joint distribution on the x and y axes, respectively.

{::options parse_block_html="true" /}

<details>

<summary>

Marginalization of X and Y from their joint distribution

</summary>

n=100
cov = np.array([
    [1, 0],
    [0, 5]
])

mu = np.array([[0],[0]])

mu_x, mu_y = mu[0][0], mu[1][0]
sigma_x, sigma_y = cov[0,0], cov[1,1]

X, Y, pdf = create_mesh_grid_gaussian(mu, cov)

min_x = np.min(X)
max_x = np.max(X)
min_y = np.min(Y)
max_y = np.max(Y)

fig = plt.figure(figsize=(7, 7))
gs = GridSpec(
    2, 2, width_ratios=[2, 1], height_ratios=[1, 2])

plt.suptitle('Marginal distributions', y=0.93)

# plot contour in the bottom left
ax1 = plt.subplot(gs[2])
con1 = ax1.contourf(X, Y, pdf, levels=10)
ax1.set_xlabel('$x$', fontsize=13)
ax1.set_ylabel('$y$', fontsize=13)
ax1.yaxis.set_label_position('right')
ax1.xaxis.set_label_position('top')


# Plot x in top left
ax2 = plt.subplot(gs[0])
x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n)
px = univar_normal(x, mu_x, sigma_x)

ax2.plot(x, px, 'r--', label=f'$p(x)$')
ax2.legend(loc=0)
ax2.set_ylabel('density', fontsize=13)
ax2.yaxis.set_label_position('right')
ax2.set_xlim(math.floor(min_x), math.ceil(max_x))

# Plot bottom right
ax3 = plt.subplot(gs[3])
y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n)
py = univar_normal(y, mu_y, sigma_y)

ax3.plot(py, y, 'b--', label=f'$p(y)$')
ax3.legend(loc=0)
ax3.set_xlabel('density', fontsize=13)
ax3.set_ylim(math.floor(min_y), math.ceil(max_y))

# plot top right
ax4 = plt.subplot(gs[1])
ax4.set_visible(False)
divider = make_axes_locatable(ax4)
cax = divider.append_axes('left', size='20%', pad=0.05)
cbar = fig.colorbar(con1, cax=cax)
cbar.ax.set_ylabel('density: $p(x, y)$', fontsize=13)
plt.savefig('gaussMarginal.jpg')

</details>

{::options parse_block_html="false" /}

3.5. Conditionals of Gaussians are Gaussians

Especially in Bayesian Statistics, conditional distributions are used very frequently. And it turns out that when working with Gaussians, conditioning also becomes extremely simple. Recall again, the joint distribution we defined earlier.

\[

\begin{align*} \begin{bmatrix} x \\ y\end{bmatrix} \sim \mathcal{N}(\mathbf{\mu}, \Sigma)=\mathcal{N}(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{yy}\end{bmatrix}) \end{align*}

\]

Then, the conditional distribution of x given y is defined as follows: \[

\begin{align*} p(x|y)=\mathcal{N}(\mu_{x|y}, \Sigma_{x|y}) \end{align*}

\] , where \[

\begin{align*} \Sigma_{x|y}&=\Sigma_{xx}-\Sigma_{yx}\Sigma_{yy}^{-1}\Sigma_{xy}\\ \mu_{x|y}&=\mu_x+\Sigma_{yx}\Sigma_{yy}^{-1}(y-\mu_y) \end{align*}

\]
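As a small numeric sketch (with an arbitrarily chosen joint distribution, added here for illustration), these formulas can be evaluated directly:

mu_x, mu_y = 0.0, 0.0
cov = np.array([
    [1.0, 2.0],
    [2.0, 5.0]
])
sigma_xx, sigma_xy, sigma_yx, sigma_yy = cov[0, 0], cov[0, 1], cov[1, 0], cov[1, 1]

y_obs = 1.0   # value of y we condition on

# conditional parameters of p(x | y = y_obs)
mu_x_given_y = mu_x + sigma_yx / sigma_yy * (y_obs - mu_y)
var_x_given_y = sigma_xx - sigma_yx / sigma_yy * sigma_xy

print(mu_x_given_y)    # 0.4
print(var_x_given_y)   # 0.2, so conditioning on y sharpens the distribution over x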

The full proof is more involved but can be found here{:target="_blank"}. However, a visualization might again demonstrate this more clearly. Based on the following code snippet{:target="_blank"}, we can visualize that, when conditioning on x in the top figure (on y in the bottom figure), the resulting distribution (shown in black) is still a normal distribution.

{::options parse_block_html="true" /}

<details>

<summary>

Conditional 2D Gaussian in 3D

</summary>

def plot_3D_gauss_with_plane(mu, cov, x_plane, t):
    """
    plot gaussian with a plane and gaussian slice
    Args:
        mu: mean vector of the gaussian
        cov: covariance matrix of the gaussian
        x_plane: x-value that we want to condition on and place the plane
        t: time step index that can be used to save a figure for each position of the moving plane
    """
    mu_x, mu_y = mu[0][0], mu[1][0]
    sigma_x, sigma_y = cov[0,0], cov[1,1]

    #Create grid and multivariate normal
    n=100
    x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n)
    y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n)
    X, Y = np.meshgrid(x,y)
    Z = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Z[i,j] = multivar_normal(
                np.matrix([[X[i,j]], [Y[i,j]]]),
                mu, cov)

    # Create plane
    x_p = x_plane
    y_p = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=100)
    z_p = np.linspace(0,0.02,500)
    Y_p, Z_p = np.meshgrid(y_p, z_p)

    # Find the index of the x grid value closest to x_p
    idx_x_p = np.argmin(np.abs(x - x_p))

    # Select the corresponding values of X, Y, Z (carefully switch X and Y)
    x_c, y_c, z_c = Y[idx_x_p], X[idx_x_p], Z[idx_x_p]

    # Plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,linewidth=0,zorder=0)
    ax.plot_surface(x_p, Y_p, Z_p, color='black',linewidth=0, alpha=0.5,zorder=10)
    ax.plot(x_c,y_c,z_c,zorder=5, color='black')

    ax.view_init(elev=20)

cov = np.array([
    [1, 0],
    [0, 1]
])

mu = np.array([[0],[0]])

xs = np.linspace(-6, 6, 100)
for t, x in enumerate(xs):
    plot_3D_gauss_with_plane(mu, cov, x, t)

</details>

{::options parse_block_html="false" /}

<img align="center" src="../assets/images/2021_11_16/slicingGaussianX.gif">
<img align="center" src="../assets/images/2021_11_16/slicingGaussianY.gif">

We can do a similar visualization just with the contour plot, where the dashed line shows what we condition on.

{::options parse_block_html="true" /}

<details>

<summary>

Conditional Gaussian 2D contour plot

</summary>

def visualize_contour_conditioning(mu, cov, x_cond, y_cond, t):
    """
    function to visualize the contour of the joint probability and the conditional distributions
    Args:
        mu: mean of the joint distribution
        cov: covariance of the joint distribution
        x_cond: float value to condition x on
        y_cond: float value to condition y on
        t: time step value to save the figure
    """
    mu_x, mu_y = mu[0][0], mu[1][0]
    sigma_xx, sigma_yx, sigma_xy, sigma_yy = cov[0,0], cov[0,1], cov[1,0], cov[1,1]

    mu_x_given_y = mu_x + sigma_yx * (1/sigma_yy) * (y_cond-mu_y)
    cov_x_given_y = sigma_xx - sigma_yx * (1/sigma_yy) * sigma_yx

    mu_y_given_x = mu_y + sigma_yx * (1/sigma_xx) * (x_cond-mu_x)
    cov_y_given_x = sigma_yy - sigma_yx * (1/sigma_xx) * sigma_yx

    X, Y, pdf = create_mesh_grid_gaussian(mu, cov)

    min_x = np.min(X)
    max_x = np.max(X)
    min_y = np.min(Y)
    max_y = np.max(Y)

    fig = plt.figure(figsize=(7, 7))
    gs = GridSpec(
        2, 2, width_ratios=[2, 1], height_ratios=[1, 2])
    plt.suptitle('Conditional distributions', y=0.93)

    # plot the contour bottom left
    ax1 = plt.subplot(gs[2])
    con1 = ax1.contourf(X, Y, pdf, levels=10)

    # horizontal dashed line at the y value that is conditioned upon
    ax1.plot([math.floor(min_x), math.ceil(max_x)], [y_cond, y_cond], 'r--', alpha=0.5)
    # vertical dashed line at the x value that is conditioned upon
    ax1.plot([x_cond, x_cond], [math.floor(min_y), math.ceil(max_y)], 'b--', alpha=0.5)

    ax1.set_xlabel('$x$', fontsize=13)
    ax1.set_ylabel('$y$', fontsize=13)
    ax1.yaxis.set_label_position('right')
    ax1.xaxis.set_label_position('top')


    # Plot x in top left
    ax2 = plt.subplot(gs[0])
    x = np.linspace(mu_x - 3*sigma_xx, mu_x + 3*sigma_xx, num=n)
    px = univar_normal(x, mu_x_given_y, cov_x_given_y)
    ax2.plot(x, px, 'r--', label=f'$p(x|y)$')
    ax2.legend(loc=0)
    ax2.set_ylabel('density', fontsize=13)
    ax2.yaxis.set_label_position('right')
    ax2.set_xlim(math.floor(min_x), math.ceil(max_x))

    # Plot bottom right
    ax3 = plt.subplot(gs[3])
    y = np.linspace(mu_y - 3*sigma_yy, mu_y + 3*sigma_yy, num=n)
    py = univar_normal(y, mu_y_given_x, cov_y_given_x)
    ax3.plot(py, y, 'b--', label=f'$p(y|x)$')
    ax3.legend(loc=0)
    ax3.set_xlabel('density', fontsize=13)
    ax3.set_ylim(math.floor(min_y), math.ceil(max_y))

    # plot top right
    ax4 = plt.subplot(gs[1])
    ax4.set_visible(False)
    divider = make_axes_locatable(ax4)
    cax = divider.append_axes('left', size='20%', pad=0.05)
    cbar = fig.colorbar(con1, cax=cax)
    cbar.ax.set_ylabel('density: $p(x, y)$', fontsize=13)

    plt.savefig("path" + str(t).zfill(4) + '.png')

n=100
cov = np.array([
    [1, 2],
    [2, 5]
])

mu = np.array([[0],[0]])

xs = np.linspace(-3, 3, 100)
ys = np.linspace(-15, 15, 100)
for t, xy in enumerate(zip(xs, ys)):
    visualize_contour_conditioning(mu, cov, xy[0], xy[1], t)

</details>

{::options parse_block_html="false" /}

4. Summary

The main takeaway point stressed in the initially mentioned lecture{:target="_blank"} about Gaussians is that, when working exclusively with Gaussians, all probabilistic inference can be done with linear algebra. This is extremely convenient, because usually any marginalization or conditioning would involve complicated integrals and other intractable quantities that have to be approximated with additional methods. This is the main reason why the Gaussian distribution occurs so frequently in Machine Learning and especially Bayesian Statistics. Many interesting models such as Gaussian Processes or methods in Bayesian Deep Learning rely heavily on the Gaussian distribution and work only because of its convenient properties. Hence, this set of notes can hopefully serve as a small stepping stone when exploring these topics in greater detail.

Date: October 11, 2023

Author: Nils Lehmann

Created: 2024-03-13 Wed 15:58