# The Gaussian Distribution

## Table of Contents

If you have spent some time in the Machine Learning world, you might have noticed that the Gaussian or Normal distribution appears with great frequency. In this great Probabilistic Machine Learning Course{:target=“_blank”}, Professor Philipp Henning spends an entire lecture just on the Gaussian distribution, and answers the question raised in the title. This post should summarize some of the Gaussian's nice propertires and underline them with some visualization.

TLDR: Gaussians are extremely useful because of their very convenient mathematical properties.

Ideally, this set of notes will serve as a small basis for more complex and interesting topics in the realm of Bayesian Machine and Deep Learning.

{::options parse_block_html="true" /}

<details>

<summary>

Python Import Statements

</summary>

import os import glob import math import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.axes_grid1 import make_axes_locatable from mpl_toolkits.mplot3d import Axes3D from matplotlib import cm from matplotlib.gridspec import GridSpec

</details>

{::options parse_block_html="false" /}

## 1. The Univariate Gaussian Distribution

### 1.1. Definition

\[

\begin{equation} \mathcal{N}(x;\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{1}{2}(x-\mu)^2) \end{equation}\]

Often you will see the notation *x* ∼ 𝒩(*μ*,/σ/2), which means that the
random variable x is distributed according to the normal distribution
that is parameterized by *μ* and /σ/2. In code, we can write the
univariate Gaussian as follows:

We can write the univariate Gaussian in code as follows:

def univar_normal(x, mu, var): """ computes the pdf univariate normal distribution Inputs: x: data mu: mean of normal var: variance of normal """ return ((1. / np.sqrt(2 * np.pi * var)) * np.exp(-(x - mu)**2 / (2 * var)))

Some plots might give an intuition about how *μ* changes the location,
and /σ/2 the shape of the normal distribution.

{::options parse_block_html="true" /}

<details>

<summary>

Visualization Code

</summary>

x = np.linspace(-6, 6, num=100) pdf1 = univar_normal(x, 0, 1) pdf2 = univar_normal(x, -1, 0.5) pdf3 = univar_normal(x, 2, 3) fig = plt.figure() plt.plot(x, pdf1, label='mu=0, var=1') plt.plot(x, pdf2, label='mu=-1, var=0.5') plt.plot(x, pdf3, label='mu=2, var=3') plt.xlabel('x') plt.ylabel('pdf') plt.title('Univariate Gaussian with different parameters') plt.legend()

</details>

{::options parse_block_html="false" /}

Looking at the figure, we can observe that *μ* decides where on the x
axis, the mean and mode of the distribution occurs. Furthermore, lower
variance values make the distribution more peaked, so values vary less
around the mean, while larger variance values spread the distribution
out.

## 2. The Multivariate Gaussian Distribution

### 2.1. Definition

The Multivariate Gaussian Distribution is defined as follows:

\[

\begin{equation*} \mathbf{N}(\mathbf{x};\mathbf{\mu}, \Sigma)=\frac{1}{(2\pi)^{n/2}}|\Sigma|^{1/2}\exp(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})) \end{equation*}\]

which we read as the random vector **x** being distributed according to a
normal distribution with parameters **μ** and *Σ*, where **x**, **μ** ∈ *R*/n/
and *Σ* ∈ *R//n* × *n*. Furthermore, *Σ* has to be symmetric positive
definite.

A matrix *Σ* is called symmetric positive (semi-) definite if *Σ* is
symmetric (*Σ* = *Σ//T*) and *a//T//Σ* *a* ≥ 0, ∀/a/ ∈ *R*/n/.

So for any vector *a* ∈ *R*/n/ the above multiplication must yield a
non-negative number. Equivalently, we can say that *Σ*'s eigenvalues are
non-negative.

Additionally, we note that as a probability distribution
∫*N*(**x**;*μ*,/Σ/) = 1 and **N*(*x**;*μ*,/Σ/) > 0 ∀/x/ ∈ *R*/n/. *Σ* is
called the covariance matrix, while /Σ/−1 is known as the precision
matrix.

Implemented in python, a multivariate Gaussian looks as follows:

def multivar_normal(x, mu, cov): """ computes pdf of the multivariate normal distribution Args: x: data of shape dxn mean: vector of shape dx1 cov: matrix of shape dxd """ d = mean.shape[0] x_m = x - mu return (1. / (np.sqrt((2 * np.pi)**d * np.linalg.det(cov))) * np.exp(-0.5*(x_m.T @ np.linalg.inv(cov) @ x_m)))

The following function creates a meshgrid that allows us to plot the pdf of a multivariate Gaussian over two dimensions.

{::options parse_block_html="true" /}

<details>

<summary>

Creating a mesh grid to visualize a 2D Gaussian

</summary>

def create_mesh_grid_gaussian(mu, cov): """ Create mesh grid over X and Y dimension, with Z being the pdf of the multivariate normal Args: mu: mean of a 2d gaussian cov: covariance matrix of 2d gaussian """ n = 100 mu_x, mu_y = mu[0][0], mu[1][0] sigma_x, sigma_y = cov[0,0], cov[1,1] x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n) y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n) X, Y = np.meshgrid(x,y) pdf = np.zeros((n, n)) for i in range(n): for j in range(n): pdf[i,j] = multivar_normal( np.matrix([[X[i,j]], [Y[i,j]]]), mu, cov) return X, Y, pdf

</details>

{::options parse_block_html="false" /}

### 2.2. The Covariance Matrix

In the multivariate case, the mean and the covariance also determine the location and shape of the distribution respectively. However, it might be a little less intuitive how the covariance matrix influences the shape, when considering two or more dimensions. It is hence worth looking at some example that hopefully give some intuition.

We can visualize a two dimensional Gaussian with either 3D plots or contour plots. Considering the 3D plots on the left, the respective contour plot is a "bird-eye-view" from directly above. In the contour plot any two points on a given contour lines have the same density value. The dashed horizontal and vertical line on the right plot show the x and y axis.

The simplest case is a unit Gaussian, which implies a mean at the origin and the identity matrix as the covariance so that both components x and y have a variance of 1.

Next, we can scale the individual components, which will "stretch" the distribution depending on the values and create an ellipsoid shape. Notice, however, that the distribution is still aligned with the axis.

If we move away from a diagonal matrix, and insert non-zero off-diagonal entries, then we imply that the respective components are non-independent, meaning that if one changes, the other will change as well. This causes the distribution to change its orientation and it is no longer oriented along the axis.

## 3. Properties of Multivariate Gaussians

### 3.1. Connection to Exponential Families

One nice feature of Gaussians is that they are part of the Exponential
Family. Recall, that a distribution is in the exponential family if it
can be manipulated into the following form
*p/(/x*) = *h/(/x*)exp (*θ//T//T/(/x*)−/A/(*θ*)).

The general recipe for showing that a distribution belongs to the Exponential family is the following:

- Identify the parameters of the given distribution
- Identify the support of the random variable
- Take the log probability
- identify
*h/(/x*), the sufficient statistic*T/(/x*) and the natural parameters*η/(/θ*), and*A/(/η*), where in this case*x*is the random variable, and*θ*represents the original distribution parameters. - Rewrite the distribution in terms of the natural parameters.

Following these steps:

- The parameters are
**μ**, and*Σ* - The support is the real line, so
**x**∈ *R*/n/ - The term \(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\) captures an interaction between sufficient statistic and parameter, however, we need a scalar product just between parameters and sufficient statistic. Therefore, we apply the Trace trick:

"vec" indicates the stacking of columns to obtain a vector, which means that we now have a scalar product between the parameters and sufficient statistics.

Then, we can identify the following: \[

\begin{align*} h(x) &= -\frac{D}{2}\log(2\pi) \\ \eta_1^T &= \mathbf{\mu}^T\Sigma^{-1}\\ \eta_2 &= -\frac{1}{2}\Sigma^{-1}\\ T_1(x) &= \mathbf{x} \\ T_2(x) &= \mathbf{x}\mathbf{x}^T\\ A(\eta)&=-C(\mathbf{\mu}, \Sigma)=\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu}+\frac{1}{2}\log|\Sigma| \end{align*}\]

- Now, we need to solve for the parameters
*μ*and*Σ*, so that we can rewrite the distribution in terms of all its natural parameters.

\[

\begin{align*} \eta_1^T=\mathbf{\mu}^T\Sigma^{-1} &\Rightarrow \eta_1^T\Sigma=\mathbf{\mu}^T \Rightarrow -\frac{1}{2}\eta_1\eta_2^T=\mathbf{\mu}^T \Rightarrow \mathbf{\mu}=-\frac{1}{2}\eta_2^{-T}\eta_1 \\ \eta_2=-\frac{1}{2}\Sigma^{-1} &\Rightarrow \Sigma=-\frac{1}{2}\eta_2^{-1} \Rightarrow \Sigma^{-1}=-2\eta_2 \end{align*}\]

Now, we can use these equations and substitute into log partition function.

\[

\begin{align*} A(\mathbf{\eta})&=-C(\mathbf{\mu}, \Sigma)=\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu}+\frac{1}{2}\log|\Sigma| \\ &= \frac{1}{2}\eta_1^T(-\frac{1}{2}\eta_2^{-T}\eta_1)+\frac{1}{2}\log\det(-\frac{1}{2}\eta_2^{-1})\\ &= -\frac{1}{4}\eta_1^T\eta_2^{-T}\eta_2-\frac{1}{2}\log\det(2\eta_2) \end{align*}\]

Then, the rewritten exponential form is: \[

\begin{align*} p(\mathbf{x}|\mathbf{\eta})=(2\pi)^{-\frac{D}{2}} \exp(\eta_1^T\mathbf{x}+Tr(\eta_2(\mathbf{x}\mathbf{x}^T))-A(\mathbf{\eta})) \end{align*}\]

The exponential family has several useful properties, that the
Wikipedia{:target=“_blank”}
article summarizes nicely. One central property of exponential families
is that all members have conjugate priors, which is especially desirable
for Bayesian methods, as distributions with conjugate priors imply that
the posterior distribution is tractable. Furthermore, it is really
simple to derive the moments of the distribution, as these are just
derivatives of the log partition function *A/(/η*).

### 3.2. Products of Gaussians are Gaussians

This property implies that when multiplying two or multiple Gaussians, the result is again a Gaussian, albeit unnormalized. Thus if we consider,

𝒩(*x*|/μ//a/,/Σ//a/) and 𝒩(*x*|μ_b, Σ_b) (note that both distributions have x as their random variable), then their product is the following:

\[

\begin{equation*} \mathbf{N}(\mathbf{x};\mathbf{\mu}, \Sigma)=\frac{1}{(2\pi)^{n/2}}|\Sigma|^{1/2}\exp(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})) \end{equation*}\]

For visualization purposes, we can consider two different Gaussian joint distributions, each of the following form:

\[

\begin{equation*} \begin{bmatrix} x \\ y\end{bmatrix} \sim \mathcal{N}(\mathbf{\mu}, \Sigma)=\mathcal{N}(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{yy}\end{bmatrix}) \end{equation*}\]

The following code snippet shows how we can compute the product of two 2D-Gaussians and what this visually looks like.

{::options parse_block_html="true" /}

<details>

<summary>

Product of two gaussians

</summary>

def prod_2D_gaussian(mu_list, cov_list): """ Computes the gaussian product for an arbitrary number of Gaussians Args: mu_list: list that contains mu for each respective Gaussian cov_list: list that contains cov for each respective Gaussian Return: pdf: pdf of the multiplied gaussians """ mu1, mu2 = mu_list cov1, cov2 = cov_list precs1 = np.linalg.inv(cov1) precs2 = np.linalg.inv(cov2) cov3=np.linalg.inv(precs1+precs2) mu3=cov3 @ (precs1 @ mu1 + precs2 @ mu2) z3 = np.power(np.linalg.det(2*np.pi*(cov1+cov2)), (-0.5)) * np.exp(-0.5*(mu1-mu2).T @ np.linalg.inv(cov1+cov2) @ (mu1-mu2)) return create_mesh_grid_gaussian(mu3, cov3) def plot_2D_densities(pdfs, Xs, Ys): """ plots the densities of multiple gaussians in one figure Args: pdfs: list of pdfs of gaussians Xs: list of X grid values Ys: list of Y grid values """ color = ["orange", "blue", "black"] fig = plt.plot() for idx, pdf in enumerate(pdfs): plt.contour(Xs[idx], Ys[idx], pdf, levels=3, colors=color[idx], alpha=0.5) plt.xlabel("x") plt.ylabel("y") plt.show() cov1 = np.array([ [1, 0.7], [0.7, 1] ]) mu1 = np.array([[0], [0]]) cov2 = np.array([ [1, -0.5], [-0.5, 1] ]) mu2 = np.array([[0], [0]]) mu_list = [mu1, mu2] cov_list = [cov1, cov2] x1, y1, pdf1 = create_mesh_grid_gaussian(mu1, cov1) x2, y2, pdf2 = create_mesh_grid_gaussian(mu2, cov2) x3, y3, pdf3 = prod_2D_gaussian(mu_list, cov_list) pdfs = [pdf1, pdf2, pdf3] Xs = [x1, x2, x3] Ys = [y1, y2, y3] plot_2D_densities(pdfs, Xs, Ys)

</details>

{::options parse_block_html="false" /}

### 3.3. Affine Transformations of Gaussians are Gausians

An affine transformation is a geometric transformation that preserves
lines, parallel lines and ratios. Consider **x** ∼ 𝒩(**μ**,/Σ/), then *A*
*X* + *b* ∼ 𝒩(*A/*μ*,/A*Σ*A//T*) for any *μ* ∈ ℛ/n/, *Σ* ∈ ℛ/n*x*n/ and
any *A* ∈ ℛ/m*x*n/, *b* ∈ ℛ/m/. This means that we can project the joint
probability of two or more variables to a lower dimensional space and
the distribution in that space will still be a Gaussian. A special case
of an affine transformation that occurs very frequently, when working
with Gaussians is marginalization.

### 3.4. Marginals of Gaussians are Gaussians

Recall the definition of the joint distribution in a previous section.
When marginalizing, we are summing out all other variables except the
variable whose marginal distribution we are interested in. So if we
consider our joint distribution from before and let *A* = (1,0),
*b* = 0, then applying the previous equations of affine transformations,
we obtain the marginal

\[

\begin{align*} AZ+b\sim\mathcal{N}(A\mathbf{\mu}+b, A\Sigma A^T)=\mathcal{N}(x, \Sigma_{xx}) \end{align*}\]

Essentially for the case of multivariate Gaussians, obtaining the marginal distribution of one particular variable comes down to simply selecting the variable of interest and "dropping all others". So if we had a k dimensional space, we could define an affine transformation in form of a one-hot encoded row vector, where only the entry of the dimension we are interested in gets a 1 and all other entries are 0. Then applying this transformation to the join multivariate distribution, we would obtain the marginal distribution for this particular variable. For other distributions, marginalization most likely requires difficult integrals that are often intractable and can only be approximated.

Based on code found here{:target=“_blank”}, the figure below shows the two marginal distributions for a given joint distribution on the x and y axis respectively.

{::options parse_block_html="true" /}

<details>

<summary>

Marginalization of X and Y from their joint distribution

</summary>

n=100 cov = np.array([ [1, 0], [0, 5] ]) mu = np.array([[0],[0]]) mu_x, mu_y = mu[0][0], mu[1][0] sigma_x, sigma_y = cov[0,0], cov[1,1] X, Y, pdf = create_mesh_grid_gaussian(mu, cov) min_x = np.min(X) max_x = np.max(X) min_y = np.min(Y) max_y = np.max(Y) fig = plt.figure(figsize=(7, 7)) gs = GridSpec( 2, 2, width_ratios=[2, 1], height_ratios=[1, 2]) plt.suptitle('Marginal distributions', y=0.93) # plot contour in the bottom left ax1 = plt.subplot(gs[2]) con1 = ax1.contourf(X, Y, pdf, levels=10) ax1.set_xlabel('$x$', fontsize=13) ax1.set_ylabel('$y$', fontsize=13) ax1.yaxis.set_label_position('right') ax1.xaxis.set_label_position('top') # Plot x in top left ax2 = plt.subplot(gs[0]) x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n) px = univar_normal(x, mu_x, sigma_x) ax2.plot(x, px, 'r--', label=f'$p(x)$') ax2.legend(loc=0) ax2.set_ylabel('density', fontsize=13) ax2.yaxis.set_label_position('right') ax2.set_xlim(math.floor(min_x), math.ceil(max_x)) # Plot bottom right ax3 = plt.subplot(gs[3]) y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n) py = univar_normal(y, mu_y, sigma_y) ax3.plot(py, y, 'b--', label=f'$p(y)$') ax3.legend(loc=0) ax3.set_xlabel('density', fontsize=13) ax3.set_ylim(math.floor(min_y), math.ceil(max_y)) # plot top right ax4 = plt.subplot(gs[1]) ax4.set_visible(False) divider = make_axes_locatable(ax4) cax = divider.append_axes('left', size='20%', pad=0.05) cbar = fig.colorbar(con1, cax=cax) cbar.ax.set_ylabel('density: $p(x, y)$', fontsize=13) plt.savefig('gaussMarginal.jpg')

</details>

{::options parse_block_html="false" /}

### 3.5. Conditionals of Gaussians are Gaussians

Especially in Bayesian Statistics, conditional distributions are used very frequently. And it turns out that when working with Gaussians, conditioning also becomes extremely simple. Recall again, the joint distribution we defined earlier.

\[

\begin{align*} \begin{bmatrix} x \\ y\end{bmatrix} \sim \mathcal{N}(\mathbf{\mu}, \Sigma)=\mathcal{N}(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{yy}\end{bmatrix}) \end{align*}\]

Then, the conditional distribution of x given y is defined as follows: \[

\begin{align*} p(x|y)=\mathcal{N}(\mu_{x|y}, \Sigma_{x|y}) \end{align*}\] , where \[

\begin{align*} \Sigma_{x|y}&=\Sigma_{xx}-\Sigma_{yx}\Sigma_{yy}^{-1}\Sigma_{xy}\\ \mu_{x|y}&=\mu_x+\Sigma_{yx}\Sigma_{yy}^{-1}(y-\mu_y) \end{align*}\]

The full proof is more involved but can be found here{:target=“_blank”]. However, a visualization might again demonstrate this more clearly. Based on the following code snippet{:target=“_blank”} we can visualize that, when conditioning on x in the top figure (on y in the bottom figure), the distribution (seen in black) is still a normal distribution.

{::options parse_block_html="true" /}

<details>

<summary>

Conditional 2D Gaussian in 3D

</summary>

def plot_3D_gauss_with_plane(mu, cov, x_plane, t): """ plot gaussian with a plane and gaussian slice Args: mu: mean vector of the gaussian cov: covariance matrix of the gaussian x_plane: x-value that we want to condition on and place the plane t: just a time step to save an jpg at each time step of moving plane """ mu_x, mu_y = mu[0][0], mu[1][0] sigma_x, sigma_y = cov[0,0], cov[1,1] #Create grid and multivariate normal n=100 x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n) y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n) X, Y = np.meshgrid(x,y) Z = np.zeros((n, n)) for i in range(n): for j in range(n): Z[i,j] = multivar_normal( np.matrix([[X[i,j]], [Y[i,j]]]), mu, cov) # Create plane x_p = x_plane y_p = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=100) z_p = np.linspace(0,0.02,500) Y_p, Z_p = np.meshgrid(y_p, z_p) # Finding closest idx values of X mesh to x_p tol = 1e-4 idx_x_p = (np.where(x < x_p+tol) and np.where(x > x_p-tol))[0][0] # Select the corresponding values of X, Y, Z (carefully switch X and Y) x_c, y_c, z_c = Y[idx_x_p], X[idx_x_p], Z[idx_x_p] # Plot fig = plt.figure() ax = fig.add_subplot(1, 1, 1, projection='3d') ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,linewidth=0,zorder=0) ax.plot_surface(x_p, Y_p, Z_p, color='black',linewidth=0, alpha=0.5,zorder=10) ax.plot(x_c,y_c,z_c,zorder=5, color='black') ax.view_init(elev=20) cov = np.array([ [1, 0], [0, 1] ]) mu = np.array([[0],[0]]) xs = np.linspace(-6, 6, 100) for t, x in enumerate(xs): plot_3D_gauss_with_plane(mu, cov, x, t)

</details>

{::options parse_block_html="false" /}

<img textAlign="center" src="../assets/images/2021_11_16/slicingGaussianX.gif">

<img textAlign="center" src="../assets/images/2021_11_16/slicingGaussianY.gif">

We can do a similar visualization just with the contour plot, where the dashed line shows what we condition on.

{::options parse_block_html="true" /}

<details>

<summary>

Conditional Gaussian 2D contour plot

</summary>

def visualize_contour_conditioning(mu, cov, x_cond, y_cond, t): """ function to visualize contour of joint probability and conditional distribution Args: mu: mean of joint probability cov: covariance of joint probability x_cond: float value to condition x on y_cond_ float value to condition y on t: time step value to save the figure """ mu_x, mu_y = mu[0][0], mu[1][0] sigma_xx, sigma_yx, sigma_xy, sigma_yy = cov[0,0], cov[0,1], cov[1,0], cov[1,1] mu_x_given_y = mu_x + sigma_yx * (1/sigma_yy) * (y_cond-mu_y) cov_x_given_y = sigma_xx - sigma_yx * (1/sigma_yy) * sigma_yx mu_y_given_x = mu_y + sigma_yx * (1/sigma_xx) * (x_cond-mu_x) cov_y_given_x = sigma_yy - sigma_yx * (1/sigma_xx) * sigma_yx X, Y, pdf = create_mesh_grid_gaussian(mu, cov) min_x = np.min(X) max_x = np.max(X) min_y = np.min(Y) max_y = np.max(Y) fig = plt.figure(figsize=(7, 7)) gs = GridSpec( 2, 2, width_ratios=[2, 1], height_ratios=[1, 2]) plt.suptitle('Conditional distributions', y=0.93) # plot the contour bottom left ax1 = plt.subplot(gs[2]) con1 = ax1.contourf(X, Y, pdf, levels=10) # y=1 that is conditioned upon ax1.plot([math.floor(min_x), math.ceil(max_x)], [y_cond, y_cond], 'r--', alpha=0.5) # x=-1. that is conditioned upon ax1.plot([x_cond, x_cond], [math.floor(min_y), math.ceil(max_y)], 'b--', alpha=0.5) ax1.set_xlabel('$x$', fontsize=13) ax1.set_ylabel('$y$', fontsize=13) ax1.yaxis.set_label_position('right') ax1.xaxis.set_label_position('top') # Plot x in top left ax2 = plt.subplot(gs[0]) x = np.linspace(mu_x - 3*sigma_x, mu_x + 3*sigma_x, num=n) px = univar_normal(x, mu_x_given_y, cov_x_given_y) ax2.plot(x, px, 'r--', label=f'$p(x|y)$') ax2.legend(loc=0) ax2.set_ylabel('density', fontsize=13) ax2.yaxis.set_label_position('right') ax2.set_xlim(math.floor(min_x), math.ceil(max_x)) # Plot bottom right ax3 = plt.subplot(gs[3]) y = np.linspace(mu_y - 3*sigma_y, mu_y + 3*sigma_y, num=n) py = univar_normal(y, mu_y_given_x, cov_y_given_x) ax3.plot(py, y, 'b--', label=f'$p(y|x)$') ax3.legend(loc=0) ax3.set_xlabel('density', fontsize=13) ax3.set_ylim(math.floor(min_y), math.ceil(max_y)) # plot top right ax4 = plt.subplot(gs[1]) ax4.set_visible(False) divider = make_axes_locatable(ax4) cax = divider.append_axes('left', size='20%', pad=0.05) cbar = fig.colorbar(con1, cax=cax) cbar.ax.set_ylabel('density: $p(x, y)$', fontsize=13) plt.savefig("path" + str(t).zfill(4) + '.png') n=100 cov = np.array([ [1, 2], [2, 5] ]) mu = np.array([[0],[0]]) xs = np.linspace(-3, 3, 100) ys = np.linspace(-15, 15, 100) for t, xy in enumerate(zip(xs, ys)): visualize_contour_conditioning(mu, cov, xy[0], xy[1], t)

</details>

{::options parse_block_html="false" /}

## 4. Summary

The main take away point that is stressed in the initially mentioned lecture{:target=“_blank”} about Gaussians, is that when working with Gaussians exclusively, all probabilistic inference can be done with Linear Algebra. This is extremely convenient, because usually any marginalization or conditioning would involve complicated integrals and other intractable quantities that have to be approximated with additional methods. This is the main reason, why the Gaussian distribution occurs so frequently in Machine Learning and especially Bayesian Statistics. Many interesting models such as Gaussian Processes or methods in Bayesian Deep Learning rely heavily on the Gaussian distribution and work only because of its conveniet properties. Hence, this set of notes can hopefully serve as a small stepping stone when exploring these in greater detail.