Probability Distributions Part II – Gaussian, Exponential

In part one of this series, I covered two very basic probability distributions – Bernoulli and multinoulli. If you want to find out more about those, or if you wish to learn a bit about what probability distributions, discrete random variables, and continuous random variables are, go here.

In this post, we’re covering two distributions that are slightly more complicated, but more widely used, than the previous two: the Gaussian and the exponential distributions.

Gaussian distribution

Gaussian, or as some people might call it – normal, distribution is the most widely used probability distribution in machine learning, and in statistics for that matter. Unlike the previous two, which were discrete, it deals with real numbers. The formula goes something like this:


N(x, \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)

Now let’s go over the variables.

\mu \in \mathbb{R} – this variable is called the mean of the distribution, because \mathbb{E}[x]=\mu. It gives the x coordinate for the central peak of the graph.

\sigma \in (0, \infty) – this variable gives the standard deviation of the distribution; the variance is \sigma^2.
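To make the formula concrete, here's a minimal sketch in Python (assuming NumPy and SciPy are available; gaussian_pdf is just an illustrative name) that evaluates the density by hand and checks it against SciPy's implementation:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Density of N(x, mu, sigma^2), written out from the formula above."""
    return np.sqrt(1.0 / (2.0 * np.pi * sigma**2)) * np.exp(-(x - mu)**2 / (2.0 * sigma**2))

x = np.linspace(-4.0, 4.0, 9)
mu, sigma = 0.0, 1.0

# The hand-rolled density should match SciPy's implementation.
print(np.allclose(gaussian_pdf(x, mu, sigma), norm.pdf(x, loc=mu, scale=sigma)))  # True
```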

Oftentimes, because we need to invert \sigma^2 to calculate the PDF (probability density function), we use a parameter that controls the precision (inverse variance) of the distribution instead. That way, the computation is a bit more efficient. When we do that, we substitute \beta \in (0, \infty) for \frac{1}{\sigma^2}, and the equation looks like this:

N(x, \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta(x-\mu)^2\right)
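Here's a small sketch of the precision parameterization (again assuming NumPy and SciPy; the values of mu and sigma are arbitrary), checking that both forms give the same density:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf_precision(x, mu, beta):
    """Density of N(x, mu, beta^{-1}), where beta = 1 / sigma^2 is the precision."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu)**2)

x = np.linspace(-4.0, 4.0, 9)
mu, sigma = 0.0, 2.0
beta = 1.0 / sigma**2  # invert the variance once, up front

# Same density, different parameterization.
print(np.allclose(gaussian_pdf_precision(x, mu, beta), norm.pdf(x, loc=mu, scale=sigma)))  # True
```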

One huge advantage of the normal distribution is that the central limit theorem shows us that the sum of many independent random variables is approximately normally distributed.
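You can see the central limit theorem at work with a quick simulation (a sketch assuming NumPy; the sample sizes here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum 50 independent uniform variables, many times over.
n, trials = 50, 100_000
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)

# The sums cluster around n * E[U] = 25 with variance n * Var[U] = 50 / 12,
# and their histogram is bell-shaped even though each term is uniform.
print(sums.mean(), sums.std())  # roughly 25.0 and sqrt(50 / 12) ≈ 2.04
```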

Also, you can use the Gaussian distribution on the \mathbb{R}^n space. In that case, it's called a multivariate normal distribution, which is a bit too much for this blog post. You can read more about it here.

Exponential distribution

The exponential distribution comes in handy when we want a probability distribution with a sharp point at x = 0:

p(x, \lambda) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & x<0 \end{cases}

Here, \mathbb{E}[x] = \frac{1}{\lambda}, and Var[x] = \frac{1}{\lambda^2}.
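As a quick sanity check, here's a sketch (assuming NumPy; lam = 2.0 is an arbitrary rate) that implements the piecewise density above and verifies the mean and variance by sampling:

```python
import numpy as np

def exponential_pdf(x, lam):
    """The piecewise density from above: lambda * exp(-lambda * x) for x >= 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, lam * np.exp(-lam * np.maximum(x, 0.0)), 0.0)

rng = np.random.default_rng(0)
lam = 2.0

# NumPy parameterizes the exponential by the scale 1 / lambda, not the rate.
samples = rng.exponential(scale=1.0 / lam, size=100_000)

# Sample moments land close to E[x] = 1 / lambda = 0.5 and Var[x] = 1 / lambda^2 = 0.25.
print(samples.mean(), samples.var())
print(exponential_pdf([-1.0, 0.0, 1.0], lam))  # [0.0, 2.0, 2 * e^(-2)]
```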


Sources:

  • Deep Learning (Adaptive Computation and Machine Learning series), by I. Goodfellow, Y. Bengio, A. Courville
  • StatLect
  • Wikipedia