Gradient descent is the backbone of a lot of machine learning algorithms, deep learning included. It is used only during the training, and it is the most computationally expensive part of machine learning.

Gradient descent is a very complicated mathematical topic which I cannot explain in its entirety in a single blog post. However, I will try my best to cover as much of the basics, in as simple to understand ways as possible.

In this blog post, I will not cover any optimization algorithms like Adam or Adagrad, but we will be going over the basic algorithm of gradient descent.

## What is gradient descent?

Gradient descents, also known as the method of steepest descent, is a mathematical method in which we try to minimize or maximize an **objective function**, or **criterion**.

The function which we usually are trying to optimize is a cost function, also known as the loss or the error function.

Usually, this method returns either a scalar or a vector value which tells us how different the prediction of the deep learning model is from the labeled expected result.

## The derivative

The derivative is a crucial concept of mathematics, specifically of calculus. It is used extensively in machine learning.

The derivative measures the steepness of a function at some particular point on the graph. It is the slope of the function.

There are several ways of defining the derivative.

$latex \frac{dy}{dx}x_0 = f'(x_0) = lim_{h \to 0} \frac{f(x_0 + h) – f(x_0)}{h} &s=3 $

The derivative is used for minimizing a function because it tells us how to change *x* in order to make a small improvement in *y*

$latex f(x) = f(x) – \epsilon f'(x) &s=3$

We can reduce *f(x)* by moving *x* in small steps with the opposite sign of the derivative.

## Critical points

Critical points are those points where the derivative gives us no information about in which direction we need to move in order to minimize the function. In another words $latex f'(x) = 0$.

There are a cuople of critical points.

A **local minimum** is a point where *f(x)* is lower than all neighboring points.

A **local maximum** is a point where *f(x)* is higher than at all neighboring points.

Then, there is also the **saddle point**. That is a point which is neither a minimum nor a maximum.

Alongside the local minimum, there is also the **global minimum**. The global minimum is a point what obtains the absolute lowest value of *f(x)*. There can be only one global minimum, but there can be multiple local minima.

For functions which multiple inputs, we use the **partial derivative**. The partial derivative measures how *f* changes as only the variable x[i] increases at the point x.

It is denoted as:

$latex \frac{\partial}{\partial x_i}f(x) &s=3$

## The gradient

The gradient is a crucial part of gradient descent – hence the name.

The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector. That means, that the gradient of *f* is a vector containing all the partial derivatives with respect to all the values in the input vector. It is denoted as:

$latex \nabla_x f(x) &s=3$

* i*th element of the gradient is the partial derivative of

*f*with respect to x[i].

## Method of steepest descent

Now that we know the basics of calculus we need, we can examine how the method of the steepest descent suggests a new point.

$latex x’ = x – \epsilon \nabla_x f(x) &s=3$

The value $latex \epsilon $ is the **learning rate** – a scalar value which determines how big of a step are we going to take in the direction opposite to the gradient in order to improve our probabilistic model.

There are a couple of methods used to find the optimal learning rate, but most of the time, you would just try a couple of values and see which one works the best. This method is called **line search**.

## Further research

This next section does not fall into the category of mathematical *“basics,” *but it is good to know a thing or two about these concepts.

#### Jacobian matrix

The Jacobian matrix is used to store all the partial derivatives of a function *f* whose both input and output are vectors. It is defined as:

$latex J_{i,j} = \frac{\partial}{\partial x_j}f(x)_i &s=3$

#### Second derivative

The second derivative is the *derivative of the derivative*. It is denoted as:

$latex \frac{\partial^2}{\partial x^2}f(x) = f”(x) &s=3$

The second derivative is used to see how does the derivative change as we vary the input.

#### Hessian matrix

Hessian matrix is similar to the Jacobian matrix in the sense that it also stores values for functions whose both input and output are vectors. The difference is that the Jacobian matrix stores the all the derivatives, and the Hessian matrix stores all the **second derivatives**. It is defined such that:

$latex H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x) &s=3$

The Hessian is the Jacobian of the gradient.

## References

- http://www.ugrad.math.ubc.ca/coursedoc/math100/notes/derivs/deriv5.html
- Deep Learning (Adaptive Computation and Machine Learning series), by I.Goodfellow, Y.Bengio, A.Courville, http://www.deeplearningbook.org/
- Wikipedia: https://en.wikipedia.org/wiki/Gradient_descent