The primary function of a feedforward neural network is to make a prediction of some sort. The most popular task handled by feedforward neural networks is **classification**, also called **categorization**.

Classification is a task in which a program is handed some data and assigns it to one of several predefined categories. One popular type of classification is image recognition.

For example, a neural network is fed an image and it has to predict whether the image contains a cat or a dog. This problem is very popular with people who are just starting to learn machine learning.

The neural network is usually designed to output a vector containing probabilities of the data being in one of the possible categories. For example, in the case of recognizing whether an image contains a dog or a cat, the neural network could output a vector [0.3, 0.7]. If the network is trained in such a way that the first value in the output vector represents the probability that the image contains a dog and the second value represents the probability that the image contains a cat, then the network is 70% certain that the image contains a cat.
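Reading a prediction off such an output vector can be sketched in a few lines of NumPy. The `CLASS_NAMES` list and the `interpret` helper are names made up for this illustration, and the class order (dog first, cat second) is simply the one assumed in the paragraph above.

```python
import numpy as np

# Assumed class order: index 0 is "dog", index 1 is "cat".
CLASS_NAMES = ["dog", "cat"]

def interpret(output):
    """Return the most likely class name and the probability assigned to it."""
    output = np.asarray(output, dtype=float)
    index = int(np.argmax(output))  # position of the largest probability
    return CLASS_NAMES[index], float(output[index])

name, probability = interpret([0.3, 0.7])
print(name, probability)  # cat 0.7
```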

The network is trained on data which is split into two parts: the data itself and the labels. The norm today is to use what are called one-hot encoded labels.

A one-hot encoded label is a vector of size n, where n is the number of possible outcomes, which is almost entirely populated with zeros: it contains only a single 1, at the position of the correct category.

That means that a label in the dog-cat classification problem would look like this: [0, 1]. We’ve previously defined that the first value tells us the probability that the image contains a dog, and the second that it contains a cat. So, in our little example, the label says that the image contains a cat with 100% probability.
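As a rough illustration, a one-hot label can be built with NumPy like this. The `one_hot` function is a helper written for this post, not a library call, and it assumes categories are numbered from 0 (dog = 0, cat = 1).

```python
import numpy as np

def one_hot(class_index, num_classes):
    """Build a vector of zeros with a single 1 at position class_index."""
    label = np.zeros(num_classes)
    label[class_index] = 1.0
    return label

# Label for an image of a cat in the two-class dog-cat problem.
cat_label = one_hot(1, 2)
print(cat_label)  # [0. 1.]
```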

We train our network to output a vector which is as close to the label as possible.

In order to do that, we have a couple of types of output units which help us.

## Sigmoid

Sigmoid is the first type of output unit, and activation unit, that most people learn. It is very simple, and it used to be much more widely used.

A sigmoid unit is used when predicting the value of a binary variable. That means it can predict for only two cases.

That also means that the only probability distribution which the sigmoid unit is able to predict is the Bernoulli distribution. I’ve covered the basics of the Bernoulli distribution in this blog post.

A sigmoid output unit is defined by $latex y = \sigma (w^Th+b) &s=1 $, where $latex \sigma &s=1 $ is the sigmoid function:

$latex \sigma(x) = \frac{1}{1 + e^{-x}} &s=4 $

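The sigmoid function defines an S-shaped curve: it approaches 0 for large negative inputs, passes through 0.5 at x = 0, and approaches 1 for large positive inputs.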
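A minimal NumPy sketch of the sigmoid output unit follows. The function names and the values of `h`, `w` and `b` are invented for illustration; in a real network `h` would come from the previous layer and `w` and `b` would be learned.

```python
import numpy as np

def sigmoid(x):
    """The sigmoid function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_unit(h, w, b):
    """Compute y = sigmoid(w^T h + b) for a single output unit."""
    return sigmoid(np.dot(w, h) + b)

# Hypothetical inputs, chosen only so that w^T h + b = 0.5.
h = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.5, 0.25])
b = 0.0

y = float(sigmoid_unit(h, w, b))
print(round(y, 4))   # 0.6225
print(sigmoid(0.0))  # 0.5
```

The result `y` can be read as the probability of one of the two cases, and `1 - y` as the probability of the other.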

## Softmax

The softmax unit is used when we want to represent a probability distribution over a discrete set of possible outcomes. This means that we use the softmax unit when representing a Multinoulli distribution, which I’ve also covered in this blog post.

When using the softmax unit for classification, first we need a standard linear layer:

$latex z = W^Th + b &s=4 $

Once we’ve computed the logits, we can then compute the prediction vector with the softmax function:

$latex \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} &s=4 $

The softmax function, unlike the sigmoid function, does not have a characteristic graphical representation. Instead, its output is usually drawn as a histogram of irregular shape, with one bar per category. The histogram of the Multinoulli distribution is irregular because, unlike the Gaussian distribution, it has no fixed shape: each category can be assigned any probability, as long as all of them sum to 1.

The output vector of the softmax layer contains values in the interval [0, 1]. Just like with the sigmoid function, each number represents a probability. The sum of all values in the output vector always equals 1.
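The linear layer and the softmax function described in this section can be sketched in NumPy as follows. The function names and the values of `h`, `W` and `b` are made up for illustration. One detail goes beyond the formula as written: the largest logit is subtracted before exponentiating. This is a common numerical-stability trick that leaves the result unchanged but avoids overflow for large logits.

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into a vector of probabilities."""
    z = np.asarray(z, dtype=float)
    # Subtracting max(z) does not change the result, but it keeps
    # np.exp from overflowing (this step is not in the formula above).
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def softmax_layer(h, W, b):
    """Compute z = W^T h + b, then apply softmax to the logits z."""
    z = W.T @ h + b
    return softmax(z)

# Hypothetical values: 2 input features, 3 possible categories.
h = np.array([1.0, 2.0])
W = np.array([[0.5, -0.5, 0.0],
              [0.25, 0.75, -1.0]])
b = np.zeros(3)

probabilities = softmax_layer(h, W, b)
print(probabilities)        # one probability per category
print(probabilities.sum())  # sums to 1, up to floating-point rounding
```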

## References

- Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. 2016. Deep Learning. MIT Press.