Activation functions motivated by examples

If you are reading this, you probably already know what a neural network is and what activation functions are. Still, some introductory machine learning courses do not explain clearly enough why we need these activation functions. Do we need them at all? Would neural networks work without them?
Let us first remind ourselves of a few things about neural networks. They are usually represented visually as a graph-like structure such as the one below:

Above is a neural network with 3 layers: an input, a hidden, and an output layer, consisting of 3, 4, and 2 neurons, respectively.
The input layer has as many nodes as the number of features of your dataset. For the hidden layer, you are free to choose how many nodes you want, and you can use more than one hidden layer.
Each neuron in the network, except those in the input layer, can be thought of as a linear classifier: it takes as input all the outputs of the neurons in the previous layer and computes a weighted sum of those plus a bias term. The neurons in the next layer then take as input the values computed by this layer of linear classifiers, compute their own weighted sums, and so on. We hope that, by combining linear classifiers in this way, we can construct more complex classifiers that can represent non-linear patterns in our data.
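To make this concrete, here is a minimal NumPy sketch of what one such neuron computes; the input values and weights are made up purely for illustration:

```python
import numpy as np

# What a single neuron (outside the input layer) computes: a weighted sum of
# the previous layer's outputs plus a bias term. The numbers are made up.
inputs  = np.array([0.5, -1.2, 3.0])   # outputs of the 3 neurons in the previous layer
weights = np.array([0.8,  0.1, -0.4])  # one weight per incoming connection
bias    = 2.0

output = np.dot(weights, inputs) + bias
print(output)  # 0.8*0.5 + 0.1*(-1.2) + (-0.4)*3.0 + 2.0 = 1.08
```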
Let us take a look at the following example dataset:

This dataset is not linearly separable: we cannot separate one class from the other with a single line. But we can do this separation by using 2 lines as the decision boundary.

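The exact points from the figure are not reproduced here, but a hypothetical stand-in with the same structure (one class between two lines, the other outside them) can be generated like this, assuming for illustration that the two lines are x2 = x1 - 1 and x2 = x1 + 1:

```python
import numpy as np

# A hypothetical stand-in for the plotted dataset (the exact points from the
# figure are not reproduced here): one class lies between the two lines
# x2 = x1 - 1 and x2 = x1 + 1, the other class lies outside them.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
d = X[:, 0] - X[:, 1]                         # signed offset from the line x2 = x1
keep = (np.abs(d) < 0.8) | (np.abs(d) > 1.2)  # drop points too close to the boundary
X, d = X[keep], d[keep]
y = (np.abs(d) < 0.8).astype(int)             # 1 = between the lines, 0 = outside
```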
So, we may think that 2 intermediate neurons would do the job. These 2 neurons would learn the 2 separating lines in the image above. Then an output neuron would take the outputs of these 2 neurons as input and do the classification correctly.

For the last neuron to do the classification correctly, it needs the outputs of the hidden neurons n1 and n2 to be linearly separable when plotted in a 2D plane. The 2 lines plotted above have the equations:

This means that the 2 hidden neurons are computing the following linear combinations of the inputs x1 and x2:

Let us now plot n1 and n2 and see if they helped us.

And we are disappointed by our little neural network. The outputs of n1 and n2 are still not linearly separable, so the output neuron cannot do the classification correctly. So, what is the problem? The issue is that any linear combination of linear functions is still linear, and it is not hard to convince yourself of this on a piece of paper. There is a proof of this fact at the end of this article. So, no matter how many layers or how many neurons we use, if we proceed the way we did so far, our neural network will still be just a linear classifier.
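Here is a small NumPy sketch of this collapse, using made-up weights: a 2-input, 2-hidden-neuron, 1-output network with no activations is exactly equivalent to a single linear classifier.

```python
import numpy as np

# A tiny 2 -> 2 -> 1 network with NO activation functions and made-up weights.
W1 = np.array([[ 1.0, -1.0],   # weights of the two hidden neurons n1, n2
               [-2.0,  0.5]])
b1 = np.array([1.0, -0.5])     # biases of n1, n2
W2 = np.array([0.7, -1.2])     # weights of the output neuron
b2 = 0.3                       # bias of the output neuron

def network(x):
    hidden = W1 @ x + b1       # n1 and n2: just weighted sums plus biases
    return W2 @ hidden + b2    # output neuron: another weighted sum plus bias

# The whole network collapses to a single linear classifier w.x + b:
w = W2 @ W1                    # effective weights
b = W2 @ b1 + b2               # effective bias

x = np.array([0.4, -1.3])
print(network(x), w @ x + b)   # the same number twice (up to float rounding)
```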
We need something more. We need to take the weighted sum computed by each neuron and pass it through a non-linear function, and then treat the output of this function as the output of that neuron. These functions are called activation functions and, as you will see next in this article, they are essential in allowing a neural network to learn complex patterns in data.
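In code, the only change compared to the earlier neuron sketch is one extra function call; tanh is used below just as an example of such a non-linear function:

```python
import numpy as np

# The same weighted sum as in the earlier neuron sketch, but now passed through
# a non-linear activation function (tanh, as an example) before it is output.
inputs  = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8,  0.1, -0.4])
bias    = 2.0

pre_activation = np.dot(weights, inputs) + bias  # 1.08, as before
output = np.tanh(pre_activation)                 # ~0.79, the neuron's actual output
```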
It has been proven[1] that a neural network with 2 layers (not counting the input one) and non-linear activation functions can approximate any continuous function, provided it has a large enough number of neurons in those layers. So, if only 2 layers are enough, why do people use much deeper networks nowadays? Well, just because these 2-layer networks are "able" to learn anything does not mean that they are easy to optimize. In practice, if we give our network some overcapacity, it will give us good enough solutions even if it is not optimized as well as it could be.
There are many kinds of activation functions; two that we will use in the example above are ReLU (Rectified Linear Unit) and tanh (hyperbolic tangent), both shown below.




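Both functions are one-liners in NumPy; here is a quick sketch of their behavior:

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and clips negative values to 0
    return np.maximum(0.0, z)

def tanh(z):
    # tanh squashes any real number into the interval (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))  # [0.   0.   0.   0.5  2. ]
print(tanh(z))  # [-0.96 -0.46  0.    0.46  0.96]
```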
What will happen if we use the ReLU activation in our example? Below are plotted the outputs of neurons n1 and n2 after the ReLU activation is applied.

Now our two classes of points can be separated by a line, and thus the output neuron can classify them correctly.

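The weights learned in the figures are not reproduced here, but the effect can be sketched on the hypothetical band-shaped dataset from earlier, using made-up hidden neurons whose zero-level lines are the two separating lines:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# The same hypothetical band-shaped dataset as before (not the article's exact data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
d = X[:, 0] - X[:, 1]
keep = (np.abs(d) < 0.8) | (np.abs(d) > 1.2)
X, y = X[keep], (np.abs(d[keep]) < 0.8).astype(int)
x1, x2 = X[:, 0], X[:, 1]

# Hidden neurons whose zero-levels are the two lines x2 = x1 - 1 and x2 = x1 + 1.
n1 = relu(1.0 - (x1 - x2))  # positive for points above the line x2 = x1 - 1
n2 = relu(1.0 + (x1 - x2))  # positive for points below the line x2 = x1 + 1

# In the (n1, n2) plane a single line now separates the classes:
# "between" points have n1 + n2 ~ 2, "outside" points have n1 + n2 >= 2.2.
s = n1 + n2
print(s[y == 1].max(), s[y == 0].min())  # ~2.0 vs >= 2.2
pred = (s < 2.1).astype(int)             # the output neuron's decision rule
print((pred == y).mean())                # 1.0
```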
A similar thing happens if we use the tanh activation, but this time our points are separated even better, with a bigger margin.

Again, the output neuron can classify the points correctly.

Here is a brief mathematical proof of the fact that any linear combination of linear functions is still linear. Each neuron computes a linear function of its inputs, $f_i(x_1, \dots, x_n) = w_{i1} x_1 + \dots + w_{in} x_n + b_i$. A linear combination of such functions, with constant coefficients $c_0, c_1, \dots, c_m$, is

$$c_0 + \sum_{i=1}^{m} c_i f_i(x_1, \dots, x_n) = \Big(c_0 + \sum_{i=1}^{m} c_i b_i\Big) + \sum_{j=1}^{n} \Big(\sum_{i=1}^{m} c_i w_{ij}\Big) x_j = a_0 + a_1 x_1 + \dots + a_n x_n,$$

where $a_0, a_1, \dots, a_n$ are constants that do not depend on the inputs $x_1, \dots, x_n$.
References
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function". Mathematics of Control, Signals, and Systems. 2 (4): 303–314.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!