From zero to a working implementation

Image by Mike MacKenzie at Flickr

Artificial neural networks, the subject of our article, are mathematical models that are inspired by biological neural networks and are attempting to imitate them. As biological neural networks are a combination of more neurons aimed at a specific task, artificial neural networks are a combination of more linear models often called artificial neurons or perceptrons. If you want to find out more about a perceptron, which is the simplest neural network, you can read more about it in the following article:

Neural networks are usually represented visually as a graph-like structure like the one below:

Image by / CC BY-SA at Wikimedia Commons

Above is represented a neural network with 3 layers: input, hidden, and output layer, consisting of 3, 4, and 2 neurons. The input layer has as many nodes as the number of features of your dataset. For the hidden layer, you are free to choose how many nodes you want and you can use more than one hidden layer. Those networks with more than one hidden layer are named deep neural networks and are the main characters of the field of deep learning. Networks with just one hidden layer are usually called shallow neural networks. The output layer should have as many neurons as the number of variables you want to predict.

There are more kinds of neural networks: convolutional, recurrent, spiking neural networks, to name a few. But here we will discuss only about simple feed-forward neural networks or multi-layer perceptrons.

Neural networks can be used in a wide range of problems, for both classification and regression tasks. For simplicity, in the most part of the article we will focus only on the classification task and, as we will see later, the things that we learn through this article about neural networks can be easily translated to regression problems.

Each neuron in the network, except those in the input layer, can be thought of as being a linear classifier that takes as input all the outputs of the neurons in the previous layer and computes a weighted sum of those plus a bias term. Then, the neurons in the next layer will take as input the values computed by the previous layer of linear classifiers, then compute a weighted sum of those, and so on. Our hope is that, by combining linear classifiers in this way, we are able to construct more complex classifiers that can represent non-linear patterns in our data.

Let’s take a look at the following example:

This dataset is clearly not linearly separable, we can not separate one class from the other by a line. But we can do this separation by using 2 lines as the decision boundary.

So, we may think that 2 intermediate neurons would do the job. These 2 neurons will learn the 2 separation lines in the image above. And then we will need an output neuron that will take as input these 2 previous neurons, and then it will be able to do the classification right.

In order for the last neuron to do the classification right, it needs that the outputs of the n1 and n2 hidden neurons to be linearly separable if we plot them in a 2d plane. The 2 lines plotted above have the equations:

This means that the 2 hidden neurons are computing the following linear combinations of the input x1 and x2:

Let’s plot now n1 and n2 and see if they helped us.

And we are disappointed by our little neural network. The outputs of n1 and n2 are still not linearly separable and thus the output neuron can’t do the classification right.

So, what is the problem? The thing is that any linear combination of linear functions is still linear, and it is not hard to convince yourself on a piece of paper that this is true. So, no matter how many layers or how many neurons we use, the way we proceeded so far our neural network will still be just a linear classifier.

We need something more. We need to take the weighted sum computed by each neuron and pass it through a non-linear function, then consider the output of this function as the output of that neuron. These functions are called activation functions and as you can see in our example they are very important in allowing a neural network to learn complex patterns in data. It has been proven[1] that a neural network with 2 layers (except the input one) and non-linear activation functions is able to approximate any function, provided that it has a large enough number of neurons in those layers.

So, if only 2 layers are enough, why are people using much deeper networks nowadays? Well, just because these 2-layer networks are “able” to learn anything, it doesn’t mean that they are easy to optimize. In practice, if we give our network overcapacity, they will give us good enough solutions even if they are not optimized as good as they could.

There are more kinds of activation functions, two of which we want to use in the example above. These are ReLU (Rectified Linear Unit) and tanh (hyperbolic tangent) and are shown below.

Another way to think about neural networks is that they are trying to learn both the classification and the feature representations at the same time. If you read my article about perceptron, you saw towards the end of it how we used a linear classifier to solve a non-linear classification problem by adding non-linear features to our input vector.

But how would you always know what extra features to add to your input vector in order to obtain good results? Neural networks attempt to solve this problem by also learning what such feature representations would be good. In the hidden layers, a neural network actually learns the best representation of your input data in order to perform well in its last layer where the actual classification takes place.

Now, let’s continue with our example. What would happen if we use the ReLU activation in our example? Below are plotted the outputs of neurons n1 and n2 after the ReLU activation is applied.

Now our two classes of points can be separated by a line, and thus the output neuron can classify them correctly.

A similar thing happens if we use the tanh activation, but this time our points are separated even better by a bigger margin.

Again, the output neuron can classify the points correctly.

In my opinion, the common visual representation of a neural network, that nasty graph-like structure, can be a little confusing. Especially if you want to implement it, it is harder to think about what you need to do with all those connections and weights floating around. Thinking at parameter-level can be cumbersome, but thinking at a higher level in terms of vector and matrix operations that are done in the network is clearer, at least for me. Thinking about neural networks as a composition of functions, a chain of vector functions, has helped me in the implementation that I did, and I hope it will also help you. Take a look at the image below:

If we represent the input and output values of each layer as vectors, the weights as matrices, and biases as vectors, then we get the above flattened view of a neural network which is just a sequence of vector function applications. That is functions that take vectors as input, do some transformation on them, and then output other vectors.

In the image above, each line represents a function, which can be either a matrix multiplication plus a bias vector, or an activation function. And the circles represent the vectors on which these functions operate. For example, we start with the input vector, then we feed it into the first function, which computes linear combinations of its components, then we obtain another vector as output. This last vector we feed as input to the activation function, and so on until we get to the last function in our sequence. The output of this last function will be the predicted value of our network.

This representation is also closer to what we actually have to implement. In order to implement a neural network we don’t actually need to store that graph that is usually represented in images. All a neural network is doing for a living is matrix operations, both for making predictions and for training.

We have seen how a neural network gets its output, which we are interested in, it just passes its input vector through a sequence of functions. But these functions depend on some parameters: the weights and biases.

How do we actually learn those parameters in order to obtain good predictions?

Well, let’s recall what a neural network actually is: it’s just a function, a big function composed of smaller ones that are applied in sequence. This function has a set of parameters that, because at first, we have no idea what they should be, we just initialize them randomly. So, at first, our network will give us just random values.

How we can improve them? Before attempting to improve them we first need a way of evaluating the performance of our network. How we are supposed to improve the performance of our model if we don’t have a way to measure how good or how bad is it doing? For that, we need to come up with a function that takes as parameters the predictions of our network and the true labels in our dataset, and to give us a number which represents the performance of our network.

Then we can turn the learning problem into an optimization problem of finding the minimum or maximum of this function. In the machine learning community, this function usually measures how bad our predictions are, hence it is named a loss function. And our problem is to find the parameters of our network that minimizes this loss function.

Stochastic gradient descent

You may be familiar with the problem of finding the minimum of a function from your calculus class. There you usually take the gradient of your function, set it equal to 0, find all the solutions (also called critical points), and then choose among them the one that gives your function the smallest value. And that’s the global minimum.

Can we do the exact same thing in minimizing our loss function? Not really. The problem is that the loss function of a neural network is not as nice and compact as those you usually find in calculus textbooks. It is a very complicated function with thousands, hundreds of thousands, or even millions of parameters. It may be even impossible to find a closed-form solution to this problem. This problem is usually approached by iterative methods, methods that don’t try to find a direct solution, but instead, start with a random solution and try to improve it a little bit in each iteration. Eventually, after a large number of iterations, we will get a pretty good solution.

One such iterative method is gradient descent. As you may know, the gradient of a function gives us the direction of the steepest ascent, and if we take the negative of the gradient it will give us the direction of the steepest descent, that is the direction in which we can get the fastest towards a minimum.

So, at each iteration, also called an epoch, we compute the gradient of the loss function and subtract it (multiplied by a factor called learning rate) from the old parameters in order to get the new parameters of our network.

Where theta represents a vector containing all the network’s parameters.

In the standard gradient descent method, the gradient is computed taking into account the whole dataset. Usually, this is not desirable as it may be computationally expensive. In practice, the dataset is randomly divided into more chunks called batches, and an update is made for each one of these batches. This is called stochastic gradient descent.

The above update rule takes into account at each step only the gradient evaluated at the current position. In this way, the trajectory of the point that moves on the surface of the loss function is sensitive to any perturbation. Sometimes we may want to make this trajectory more robust. For that, we use a concept inspired by physics: momentum.

The idea is that when we do the update to also take into consideration previous updates, that accumulates into a variable Δθ. If more updates are done in the same direction, then we will go “faster” into that direction and won’t change our trajectory by any small perturbation. Think about this like velocity.

Where α is a non-negative factor that determines the contribution of past gradients. When it is 0 we simply don’t use momentum.


How do we actually compute the gradient? Recall that a neural network, and thus the loss function, is just a composition of functions. How we can compute partial derivatives of composite functions? Using chain rule. Let’s look at the following image:

If we want to compute the partial derivatives of the loss w.r.t. (with respect to) the weights of the first layer: we take the derivative of the first linear combination w.r.t. the weights, then we multiply with the derivative of the next function (the activation function) w.r.t. the output from the previous function, and so on until we multiply with the derivative of the loss w.r.t. the last activation function. What if we want to compute the derivative w.r.t. the weights of the second layer? We have to do the same process, but this time we start with the derivative of the second linear combination function w.r.t. its weights, and after that the rest of the terms that we have to multiply with were also present when we computed the derivative of the weights of the first layer. So, instead of computing these terms over and over again, we will go backwards, hence the name backpropagation.

We will first start by computing the derivative of the loss w.r.t. the output of our network, and then propagate these derivatives backward towards the first layer by maintaining a running product of derivatives. Note that there are 2 kinds of derivatives that we take: one in which we compute the derivative of a function w.r.t. its input. These we multiply to our product of derivatives and have the purpose to keep track of the error of the network from its output to the current point in which we are in the algorithm. The second kind of derivatives is the one that we take w.r.t. the parameters that we want to optimize. These we don’t multiply with the rest of our product of derivatives, instead, we store them as part of the gradient that we will use later to update the parameters.

So, while backpropagation, when we encounter functions that don’t have learnable parameters (like activation functions) we take derivatives only of the first kind, just to propagate the errors backward. But, when we encounter functions that do have learnable parameters (like the linear combinations, we have there the weights and biases that we want to learn) we take derivatives of both kinds: the first one w.r.t. its input for error propagation, and the second one w.r.t. its weights and biases, and store them as part of the gradient. We do this process starting from the loss function and until we get to the first layer where we don’t have any learnable parameters that we want to add to the gradient. This is the backpropagation algorithm.


Before moving next, I want to talk a little bit about overcapacity in neural networks. Recall our previous example:

As you can see the minimum number of neurons in the hidden layer that are needed to solve this classification task is 2, one for each of the 2 lines above. In neural networks it is a good idea to add some overcapacity, that is to add more neurons and/or layers than the minimum needed to solve that particular problem. That is because adding more parameters will make the optimization problem easier. In the example above, if we use only 2 hidden neurons we will need each one of them to learn a line that is “almost perfect” in order for the final result to be good. But if we give more freedom to our network and add more neurons in the hidden layer, they will not need to be perfect. We will be able to get good results if most of them will do at least something useful to our classification task. And then, we can think of the last neuron as averaging the decision boundary of each them.

Another way to think about why adding more parameters may make the optimization task easier is by imagining what happens with the loss function’s “surface” when we add more parameters. In the 2d case there are only 2 directions in which we can move our (single) parameter, and if in both of these directions the height of the loss function is bigger than at the current point, then we are stuck in a local minimum, and we may miss completely the global minimum. Now, let’s imagine that we add one more parameter, and thus our plot becomes a surface in 3d space. In this case there are an infinite number of directions in which we could go, and there are more chances that among all of them to find at least one that will make us go downwards to a lower point on the loss surface. So, there are fewer chances to get stuck into a local minimum (at least compared with the 2d case).

It is always better to increase the number of parameters of our network? No. Having too many parameters will make our network more likely to overfit. So, there is a trade-off between making the optimization problem easier and keeping the network away from overfitting.

Softmax activation & Cross entropy loss

A commonly used activation function for the last layer in a classification task is the softmax function.

The softmax function transforms its input vector into a probability distribution. If you look above you can see that the elements of the softmax’s output vector have the properties that they are all positive, and their sum is 1. When we use softmax activation we create in the last layer as many nodes as the number of classes in our dataset, and the softmax activation will give us a probability distribution over the possible classes. So, the output of the network will give use the probability that the input vector belongs to each one of the possible classes, and we choose the class that has the highest probability and report that as the prediction of our network.

When softmax is used as activation of the output layer, we usually use as loss function the cross-entropy loss. The cross-entropy loss measures how similar are 2 probability distributions. We can represent the true label of our input x as a probability distribution: one in which we have a probability of 1 for the true class label and 0 for the other class labels. This representation of labels is also called one-hot encoding. Then we use cross-entropy to measure how close is the predicted probability distribution of our network to the true one.

Where y is the one-hot encoding of the true label, y hat is the predicted probability distribution, and yi, yi hat are elements of those vectors.

If the predicted probability distribution is close to the one-hot encoding of the true labels, then the loss would be close to 0. Otherwise, if they are very different, the loss can potentially grow to infinity.

Mean squared error loss

So, far we focused on classification tasks. But, neural networks can easily be adapted to regression tasks by just using an appropriate loss function. For example, if instead of class labels as ground truth, we have a list of numbers that we want to approximate we can use mean squared error (MSE for short) loss. Usually, when we use MSE loss, we use identity activation in the last layer.

Implementing this in python

Now it is time for actually implementing a neural network in python, and then to test it on a few examples to see how well it performs.

I think that the best way to really understand how a neural network works is to implement one from scratch. I challenge you to implement it by your own. And when you’re done come back here and compare with my code and do the examples that I did at the end of this article to see if you get similar results. If you get stuck at some point feel free to come here to see how I did it.

I will create a NeuralNetwork class and I want to design it in such a way to be more flexible. I don’t want to hardcode in it specific activation or loss functions, or optimizers (that is SGD, Adam, or other gradient-based methods). I will design it to receive these from outside the class so that one can just take the class’s code and pass to it whatever activation/loss/optimizer he wants. So, I will implement activation and loss functions, and optimizer class that we want to use here as separate things from the NeuralNetwork class. And we need both activation/loss functions and their derivatives.

In order to allow batch sizes greater than 1, our activation and loss functions should handle matrix input. Rows in these matrices will represent samples of data, and the columns will be features. Our network will allow for 2 kinds of activation functions: for hidden layers and for the output layer. The hidden layer activations should operate on their input vectors element-wise, and thus their derivatives will also be element-wise, returning one vector for each sample. But the output activation allows for each element in the output vector to be computed based on all the elements in the input vector. That is in order to be able to use softmax activation. Because of this, their derivatives needs to return a Jacobian matrix (a matrix consisting of the partial derivatives of each output function w.r.t. each of the input component; you can read more here) for each sample.

Here we will use only ReLU as hidden activation; identity and softmax will be used as output activations.

We used the EPS variable, which is the smallest positive representable number of float64 type, in order to avoid division by 0. In order to avoid overflow errors in softmax function, we subtracted the maximum of each sample from the input. We are allowed to do that because it doesn’t change the output of the function as it has the same effect as dividing both terms of that fraction by the same amount.

The loss functions should take as input 2 matrices: the predicted y and true y, both of them of the same form as in the activation functions. These loss functions should output a single number for each sample. Their derivatives should output a row-vector for each sample, all of them stacked into an array of dimension 3. This output shape is required in order to be able to use NumPy’s matmul() function to multiply with the derivative of output activation. Note the use of expand_dims() function below which is used to return the required shape.

Here we will use only stochastic gradient descent with momentum as an optimization method, but there are more gradient-based methods out there. Some popular choices are Adam, RMSprop, Adagrad. In order to allow our neural network class to work with all of these, we will implement the optimizer as a separate class with a .update(old_params, gradient) method that returns the updated parameters. The neural network class will receive an optimizer as a parameter. So, someone who wants to use other optimization methods can create a class with the required interface and pass it to our neural network class when instantiating.

Below is the SGD + momentum optimizer:

To convert class labels in classification tasks to one-hot encoding we will use the to_categorical() utility function:

Now, let’s start with the code of the NeuralNetwork class. The instantiation method expects the following parameters:

  • layers: a list consisting of the number of nodes in each layer (including input and output layers)
    e.g.: [5, 10, 2] means 5 inputs, 10 nodes in hidden layer, and 2 output nodes
  • hidden_activation: activation of hidden layers; a tuple of form (activation_function, its_derivative)
    This activation function and its derivative should perform their task element-wise on the input array
    e.g.: (relu, d_relu)
  • output_activation: activation of output layer; a tuple of form (activation_function, its_derivative)
    This activation function takes as input an array of shape (n, m); n samples, m neurons in output layer; and returns an array of shape (n, m); each element on a row in the output array is the output of a function of all the elements on that row in the input array.
    Its derivative takes as input an array similar to the one taken by the activation, but it returns an array of shape (n, m, m) which is a stack of Jacobian matrices, one for each sample.
  • loss: a tuple of form (loss_function, its_derivative)
    The loss function takes as input two arrays (predicted y and true y) of shape (n, m); n samples, m neurons in output layer; and returns an array of shape (n, 1), whose elements are the loss for each sample.
    Its derivative takes as input an array of shape (n, m) and returns one of shape (n, 1, m) which is a stack of row-vectors consisting of the derivatives w.r.t. each one of the m input variable.
    e.g.: (categorical_crossentropy, d_categorical_crossentropy)
  • optimizer: an object with a method .update(old_params, gradient) that returns the new params
    e.g.: SGD()

Then, it initializes its weights and biases using a variant of the Xavier initialization method. That is, we draw the weights and biases from a normal distribution with mean 0 and standard deviation of:

where fan_in and fan_out are the number of nodes in the previous layer, respectively the number of neurons in the next layer. The number of rows in weights matrices match the number of nodes in the previous layer, and the number of columns match the number of nodes in the next layer. The biases are row vectors with the number of elements matching the number of nodes in the next layer.

To easily do the parameters update procedure we will create a .__flatten_params(weights, biases) method that transforms the list of weight matrices, and bias vectors received as input, to a flattened vector. We will also need a .__restore_params(params) method that turns a flattened vector of parameters back into lists of weights and biases. Note that the 2 underscores in front of the method name just means that the method is private in OOP terms. This just means that the method should be used only from inside the class.

The .__forward(x) method passes the input array x through the network and while it does so, it keeps track of the input and output arrays to and from each layer. Then it returns this as a list of which the i-th element is a list of form [input to layer i, output of layer i]. We will need those arrays to compute the derivatives in the backward pass.

The .__backward(io_arrays, y_true) method computes the gradient. It takes as input a list of the form returned by .__forward(x) method, and an array with the ground truth y. It computes the gradient of weights and biases using the backpropagation algorithm as described earlier in this article. Then it returns a tuple (d_weights, d_biases).

The method that actually orchestrates all the training is .fit(x, y, batch_size, epochs, categorical), where:

  • x is the input data
  • y is the ground truth
  • batch_size is the size of a batch of data
  • epochs is the number of iteration through all the input data
  • categorical is a optional parameter that, when set to true will convert y to one-hot encoding

For each batch of data, it uses .__forward() and .__backward() methods to compute the gradient then flattens the current parameters of the network and the gradient using .__flatten_params(). After that, it computes the new parameters using self.optimizer.update(), then restores that returned vector to the right format with __restore_params() and assigns that to self.weights, self.biases. At the end of each batch, the progress and average loss are printed. A list of all the loss values at the end of each epoch is maintained and returned.

By default, the .predict() method will return the exact values that are in the output nodes after the input x is passed through the network. If labels parameter is set to true, then the predicted labels are returned; this is probably what you want in a classification problem.

The .score() method returns by default the average loss. If accuracy is set to true, then the accuracy will be returned. Note that in a classification problem, if you want the loss then y should be provided in one-hot encoding format, otherwise, if you want accuracy to be returned then y should be just regular class labels.

Finally, we want to be able to save the parameters locally, so that we don’t have to train our models each time we want to make a prediction. Note that the below methods are able to save and load just the weights and biases, but not the whole information about the layers, activations, loss function, and optimizer. So, you should also save the code used to instantiate the neural network.

Below is the full code:


Below we will show 2 examples in which we use our NeuralNetwork class that we just implemented.

The first example consists of classifying images of hand-written digits from the MNIST database. This dataset consists of 60,000 training and 10,000 testing grayscale images of 28 x 28 pixels.

We use the mnist package (pip install mnist) to load this dataset.

Then we construct a neural network and train it for 100 epochs.

Let’s plot the training loss.

Let’s look at the accuracy we obtained on both train and test set:

And we got 99.8% train and 95.9% test accuracy. That’s pretty good for our home-made neural network.

Now we move to the second example in which we use our neural network for a regression problem.
We will use this time the California housing dataset that comes with sklearn package. The dataset consists of 20640 samples of 8 predictive attributes and the target variable which is ln(median house value). This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data.

This time we will use mean squared error as loss function.

Let’s plot the loss during training.

Let’s see what is our final loss value for train and test sets.

For both train and test we got a loss of about 0.36. Note that the target variable is on a logarithmic scale. So, the interpretation of the mean squared error is a little not intuitive here. We usually say that the predicted values are off on average by +/- square root of MSE. Now, in our case, the median house values predicted by our network are off on average by a factor (instead of +/- we have multiplication/division) of e to the square root of MSE (in our case this factor is about 1.83).


[1] Cybenko, G.V. (2006). “Approximation by Superpositions of a Sigmoidal function”. In van Schuppen, Jan H. (ed.). Mathematics of Control, Signals, and Systems. Springer International. pp. 303–314.

You can find the notebook and python files on Github here.

I hope you found this information useful and thanks for reading!

This article is also posted on Medium here. Feel free to have a look!


Passionate about Data Science, AI, Programming & Math

0 0 votes
Article Rating
Notify of
Inline Feedbacks
View all comments
Would love your thoughts, please comment.x