Understanding the building block of Neural Networks
The perceptron is the building block of artificial neural networks, it is a simplified model of the biological neurons in our brain. A perceptron is the simplest neural network, one that is comprised of just one neuron. The perceptron algorithm was invented in 1958 by Frank Rosenblatt.
Below is an illustration of a biological neuron:
The majority of the input signal to a neuron is received via the dendrites. There are about 1,000 to 10,000 connections that are formed by other neurons to these dendrites. The signal from the connections, called synapses, propagate through the dendrite into the cell body. The potential increases in the cell body and once it reaches a threshold, the neuron sends a spike along the axon that connects to roughly 100 other neurons through the axon terminal.
The perceptron is a simplified model of the real neuron that attempts to imitate it by the following process: it takes the input signals, let’s call them x1, x2, …, xn, computes a weighted sum z of those inputs, then passes it through a threshold function ϕ and outputs the result.
But having w0 as a threshold is the same thing as adding w0 to the sum as bias and having instead a threshold of 0. That is, we consider an additional input signal x0 that is always set to 1.
Here is represented a perceptron:
To use vector notation, we can put all inputs x0, x1, …, xn, and all weights w0, w1, …, wn into vectors x and w, and output 1 when their dot product is positive and -1 otherwise.
Here is a geometrical representation of this using only 2 inputs x1 and x2, so that we can plot it in 2 dimensions:
As you see above, the decision boundary of a perceptron with 2 inputs is a line. If there were 3 inputs, the decision boundary would be a 2D plane. In general, if we have n inputs the decision boundary will be a n-1 dimensional object called a hyperplane that separates our n-dimensional feature space into 2 parts: one in which the points are classified as positive, and one in which the points are classified as negative(by convention, we will consider points that are exactly on the decision boundary as being negative). Hence the perceptron is a binary classifier that is linear in terms of its weights.
In the image above w’ represents the weights vector without the bias term w0. w’ has the property that it is perpendicular to the decision boundary and points towards the positively classified points. This vector determines the slope of the decision boundary, and the bias term w0 determines the offset of the decision boundary along the w’ axis.
So far we talked about how a perceptron takes a decision based on the input signals and its weights. But how a perceptron actually learns? How to find the right set of parameters w0, w1, …, wn in order to make a good classification?
The perceptron algorithm is an iterative algorithm that is based on the following simple update rule:
Where y is the label (either -1 or +1) of our current data point x, and w is the weights vector.
What does our update rule say?
The dot product x⋅w is just the perceptron’s prediction based on the current weights (its sign is the same as the one of the predicted label). The expression y(x⋅w) can be less than or equal to 0 only if the real label y is different than the predicted label ϕ(x⋅w). So, if there is a mismatch between the true and predicted labels, then we update our weights: w = w+yx; otherwise, we let them as they are.
So, why the w = w + yx update rule works?
It attempts to push the value of y(x⋅w), in the if condition, towards the positive side of 0, and thus classifying x correctly. And if the dataset is linearly separable, by doing this update rule for each point for a certain number of iterations, the weights will eventually converge to a state in which every point is correctly classified. Let’s see what’s the effect of the update rule by reevaluating the if condition after the update:
That is, after the weights update for a particular data point the expression in the if condition should be closer to being positive, and thus correctly classified.
The full perceptron algorithm in pseudocode is here:
Now let’s implement it in Python
We will now implement the perceptron algorithm from scratch in python using only NumPy as an external library for matrix-vector operations. We will implement it as a class that has an interface similar to other classifiers in common machine learning packages like Sci-kit Learn. We will implement for this class 3 methods:
.fit() method will be used for training the perceptron. It expects as the first parameter a 2D numpy array X. The rows of this array are samples from our dataset, and the columns are the features. The second parameter, y, should be a 1D numpy array that contains the labels for each row of data in X. The third parameter, n_iter, is the number of iterations for which we let the algorithm run.
def fit(self, X, y, n_iter=100): n_samples = X.shape n_features = X.shape # Add 1 for the bias term self.weights = np.zeros((n_features+1,)) # Add column of 1s X = np.concatenate([X, np.ones((n_samples, 1))], axis=1) for i in range(n_iter): for j in range(n_samples): if y[j]*np.dot(self.weights, X[j, :]) <= 0: self.weights += y[j]*X[j, :]
.predict() method will be used for predicting labels of new data. It first checks if the weights object attribute exists, if not this means that the perceptron is not trained yet, and we show a warning message and return. The method expects one parameter, X, of the same shape as in the
.fit() method. Then we just do a matrix multiplication between X and the weights and map them to either -1 or +1. We use
np.vectorize() to apply this mapping to all elements in the resulting vector of matrix multiplication.
def predict(self, X): if not hasattr(self, 'weights'): print('The model is not trained yet!') return n_samples = X.shape # Add column of 1s X = np.concatenate([X, np.ones((n_samples, 1))], axis=1) y = np.matmul(X, self.weights) y = np.vectorize(lambda val: 1 if val > 0 else -1)(y) return y
.score() method computes and returns the accuracy of the predictions. It expects as parameters an input matrix X and a labels vector y.
def score(self, X, y): pred_y = self.predict(X) return np.mean(y == pred_y)
Below is the full code:
A few examples
What I want to do now is to show a few visual examples of how the decision boundary converges to a solution.
In order to do so, I will create a few 2-feature classification datasets consisting of 200 samples using Sci-kit Learn’s
datasets.make_circles() functions. This is the code used to create the next 2 datasets:
X, y = make_classification( n_features=2, n_classes=2, n_samples=200, n_redundant=0, n_clusters_per_class=1 )
And the last dataset:
X, y = make_circles(n_samples=200, noise=0.03, factor=0.7)
For each example, I will split the data into 150 for training and 50 for testing. On the left will be shown the training set and on the right the testing set. The decision boundary will be shown on both sides as it converges to a solution. But the decision boundary will be updated based on just the data on the left (training set).
Example 1 — linearly separable
The first dataset that I will show is a linearly separable one. Below is an image of the full dataset:
This is a simple dataset, and our perceptron algorithm will converge to a solution after just 2 iterations through the training set. So, the animation frames will change for each data point. The green point is the one that is currently tested in the algorithm.
On this dataset, the algorithm had correctly classified both the training and testing examples.
Example 2 — Noisy dataset
What if the dataset is not linearly separable? What if the positive and negative examples are mixed up like in the image below?
Well, the perceptron algorithm will not be able to correctly classify all examples, but it will attempt to find a line that best separates them. In this example, our perceptron got a 88% test accuracy. The animation frames below are updated after each iteration through all the training examples.
Example 3 — Non-linear dataset
What about the below dataset?
It is separable, but clearly not linear. So you may think that a perceptron would not be good for this task. But the thing about a perceptron is that it’s decision boundary is linear in terms of the weights, not necessarily in terms of inputs. We can augment our input vectors x so that they contain non-linear functions of the original inputs. For example, in addition to the original inputs x1 and x2 we can add the terms x1 squared, x1 times x2, and x2 squared.
polynomial_features(X, p) function below is able to transform the input matrix X into a matrix that contains as features all the terms of a polynomial of degree p. It makes use of the
polynom() function which computes a list of indices that represent the columns to be multiplied for obtaining the p-order terms.
def polynom(indices_list, indices, a, b, p): indices = [*indices] if p == 0: indices_list.append(indices) return for i in range(a, b): indices.append(i) polynom(indices_list, indices, i, b, p-1) indices = indices[0:-1] def polynomial_features(X, p): n, d = X.shape features =  for i in range(1, p+1): l =  polynom(l, , 0, d, i) for indices in l: x = np.ones((n,)) for idx in indices: x = x * X[:, idx] features.append(x) return np.stack(features, axis=1)
For our example, we will add degree 2 terms as new features in the X matrix.
X = polynomial_features(X, 2)
Now, let’s see what happens during training with this transformed dataset:
Note that for plotting, we used only the original inputs in order to keep it 2D. The decision boundary is still linear in the augmented feature space which is 5D now. But when we plot that decision boundary projected onto the original feature space it has a non-linear shape.
With this method, our perceptron algorithm was able to correctly classify both training and testing examples without any modification of the algorithm itself. All we changed was the dataset.
With this feature augmentation method, we are able to model very complex patterns in our data by using algorithms that were otherwise just linear.
But, this method is not very efficient. Imagine what would happen if we had 1000 input features and we want to augment it with up to 10-degree polynomial terms. Fortunately, this problem can be avoided using something called kernels. But that’s a topic for another article, I don’t want to make this one too long.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!