Understand better linear regression & sharpen your NumPy skills
Let’s first briefly recall what linear regression is:
Linear regression is estimating an unknown variable in a linear fashion based on some other known variables. Visually, we fit a line (or a hyperplane in higher dimensions) through our data points.
If you’re not comfortable with this concept or want to understand better the math behind it, you can read my previous article about linear regression:
Now, let’s focus on implementation.
Firstly, we need to, obviously, import some libraries. We import
numpy as it is the main thing we use for the implementation,
matplotlib for visualizing our results, and
make_regression function, from
sklearn, which we will be using to generate a regression dataset for using as an example.
import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_regression
Then we will create a
LinearRegression class with the following methods:
.fit()— this method will do the actual learning of our linear regression model; here we will find the optimal weights
.predict()— this one will be used for prediction; it will return the output of our linear model
.rmse()— computes the root mean squared error of our model with the given data; this metric is kind of “the average distance from our model’s estimate to the true y value”
The first thing we do inside
.fit() is to concatenate an extra column of 1’s to our input matrix X. This is to simplify our math and treat the bias as the weight of an extra variable that’s always 1.
.fit() method will be able to learn the parameters by using either closed-form formula or stochastic gradient descent. And to choose which to use, we will have a parameter called
method that will expect a string of either ‘solve’ or ‘sgd’.
method is set to ‘solve’ we will get the weights of our model by the following formula:
which requires the matrix X to have full column rank; so, we will check for this and otherwise we show an error message.
The first part of our
.fit() method is:
Note that the other parameters after
method are optional and are used only in the case we use SGD.
The second part of this method handles the case of
method = ‘sgd’, which doesn’t require that X has full column rank.
The SGD algorithm for our least squares linear regression is sketched below:
We will start this algorithm by initializing the weights class attribute to a numpy vector with values drawn from a normal distribution with mean 0 and standard deviation 1/(number of columns). We divide the standard deviation by the number of columns to make sure we don’t get too big values as output in the initial stages of the algorithm. This is to help us converge faster.
At the beginning of each iteration, we randomly shuffle our rows of data. Then, for each batch, we compute the gradient and subtract it (multiplied by the learning rate) from the current weights vector to obtain the new weights.
Below is the second half of our
self from this method to be able to concatenate the calls of the constructor and
.fit() like this:
lr = LinearRegression().fit(X, y, ‘solve’).
.predict() method is quite straight-forward. We first check if
.fit() was called before, then concatenate a column of 1’s to X and verify that the shape of X allows multiplication with the weights vector. If everything is OK, we simply return the result of the multiplication between X and the weights vector as the predictions.
.rmse() we first get the outputs of the model using
.predict(), then if there were no errors during predict, we compute and return the root mean squared error which can be thought of as “the average distance from our model’s estimate to the true y value”.
Below is the full code of the
LinearRegression class in an example
To show our implementation of linear regression in action, we will generate a regression dataset with the
make_regression() function from
X, y = make_regression(n_features=1, n_informative=1, bias=1, noise=35)
Let’s plot this dataset to see how it looks like:
The y returned by
make_regression() is a flat vector. We will reshape it to a column vector to use with our
y = y.reshape((-1, 1))
Firstly, we will use
method = ‘solve’ to fit the regression line:
lr_solve = LinearRegression().fit(X, y, method='solve') plt.scatter(X, y) plt.plot(X, lr_solve.predict(X), color='orange')
The root mean squared error of the above regression model is:
lr_solve.rmse(X, y) # 35.59874949855057
Then, we also use
method = ‘sgd’ and we will let the other parameters have their default values:
lr_sgd = LinearRegression().fit(X, y, method='sgd') plt.scatter(X, y) plt.plot(X, lr_sgd.predict(X), color='orange')
As you can see, the regression lines in the 2 images above for methods ‘solve’ and ‘sgd’ are almost identical.
The root mean squared error we got when using ‘sgd’ is:
lr_sgd.rmse(X, y) # 36.34038690848635
Here is the Jupyter Notebook with all the code:
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!