## Understand linear regression better & sharpen your NumPy skills

Let’s first briefly recall what linear regression is:

Linear regression estimates an unknown variable as a linear function of some other known variables. Visually, we fit a line (or a hyperplane in higher dimensions) through our data points.

If you’re not comfortable with this concept or want a better grasp of the math behind it, you can read my previous article about linear regression:

Now, let’s focus on implementation.

First, we obviously need to import some libraries: `numpy`, the backbone of our implementation; `matplotlib`, for visualizing our results; and the `make_regression` function from `sklearn`, which we will use to generate an example regression dataset.

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
```

Then we will create a `LinearRegression` class with the following methods:

• `.fit()` — this method will do the actual learning of our linear regression model; here we will find the optimal weights
• `.predict()` — this one will be used for prediction; it will return the output of our linear model
• `.rmse()` — computes the root mean squared error of our model with the given data; this metric is kind of “the average distance from our model’s estimate to the true y value”

The first thing we do inside `.fit()` is to concatenate an extra column of 1’s to our input matrix X. This is to simplify our math and treat the bias as the weight of an extra variable that’s always 1.
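For example, adding the bias column can be done with `np.hstack` (the values below are toy numbers for illustration, not the article’s dataset):

```python
import numpy as np

# toy input: two samples, two features
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# prepend a column of 1's so the bias acts as the weight
# of an extra feature that is always 1
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_b)
# [[1. 2. 3.]
#  [1. 4. 5.]]
```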

The `.fit()` method will be able to learn the parameters using either a closed-form formula or stochastic gradient descent (SGD). To choose between them, we will have a parameter called `method` that expects a string, either `'solve'` or `'sgd'`.

When `method` is set to `'solve'`, we will get the weights of our model from the normal equation:

w = (XᵀX)⁻¹ Xᵀ y

This formula requires the matrix X to have full column rank, so we will check for this and show an error message otherwise.
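On a tiny dataset that lies exactly on a line, this closed-form solution recovers the true coefficients (the data values here are hypothetical):

```python
import numpy as np

# points lying exactly on y = 1 + 2x
x = np.arange(4.0).reshape(-1, 1)
X = np.hstack([np.ones((4, 1)), x])   # bias column + feature
y = 1.0 + 2.0 * x

# full-column-rank check, then w = (X^T X)^-1 X^T y
assert np.linalg.matrix_rank(X) == X.shape[1]
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w.ravel())  # [1. 2.]
```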

The first part of our `.fit()` method is:

Note that the parameters after `method` are optional and are used only when we use SGD.

The second part of this method handles the case of `method = 'sgd'`, which doesn’t require X to have full column rank.

The SGD algorithm for our least squares linear regression is sketched below:

We will start this algorithm by initializing the weights class attribute to a NumPy vector with values drawn from a normal distribution with mean 0 and standard deviation 1/(number of columns). Scaling the standard deviation down by the number of columns keeps the model’s outputs from being too large in the initial stages of the algorithm, which helps it converge faster.

At the beginning of each iteration, we randomly shuffle our rows of data. Then, for each batch, we compute the gradient and subtract it (multiplied by the learning rate) from the current weights vector to obtain the new weights.
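The shuffle-and-update loop described above can be sketched as follows; the learning rate, batch size, and iteration count here are arbitrary illustrative choices, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cols = 200, 3  # n_cols includes the bias column
X = np.hstack([np.ones((n_samples, 1)),
               rng.normal(size=(n_samples, n_cols - 1))])
true_w = np.array([[1.0], [2.0], [-3.0]])
y = X @ true_w + 0.1 * rng.normal(size=(n_samples, 1))

# initialize weights: N(0, 1/n_cols)
w = rng.normal(scale=1.0 / n_cols, size=(n_cols, 1))

learning_rate, batch_size, n_iterations = 0.05, 32, 200
for _ in range(n_iterations):
    perm = rng.permutation(n_samples)            # shuffle rows each iteration
    for start in range(0, n_samples, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # gradient of the batch mean squared error
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
        w -= learning_rate * grad                # step against the gradient
```

After enough passes, `w` lands close to `true_w`, up to the noise in the data.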

Below is the second half of our `.fit()` method:

We return `self` from this method so that we can chain the constructor and `.fit()` calls like this: `lr = LinearRegression().fit(X, y, 'solve')`.

The `.predict()` method is quite straightforward. We first check that `.fit()` was called, then concatenate a column of 1’s to X and verify that the shape of X allows multiplication with the weights vector. If everything is OK, we simply return the product of X and the weights vector as the predictions.

In `.rmse()` we first get the model’s outputs using `.predict()`, and if prediction succeeds, we compute and return the root mean squared error.

Below is the full code of the `LinearRegression` class:
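Putting the pieces together, a minimal self-contained sketch of such a class could look like this; the SGD hyperparameter names `learning_rate`, `iterations`, and `batch_size` are assumptions for illustration, not the article’s exact signatures:

```python
import numpy as np

class LinearRegression:
    """Least-squares linear regression fit either by the normal
    equation ('solve') or by mini-batch SGD ('sgd'). A sketch of
    the approach described above."""

    def _add_ones(self, X):
        # prepend a column of 1's so the bias is just another weight
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y, method, learning_rate=0.01, iterations=100,
            batch_size=32):
        X = self._add_ones(X)
        if method == 'solve':
            if np.linalg.matrix_rank(X) < X.shape[1]:
                raise ValueError("X must have full column rank; "
                                 "try method='sgd' instead")
            # normal equation: w = (X^T X)^-1 X^T y
            self.weights = np.linalg.inv(X.T @ X) @ X.T @ y
        elif method == 'sgd':
            rng = np.random.default_rng()
            # small random init: N(0, 1/n_cols) keeps early outputs modest
            self.weights = rng.normal(scale=1.0 / X.shape[1],
                                      size=(X.shape[1], 1))
            for _ in range(iterations):
                perm = rng.permutation(X.shape[0])  # reshuffle each iteration
                for start in range(0, X.shape[0], batch_size):
                    idx = perm[start:start + batch_size]
                    Xb, yb = X[idx], y[idx]
                    grad = 2.0 / len(idx) * Xb.T @ (Xb @ self.weights - yb)
                    self.weights -= learning_rate * grad
        else:
            raise ValueError("method must be 'solve' or 'sgd'")
        return self  # enables LinearRegression().fit(X, y, 'solve')

    def predict(self, X):
        if not hasattr(self, 'weights'):
            raise RuntimeError("call .fit() before .predict()")
        X = self._add_ones(X)
        if X.shape[1] != self.weights.shape[0]:
            raise ValueError("X has the wrong number of columns")
        return X @ self.weights

    def rmse(self, X, y):
        y_pred = self.predict(X)
        return float(np.sqrt(np.mean((y_pred - y) ** 2)))
```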

### Using our `LinearRegression` class in an example

To show our implementation of linear regression in action, we will generate a regression dataset with the `make_regression()` function from `sklearn`.

```
X, y = make_regression(n_features=1,
                       n_informative=1,
                       bias=1, noise=35)
```

Let’s plot this dataset to see what it looks like:

```
plt.scatter(X, y)
```

The y returned by `make_regression()` is a flat vector. We will reshape it to a column vector to use with our `LinearRegression` class.

```
y = y.reshape((-1, 1))
```

First, we will use `method='solve'` to fit the regression line:

```
lr_solve = LinearRegression().fit(X, y, method='solve')
plt.scatter(X, y)
plt.plot(X, lr_solve.predict(X), color='orange')
```

The root mean squared error of the above regression model is:

```
lr_solve.rmse(X, y)
# 35.59874949855057
```

Then we also use `method='sgd'`, letting the other parameters keep their default values:

```
lr_sgd = LinearRegression().fit(X, y, method='sgd')
plt.scatter(X, y)
plt.plot(X, lr_sgd.predict(X), color='orange')
```

As you can see, the regression lines in the two plots above for methods 'solve' and 'sgd' are almost identical.

The root mean squared error we got when using ‘sgd’ is:

```
lr_sgd.rmse(X, y)
# 36.34038690848635
```

Here is the Jupyter Notebook with all the code:

I hope you found this information useful and thanks for reading!


#### Dorian

Passionate about Data Science, AI, Programming & Math
