The math behind Linear Regression explained in detail
Let’s say you’re looking to buy a new PC from an online store (and you’re most interested in how much RAM it has) and you see on their first page some PCs with 4GB at $100, then some with 16 GB at $1000. Your budget is $500. So, you estimate in your head that given the prices you saw so far, a PC with 8 GB RAM should be around $400. This will fit your budget and decide to buy one such PC with 8 GB RAM.
This kind of estimations can happen almost automatically in your head without knowing it’s called linear regression and without explicitly computing a regression equation in your head (in our case: y = 75x – 200).
So, what is linear regression?
I will attempt to answer this question simply:
Linear regression is just the process of estimating an unknown quantity based on some known ones (this is the regression part) with the condition that the unknown quantity can be obtained from the known ones by using only 2 operations: scalar multiplication and addition (this is the linear part). We multiply each known quantity by some number, and then we add all those terms to obtain an estimate of the unknown one.
It may seem a little complicated when it is described in its formal mathematical way or code, but, in fact, the simple process of estimation as described above you probably already knew way before even hearing about machine learning. Just that you didn’t know that it is called linear regression.
Now, let’s dive into the math behind linear regression.
In linear regression, we obtain an estimate of the unknown variable (denoted by y; the output of our model) by computing a weighted sum of our known variables (denoted by xᵢ; the inputs) to which we add a bias term.
Where n is the number of data points we have.
Adding a bias is the same thing as imagining we have an extra input variable that’s always 1 and using only the weights. We will consider this case to make the math notation a little easier.
Where x₀ is always 1, and w₀ is our previous b.
To make the notation a little easier, we will transition from the above sum notation to matrix notation. The weighted sum in the equation above is equivalent to the multiplication of a row-vector of all the input variables with a column-vector of all the weights. That is:
The equation above is for just one data point. If we want to compute the outputs of more data points at once, we can concatenate the input rows into one matrix which we will denote by X. The weights vector will remain the same for all those different input rows and we will denote it by w. Now y will be used to denote a column-vector with all the outputs instead of just a single value. This new equation, the matrix form, is given below:
Given an input matrix X and a weights vector w, we can easily get a value for y using the formula above. The input values are assumed to be known, or at least to be easy to obtain.
But the problem is: How do we obtain the weights vector?
We will learn them from examples. To learn the weights, we need a dataset in which we know both x and y values, and based on those we will find the weights vector.
If our data points are the minimum required to define our regression line (one more than the number of inputs), then we can simply solve equation (1) for w:
We call this thing a regression line, but actually, it is a line only for 1 input. For 2 inputs it will be a plane, for 3 inputs it will be some kind of “3D plane”, and so on.
Most of the time the requirement for the solution above will not hold. Most of the time, our data points will not perfectly fit a line. There will be some random noise around our regression line, and we will not be able to obtain an exact solution for w. However, we will try to obtain the best possible solution for w so that the error is minimal.
If equation (1) doesn’t have a solution, this means that y doesn’t belong to the column space of X. So, instead of y, we will use the projection of y onto the column space of X. This is the closest vector to y that also belongs to the column space of X. If we multiply (on the left) both sides of eq. (1) by the transpose of X, we will get an equation in which this projection is considered. You can find out more about the linear algebra approach of solving this problem in this lecture by Gilbert Strang from MIT.
Although this solution requires fewer restrictions on X than our previous one, there are some cases in which it still doesn’t work; we will see more about this issue below.
Another way to get a solution for w is by using calculus. The main idea is to define an error function, then use calculus to find the weights that minimize this error function.
We will define a function f that takes as input a weights vector and gives us the squared error these weights will generate on our linear regression problem. This function simply looks at the difference between each true y from our dataset and the estimated y of the regression model. Then squares all these differences and adds them up. In matrix notation, this function can be written as:
If this function has a minimum, it should be at one of the critical points (the points where the gradient ∇f is 0). So, let’s find the critical points. If you’re not familiar with matrix differentiation, you can have a look at this Wikipedia article.
We start by computing the gradient:
Then we set it equal to 0, and solve for w:
We got one critical point. Now we should figure out if it is a minimum or maximum point. To do so, we will compute the Hessian matrix and establish the convexity/concavity of the function f.
Now, what can we observe about H? If we take any real-valued vector z and multiply it on both sides of H, we will get:
Because f is a convex function, this means that our above-found solution for w is a minimum point and that’s exactly what we were looking for.
As you probably noticed, we got the same solution for w by using both the previous linear algebra approach and this calculus way of finding the weights. We can think of it as either the solution of the matrix equation when we replace y by the projection of y onto the column space of X or the point that minimizes the sum of squared errors.
Does this solution always work? No.
It is less restrictive than the trivial solution: w = X⁻¹ y in which we need X to be a square non-singular matrix, but it still needs some conditions to hold. We need Xᵀ X to be invertible, and for that X needs to have full column rank; that is, all its columns to be linearly independent. This condition is typically met if we have more rows than columns. But if we have fewer data examples than input variables, this condition cannot be true.
This requirement that X has full column rank is closely related to the convexity of f. If you look above at the little proof that f is convex, you can notice that, if X has full column rank, then X z cannot be the zero vector (assuming z ≠ 0), and this implies that H is positive definite, hence f is strictly convex. If f is strictly convex it can have only one minimum point, and this explains why this is the case in which we can have a closed-form solution.
On the other hand, if X doesn’t have full column rank, then there will be some z ≠ 0 for which X z = 0, and therefore f is non-strictly convex. This means that f may not have a single minimum point, but a valley of minimum points which are equally good, and our closed-form solution is not able to capture all of them. Visually, the case of a not full column rank X looks something like this in 3D:
A method that will give us a solution even in this scenario is Stochastic Gradient Descent (SGD). This is an iterative method that starts at a random point on the surface of the error function f, and then, at each iteration, it goes in the negative direction of the gradient ∇f towards the bottom of the valley.
This method will always give us a result (even if sometimes it requires a large number of iterations to get to the bottom); it doesn’t need any condition on X.
Also, to be more efficient computationally, it doesn’t use all the data at once. Our data matrix X is split vertically into batches. At each iteration, an update is done based only on one such batch.
In the case of not full column rank X, the solution will not be unique; among all those points in the “minimum valley”, SGD will give us only one that depends on the random initialization and the randomization of the batches.
SGD is a more general method that is not tied only to linear regression; it is also used in more complex machine learning algorithms like neural networks. But an advantage that we have here, in the case of least-squares linear regression, is that, due to the convexity of the error function, SGD cannot get stuck into local minima, which is often the case in neural networks. When this method will reach a minimum, it will be a global one. Below is a brief sketch of this algorithm:
Where α is a constant called learning rate.
Now, if we plug in the gradient as computed above in this article, we get the following which is specifically for least-squares linear regression:
And that’s it for now. In the next couple of articles, I will also show how to implement linear regression using some numerical libraries like NumPy, TensorFlow, and PyTorch.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!