The math of this method explained in detail
What is logistic regression? Logistic regression adapts linear regression to the special case where there are only 2 possible outputs: 0 or 1. It is most commonly applied to classification problems where 0 and 1 represent two different classes that we want to distinguish.
Linear regression outputs a real number that ranges from -∞ to +∞. We can use this output directly even in the 0/1 classification problem: if we get a value >= 0.5 we report class label 1, and if the output is < 0.5 we report class label 0.
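Written out, with z as an assumed name for the raw regression output and ŷ for the reported label, the rule is:

```latex
z = w^\top x, \qquad
\hat{y} = \begin{cases} 1 & \text{if } z \ge 0.5 \\ 0 & \text{if } z < 0.5 \end{cases}
```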
Where x is a vector of features (plus a component with constant 1 for bias) of one observation, and w is the weights vector.
However, we can get slightly better results, both in terms of accuracy and interpretability, if we squash the regression line into an “S”-shaped curve between 0 and 1. We do this by applying the sigmoid function to the output value of a linear regression model.
Below is the sigmoid function:
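In its standard form, with z standing for any real-valued input (the symbol is an assumption here):

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
```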
More exactly, we compute the output as follows: take the weighted sum of the inputs, and then pass this resulting number into the sigmoid function and report the sigmoid’s output as the output of our logistic regression model.
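As a minimal sketch of this two-step computation in plain Python (the function names, weights, and inputs below are made up for illustration):

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    # Two branches so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    """Weighted sum of the inputs, passed through the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

w = [0.5, -1.0, 2.0]  # hypothetical weights; the first one acts as the bias
x = [1.0, 3.0, 2.0]   # constant 1 for the bias, followed by two features
p = predict_proba(w, x)  # weighted sum is 1.5, so p is about 0.818
```

The weighted sum here is 0.5 - 3.0 + 4.0 = 1.5, and the sigmoid maps it to a value a little above 0.8.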
This procedure helps us both in getting slightly better accuracy and in the interpretability of the output. If our model outputs an arbitrary real number like -5 or 7, what do these numbers actually mean? What can we tell about our two classes, 0 and 1?
But, when we have outputs between 0 and 1, we can interpret them as probabilities. The output of a logistic regression model is the probability of our input belonging to the class labeled with 1. And the complement of our model’s output is the probability of our input belonging to the class labeled with 0.
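In symbols, writing ŷ for the model's output (the notation is assumed here):

```latex
P(y = 1 \mid x) = \hat{y}, \qquad P(y = 0 \mid x) = 1 - \hat{y}
```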
Where y is the true class label of the input x.
OK. So, by now we have seen how a logistic regression model obtains its outputs, given the input. But what about its weights? What weights should it have to make good predictions?
Our model needs to learn those weights. It learns them by being given an objective function and then finding the weights that minimize or maximize this objective.
There are many ways we can come up with an objective function, especially if we consider adding regularization terms to our objective.
In this article, we’ll explore only 2 such objective functions.
First, let’s write our logistic regression model as follows:
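In matrix form, with the sigmoid σ applied element-wise (a standard way to write it, assumed here):

```latex
\hat{y} = \sigma(Xw)
```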
Where X is a matrix that contains all our observations as rows, and whose columns represent the features. y hat is the output of our model: a vector that contains the prediction for each observation.
Let’s rewrite our logistic regression equation in the following way:
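One way to carry out the rewrite, solving the model equation for Xw step by step (a sketch in the notation assumed above, with ŷ the vector of outputs):

```latex
\begin{aligned}
\hat{y} &= \frac{1}{1 + e^{-Xw}} \\
\frac{1}{\hat{y}} &= 1 + e^{-Xw} \\
e^{-Xw} &= \frac{1 - \hat{y}}{\hat{y}} \\
Xw &= \ln\frac{\hat{y}}{1 - \hat{y}}
\end{aligned}
```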
The operations on the right-hand side of the last line are element-wise.
What do you observe in the last line above? If we apply the function on the right-hand side of the last equation to the labels and treat its output as the new labels, we obtain a linear regression problem. So, we can use the sum of squared errors as a loss function and find the weights that minimize it. We can find the weights by using either a closed-form formula or SGD (stochastic gradient descent), both of which you can read more about in the following article on linear regression:
Below are the closed-form solution and the gradient of the loss (that we can use in the SGD algorithm) for linear regression:
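In their standard form, taking the loss to be the sum of squared errors ‖Xw − y‖² (the factor in front of the gradient is an assumption, since it depends on how the loss is scaled):

```latex
w = (X^\top X)^{-1} X^\top y, \qquad
\nabla_w \lVert Xw - y \rVert^2 = 2\, X^\top (Xw - y)
```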
For logistic regression we just need to replace the y in these 2 equations above with the right-hand side of the previous equation:
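With that substitution, and the same assumed scaling of the squared-error loss, the two formulas become:

```latex
w = (X^\top X)^{-1} X^\top \ln\frac{\hat{y}}{1 - \hat{y}}, \qquad
\nabla_w = 2\, X^\top \left( Xw - \ln\frac{\hat{y}}{1 - \hat{y}} \right)
```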
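When we apply these formulas, the true labels take the place of y hat. Note that the logarithm diverges for labels of exactly 0 or 1, so in practice the labels have to be nudged slightly inside the interval, for example to ε and 1 - ε for some small ε.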
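A minimal NumPy sketch of this approach follows. The clipping constant `eps`, the function name, and the toy data are assumptions made for illustration; the clipping is needed because labels of exactly 0 or 1 have infinite log-odds.

```python
import numpy as np

def fit_logit_least_squares(X, y, eps=1e-3):
    """Fit weights by linear regression on the log-odds of clipped labels."""
    # Move labels from {0, 1} to {eps, 1 - eps} so the logarithm stays finite.
    y_clipped = np.clip(y, eps, 1.0 - eps)
    targets = np.log(y_clipped / (1.0 - y_clipped))  # element-wise log-odds
    # Least-squares solution of X w = targets; it matches the closed-form
    # formula whenever X^T X is invertible.
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

# Toy data: a bias column of ones plus a single feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = fit_logit_least_squares(X, y)
probs = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid of the fitted linear model
labels = (probs >= 0.5).astype(int)     # threshold the probabilities at 0.5
```

The choice of `eps` changes the regression targets and therefore the fitted weights, which is one reason this method is cruder than fitting the probabilities directly.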
So, we can treat logistic regression as a form of linear regression and use the tools of linear regression to solve logistic regression. OK. What can we do besides that?
We can take advantage of the properties of logistic regression to come up with a slightly better method. What type of output does logistic regression have? A probability.
A convenient method to apply when probabilities are involved is Maximum Likelihood Estimation (MLE). We will find the weights of our model that maximize the likelihood of the labels given the inputs.
We start by writing the likelihood function. The likelihood is just the joint probability of labels given the inputs, which, if we assume observations to be independent, can be written as the product of the probabilities for each observation.
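Under that independence assumption, and writing the likelihood as a calligraphic L (notation assumed here), this reads:

```latex
\mathcal{L} = P(y \mid X; w) = \prod_{i=1}^{m} P(y_i \mid x_i; w)
```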
Where m is the number of observations.
The likelihood is a function of everything: inputs x, true labels y, and weights w. But for our purposes here (maximizing it with respect to w) we will treat it as a function of w alone, and consider x and y to be given constants that we cannot change.
Each one of the individual probabilities has one of the following values, depending on yi being 0 or 1:
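With ŷᵢ denoting the model's output for observation i (notation assumed here), the two cases are:

```latex
P(y_i \mid x_i; w) =
\begin{cases}
\hat{y}_i & \text{if } y_i = 1 \\
1 - \hat{y}_i & \text{if } y_i = 0
\end{cases}
```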
A more compact way of writing this is:
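The usual trick is to use the label as an exponent, so that exactly one of the two factors survives (shown here in the assumed ŷᵢ notation):

```latex
P(y_i \mid x_i; w) = \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{1 - y_i}
```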
Now, we replace this quantity in the likelihood function, simplify the argmax of it, and transition to matrix notation:
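A sketch of those steps, in the assumed ŷ notation: taking the logarithm does not move the argmax because the logarithm is monotonic, and flipping the sign turns the maximization into a minimization.

```latex
\begin{aligned}
w^{*} &= \arg\max_{w} \prod_{i=1}^{m} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i} \\
&= \arg\max_{w} \sum_{i=1}^{m} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right] \\
&= \arg\min_{w} \; -\left[ y^\top \ln \hat{y} + (1 - y)^\top \ln(1 - \hat{y}) \right]
\end{aligned}
```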
As you can see above, maximizing the likelihood w.r.t. the weights is the same thing as minimizing the quantity on the last line. This time there is no closed-form solution in general, so the best thing we can do is to compute the gradient of this quantity:
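One way to write that gradient, laid out to match the element-wise description that follows (the exact layout and the name J(w) for the minimized quantity are assumptions), together with its more common compact form:

```latex
\nabla_w J(w)
= \left( \frac{(1 - y)\, e^{Xw} - y}{1 + e^{Xw}} \cdot X \right)^{\top} \mathbf{1}
= X^\top (\hat{y} - y)
```

The two forms agree because σ(z) − y = ((1 − y)eᶻ − y) / (1 + eᶻ) for every element.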
Here, the operations involved in the fraction above are element-wise. The dot before X means “multiply the column vector on the left element-wise with each column of the matrix X”. The 1s above are column vectors with the same shape as y, filled with values of 1.
Now, the above gradient can be used with a gradient-based optimization algorithm (like SGD) to find the optimal weights.
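As a rough illustration, here is a NumPy sketch that minimizes the negative log-likelihood with plain full-batch gradient descent rather than SGD. It uses the compact form of the gradient, Xᵀ(ŷ − y); the function names, learning rate, step count, and toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood of the labels y under weights w."""
    y_hat = sigmoid(X @ w)
    return -(y @ np.log(y_hat) + (1.0 - y) @ np.log(1.0 - y_hat))

def nll_gradient(w, X, y):
    """Gradient of the negative log-likelihood: X^T (y_hat - y)."""
    return X.T @ (sigmoid(X @ w) - y)

def fit_mle(X, y, lr=0.1, steps=500):
    """Full-batch gradient descent on the averaged negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * nll_gradient(w, X, y) / len(y)
    return w

# Toy data: a bias column of ones plus one feature; the classes overlap.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w = fit_mle(X, y)
probs = sigmoid(X @ w)  # predicted probability of class 1 per observation
```

Dividing the gradient by the number of observations only rescales the step size, which keeps the learning rate comparable across dataset sizes.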
Before we’re done, let’s recap a few things that we saw through this article:
- When can we use logistic regression? A: When we have a binary classification problem.
- How does a logistic regression model obtain its output? A: It computes a weighted sum of its inputs, then passes it through the sigmoid function. The output can be interpreted as a probability.
- How can we find the weights of the model? A: We can transform the labels so that we can still use linear regression, or we can use something more suited to the problem, like MLE. MLE tends to give slightly better results.
And that’s it for this article.
In the next couple of articles, I will show how to implement logistic regression in NumPy, TensorFlow, and PyTorch.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!