Implementing a Maximum Likelihood Classifier and using it to predict heart disease
What is this thing about?
The main idea of Maximum Likelihood Classification is to predict the class label y that maximizes the likelihood of our observed data x. We treat x as a random vector and y as a (non-random) parameter on which the distribution of x depends. First, we need to make an assumption about the distribution of x (usually a Gaussian distribution).
Then, learning from our data consists of the following:
- We split our dataset into subsets corresponding to each label y.
- For each subset, we estimate the parameters of our assumed distribution for x using only the data inside that subset.
When making a prediction on a new data vector x:
- We evaluate the PDF of our assumed distribution using our estimated parameters for each label y.
- We return the label y for which the evaluated PDF has the maximum value.
Let’s start with a simple example considering a 1-dimensional input x, and 2 classes: y = 0, y = 1.
Let’s say that after estimating our parameters under both the y = 0 and y = 1 scenarios, we get the 2 PDFs plotted above. The blue one (y = 0) has mean μ = 1 and standard deviation σ = 1; the orange one (y = 1) has μ = −2 and σ = 1.5. Now, if we have a new data point x = −1 and we want to predict the label y, we evaluate both PDFs: f₀(−1) ≈ 0.05; f₁(−1) ≈ 0.21. The larger value is 0.21, which we got for y = 1, so we predict the label y = 1.
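The arithmetic of this 1-dimensional example can be checked with a few lines of Python. This is only a minimal sketch: the helper name gaussian_pdf and the params dictionary are illustrative choices, not part of the original notebook, and the means and standard deviations are the ones quoted in the example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Parameters from the example: label -> (mean, standard deviation)
params = {0: (1.0, 1.0), 1: (-2.0, 1.5)}

x_new = -1.0

# Evaluate the density of x_new under each label's Gaussian
densities = {label: gaussian_pdf(x_new, m, s) for label, (m, s) in params.items()}

# Predict the label whose density is largest
predicted = max(densities, key=densities.get)

print({label: round(d, 2) for label, d in densities.items()})
print("predicted label:", predicted)
```

Running it should reproduce the two rounded density values, 0.05 and 0.21, and select label 1.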
That was just a simple example, but in real-world situations, we will have more input variables that we want to use in order to make predictions. So, we need a Multivariate Gaussian distribution, which has the following PDF:
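In standard notation, the multivariate Gaussian density can be written as below. The symbols d (the number of components of x) and μ (the mean vector) are notation assumed here; Σ is the covariance matrix discussed next.

```latex
f(x \mid \mu, \Sigma) =
  \frac{1}{\sqrt{(2\pi)^{d} \, \det(\Sigma)}}
  \exp\!\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```

The formula involves both det(Σ) under a square root and the inverse Σ⁻¹, which is why the properties of Σ matter.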
For this method to work, the covariance matrix Σ should be positive definite, i.e. it should be symmetric and all of its eigenvalues should be positive. The covariance matrix Σ contains the covariances between all pairs of components of x: Σᵢⱼ = Cov(xᵢ, xⱼ). It is therefore symmetric, since Cov(xᵢ, xⱼ) = Cov(xⱼ, xᵢ), and all we have to check is that all eigenvalues are positive; otherwise, we will show a warning. If there are more observations than variables and the variables are not highly correlated with each other, this condition should be met and Σ should be positive definite.
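One possible way to run this eigenvalue check with NumPy is sketched below. The function name is_positive_definite, the tolerance tol, and the synthetic data are assumptions made for illustration; the second matrix is built from a duplicated column to show what perfectly correlated variables do to the check.

```python
import numpy as np

def is_positive_definite(cov, tol=1e-10):
    """Return True if cov is symmetric and all its eigenvalues exceed tol."""
    cov = np.asarray(cov, dtype=float)
    if not np.allclose(cov, cov.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues
    eigenvalues = np.linalg.eigvalsh(cov)
    return bool(np.all(eigenvalues > tol))

# Synthetic data: 50 observations of 3 independent variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
cov_ok = np.cov(X, rowvar=False)

# Append a copy of the first column: two perfectly correlated variables
X_dup = np.column_stack([X, X[:, 0]])
cov_bad = np.cov(X_dup, rowvar=False)

for name, cov in [("independent columns", cov_ok), ("duplicated column", cov_bad)]:
    if is_positive_definite(cov):
        print(name, "-> positive definite")
    else:
        print("Warning:", name, "-> covariance matrix is not positive definite")
```

A small positive tolerance is used instead of a strict comparison with zero because a singular covariance matrix usually yields eigenvalues that are tiny rather than exactly zero in floating point.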
Now, let’s implement it
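What follows is a minimal sketch of such a classifier, not the notebook's actual code. The class name MLClassifier is taken from this article, while the fit/predict method names, the attribute names, and the synthetic demo data are assumptions. It also compares log-densities rather than densities, which selects the same label because the logarithm is increasing, and it expects X as a 2-dimensional array of shape (observations, variables).

```python
import warnings
import numpy as np

class MLClassifier:
    """Maximum likelihood classifier: one multivariate Gaussian per label."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_, self.covs_ = [], []
        for label in self.classes_:
            X_label = X[y == label]            # subset belonging to this label
            mean = X_label.mean(axis=0)
            cov = np.atleast_2d(np.cov(X_label, rowvar=False))
            if np.any(np.linalg.eigvalsh(cov) <= 0):
                warnings.warn(
                    f"covariance matrix for label {label!r} is not positive definite"
                )
            self.means_.append(mean)
            self.covs_.append(cov)
        return self

    def _log_pdf(self, X, mean, cov):
        """Log of the multivariate Gaussian density for each row of X."""
        d = mean.shape[0]
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        # Solve cov @ s = diff^T instead of forming the inverse explicitly
        solved = np.linalg.solve(cov, diff.T).T
        mahalanobis = np.sum(diff * solved, axis=1)
        return -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahalanobis)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = np.column_stack(
            [self._log_pdf(X, m, c) for m, c in zip(self.means_, self.covs_)]
        )
        # Pick, for each row, the label with the largest log-density
        return self.classes_[np.argmax(scores, axis=1)]

# Demo on synthetic data (not the heart disease dataset)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[-2.0, -2.0], scale=1.5, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

clf = MLClassifier().fit(X, y)
pred = clf.predict(np.array([[1.0, 1.0], [-2.0, -2.0]]))
print(pred)
```

In the demo, the two query points sit at the centers of the two synthetic clusters, so they should be assigned to labels 0 and 1 respectively.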
Using MLClassifier to predict heart disease
For this task, we will use the dataset provided here. The dataset is a CSV file with 303 rows; each row has 13 columns that we can use for prediction and 1 label column. A short description of each field is shown in the table below:
We got 80.33% test accuracy. Although this method is not as accurate as some other methods, I still think it is an interesting way of thinking about the problem, and it gives reasonable results given its simplicity.
The Jupyter notebook can be found here.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!