Exploring the naïvety of Naive Bayes

Somaiya ML Research Association
Jun 16, 2021

ML-DL101 Series: Part 5

Probability and Dice: An eternal love story

Introduction

A machine learning beginner starting from scratch soon hears about all the fancy algorithms such as Support Vector Machines, Random Forests, etc. But a simple yet powerful algorithm hidden in plain sight is often dropped out (pun intended) by many. As you might have guessed by now, we are talking about Naive Bayes. As the name suggests, this classification algorithm is based on the very famous Bayes Theorem of probability. Simply put, Naive Bayes is a probabilistic classifier that belongs to the family of supervised learning algorithms.

Bayes Theorem

Before diving into Naive Bayes, one needs a clear understanding of the Bayes theorem in probability. Simply put, Bayes Theorem relates two conditional probabilities: it expresses the probability of one event occurring given that another event has occurred in terms of the reverse conditional probability and the individual probabilities of the two events. Mathematically, for two events A and B, it can be stated as follows:

Bayes Theorem
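In symbols, Bayes Theorem reads:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}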

The above equation gives the probability of event A given that event B has occurred: it is the probability of B occurring given A, multiplied by the probability of A, and divided by the probability of B. The denominator on the right-hand side can, in turn, be written as a summation over all the mutually exclusive events that can lead to event B occurring. This can be better understood from the formula below.

Bayes Theorem expanded
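With the denominator expanded by the law of total probability, assuming the events A_1, …, A_n partition the sample space, the theorem becomes:

P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j) \, P(A_j)}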

For a more intuitive understanding, one can read the conditional probability P(A|B) as the product of two quantities. The first is the ratio of the chance that B occurs when A has occurred to the chance that B occurs at all; the second is the chance that A occurs on its own. So Bayes Theorem is telling us that the probability of A in the world of B (i.e. a world where B definitely occurs) is just the standalone probability of A, scaled by how much more (or less) likely B becomes once A has occurred.

Independent Events

Independent events, as the name suggests, are events whose occurrences do not depend on one another, i.e. the occurrence of one event does not affect the occurrence of the other. Mathematically speaking, two events A and B are independent if and only if the probability of both of them occurring is the product of the probabilities of each one occurring on its own. Formulating this,

Independent Events
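In symbols, for independent events A and B:

P(A \cap B) = P(A) \cdot P(B)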

Independence also implies that the probability of event A given that event B has occurred is equal to the probability of event A, since the occurrence of either event does not affect the other, i.e. for independent events, P(A|B) = P(A).

Naive Bayes, a Bayesian Model

Now that we are well versed with the basic concepts of probability, we are good to go further. Since Naive Bayes is a Bayesian classification method, it works by finding the probability that a data point belongs to a particular class given its feature vector. It first calculates the probability of each feature value given a particular class and then, treating the features as independent of one another, combines these into a single probability. The result is multiplied by the prior probability of that class. The “naiveness” in Naive Bayes arises from this assumption that every feature is independent of every other; hence, the combined probability is simply the product of the probabilities of the individual features.

The math behind Naive Bayes

It seems that the above paragraph totally disregards the concepts we mentioned earlier, right? Well, actually no. So, let us look at the reasoning and working behind the algorithm. Assuming ‘n’ features, according to the Bayes theorem, the probability of a data point belonging to a class given the ‘n’ feature values is:
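Writing y for the class and x_1, …, x_n for the feature values (our own notation for the expressions that follow):

P(y \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid y) \, P(y)}{P(x_1, \ldots, x_n)}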

Now, for each class, the RHS denominator is going to be the same, i.e.
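P(x_1, \ldots, x_n)

(the evidence term, which does not involve the class at all).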

So we can drop this term, since we are only comparing the probabilities across classes, and multiplying or dividing every class score by the same constant won’t change which one is largest. For a given class, we then get,
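P(y \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid y) \, P(y)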

Since we are assuming that each feature is independent of the others, the joint conditional probability factorises into a product of individual conditional probabilities:
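P(x_1, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)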

Putting this in the original equation, we get,
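P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)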

Now, the RHS of the above expression can be easily calculated with the help of a probability distribution. Since we need the probability of a feature taking a particular value within a particular class, we can compute it once we assume a probability distribution for that feature. This is where we assume that the data follows a Gaussian distribution and use its density formula. Naive Bayes using the Gaussian distribution is termed Gaussian Naive Bayes. Similarly, there are Naive Bayes variants that use the Bernoulli and multinomial distributions, depending on the use case.

Example of a Naive Bayes

Let us understand the working of Naive Bayes in depth with a simple example. For simplicity, let our dataset contain 10 data points with 2 features each. These 10 data points belong to 2 classes, labelled 0 and 1. To further simplify things, we will work with Gaussian Naive Bayes.

Sample Dataset

Now that we have the data points, we calculate the mean and standard deviation of each feature separately for each class; these will be used in the Gaussian probability function. Since there are two classes (0 & 1), each with 2 features, we need to find the mean and standard deviation of 4 different sets. After computing, we get,

Mean and Std. deviation of both the classes

We assume that each feature follows a Gaussian distribution, a.k.a. the normal distribution. We then compute the Gaussian probability of each feature value by the formula,

Gaussian Probability function
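With \mu and \sigma denoting the mean and standard deviation of a feature within a class, the density is:

P(x \mid y) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp\left( -\frac{(x - \mu)^{2}}{2 \sigma^{2}} \right)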

Using the above formula, the Gaussian probability of a sample data point with feature vector (7, 3) is computed for each feature in each of the two classes. The product of the feature probabilities is then multiplied by the class probability. The table below shows the Gaussian probability calculations for the data point (7, 3), followed by the final computed class probabilities.

Gaussian probabilities and Naive Bayes calculation for a data point (7,3)

From the above table, we can infer that the data point with feature vector (7, 3) belongs to class 1 according to Naive Bayes, since the final probability of the data point being in class 1 is 0.0401, which is much higher than the class 0 probability of 0.00000328. This classification can be justified by noting from the dataset that the class 1 data points have feature values close to 7 and 3 respectively. Warning: this analysis is done on a toy dataset, so we have justified the class of this sample data point in this trivial manner purely for the sake of understanding. One should not try these stunts on actual data :)

To sum up, we multiply the product of the Gaussian probabilities of the individual features by the probability of that particular class. In this way, the probability of the data point belonging to each class is computed and compared with the probabilities for the other classes. The data point is then classified into the class with the highest probability.
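For concreteness, here is a minimal Python sketch of this procedure. The per-class means, standard deviations, and priors below are illustrative placeholders rather than the exact values from the table above; only the structure of the computation mirrors the worked example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian probability density of x for a feature with the given mean and std."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)

def predict(point, class_stats, class_priors):
    """Return the class whose prior times the product of per-feature densities is largest."""
    best_class, best_score = None, -1.0
    for label, feature_stats in class_stats.items():
        score = class_priors[label]                       # start from the class prior
        for value, (mean, std) in zip(point, feature_stats):
            score *= gaussian_pdf(value, mean, std)       # multiply in each feature's density
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Illustrative per-class statistics: {class: [(mean, std) for each feature]}.
# These numbers are placeholders, not the ones computed from the article's dataset.
class_stats = {
    0: [(3.0, 1.1), (1.5, 0.9)],
    1: [(7.1, 1.2), (3.0, 1.1)],
}
class_priors = {0: 0.5, 1: 0.5}   # 5 points per class out of 10

print(predict((7, 3), class_stats, class_priors))  # expected to print 1 for these stats
```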

When and why to select Naive Bayes?

One of the major drawbacks, as well as advantages, of Naive Bayes is that it considers each feature of a data point to be independent of the others. This makes it inaccurate where features are correlated or the task and data are complex, such as image classification, where neighbouring pixels are strongly related to one another. Still, Naive Bayes makes a good baseline prediction model for certain tasks, since it relies on straightforward assumptions and works well where correlation is absent and the distribution of the data is close to the one we assumed. Another advantage of Naive Bayes is that, being a simple probabilistic model, it takes very little time to train (rather, it performs a few simple computations), and hence it is fast and can be used on real-time data. Since we know its actual working and intuition, it is also far more interpretable than complex Deep Learning models like neural networks and other ML algorithms. To sum up the applications of Naive Bayes:

  • Naive Bayes performs exceptionally well with categorical inputs as compared to numerical inputs, because it is a probabilistic model.
  • But as much as the above is an advantage, it underperforms badly if a category appears at prediction time that was not present in the training data (the zero-frequency problem). One should avoid Naive Bayes in such cases, or at least apply smoothing.
  • Naive Bayes performs well on multi-class classification. Hence, it is used extensively in NLP applications like sentiment analysis, spam-ham detection, text classification, etc.
  • Naive Bayes is also very useful in real-time NLP applications owing to the fast results it provides. A minimal usage sketch is shown below.
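As a quick way to try Naive Bayes in practice, scikit-learn ships ready-made classifiers (GaussianNB, MultinomialNB, BernoulliNB). A minimal sketch, assuming scikit-learn is installed and using made-up toy data:

```python
from sklearn.naive_bayes import GaussianNB

# Toy data: 6 points, 2 features, 2 classes (illustrative placeholders only).
X = [[1.0, 2.1], [0.9, 1.8], [1.2, 2.4],
     [6.8, 3.1], [7.2, 2.9], [7.0, 3.3]]
y = [0, 0, 0, 1, 1, 1]

clf = GaussianNB()
clf.fit(X, y)

print(clf.predict([[7, 3]]))        # predicted class label
print(clf.predict_proba([[7, 3]]))  # per-class probabilities
```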

Conclusion

Naive Bayes is one of the simplest probability-based machine learning models; it uses the training data to calculate class probabilities for the test data. It simplifies the computations by assuming that the features are independent, which makes it prone to errors but also makes it fast and simple. The code for the Naive Bayes algorithm can be found on SMLRA’s GitHub repository. In one of our workshops, we explained Naive Bayes in depth; the recording can also be found on YouTube.
