Understanding Decision Trees
ML-DL 101 Series: Part 1
To start our new “ML-DL 101” series, today we will cover decision trees and random forests.
Introduction:
A decision tree is an algorithm that uses a tree-like model of decisions and their consequences. It is a supervised learning algorithm used for both classification and regression tasks. Essentially, it makes a decision by asking a series of questions about a particular data point, and the data points are then separated based on the answers to those questions. This is easily depicted with the help of the diagram below:
The above is a simple decision tree for buying a car. From it, we can interpret that if the color of the car is red, the buyer looks at the model: if the model is new, i.e. released after 2010, the buyer buys the car; if the model is older, the buyer looks at the mileage and then decides. The same logic applies when the color of the car is not red.
A decision tree works on the features of the data: it splits the data according to those features and then interprets the resulting subsets. Another point to note is that the questions asked about a feature can take more than one form: some are yes/no (Boolean) questions, while others involve numerical values.
Terminologies:
Node: A node is a basic decision-making entity of the tree; the features are tested at the nodes.
Root Node: The node at the very top of the tree is called the root node.
Leaf Node: A node that has no child nodes and holds the final decision is called a leaf node.
Mathematics:
To understand the mathematics behind decision trees, we first need to understand a few related concepts.
Impurity: Impurity is a measure of how mixed a sample is, i.e. how many of the sample's data points differ from one another. It can be explained with the help of the following example.
Suppose we have 3 baskets: the first containing all apples, the second containing half apples and half oranges, and the third containing all oranges. The first and the third baskets each contain only one type of fruit, so their contents are pure and their impurity is low. The second basket, however, contains two types of fruit, so it is not pure and its impurity is high.
Impurity essentially measures the homogeneity of the data points: if the data points are homogeneous, i.e. they share the same characteristics and belong to the same class, they end up in the same node of the tree. Impurity is commonly measured by two quantities:
- Entropy:
Entropy is a measure of randomness: it is the amount of information needed to describe the sample. Its value is high if the elements are dissimilar or random and low if the elements are similar or share the same properties. For a two-class sample (using log base 2), entropy lies between 0 and 1, both included. So in our case of the three baskets, the entropy of the first and the third basket is zero, as each contains only one kind of fruit, while the half-apples, half-oranges basket has the maximum entropy of 1. A basket containing many different fruits in equal proportions would have an even higher entropy (log₂ of the number of fruit types). Mathematically, entropy is defined as:
Entropy = - Σ pᵢ log₂(pᵢ)
where pᵢ is the proportion (probability) of class i in the sample and the sum runs over all n classes.
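To make this concrete, here is a minimal Python sketch of the entropy calculation for our three baskets (the function name entropy and the fruit proportions are purely illustrative):

```python
from math import log2

def entropy(proportions):
    """Entropy of a sample, given the class proportions (which sum to 1)."""
    return sum(-p * log2(p) for p in proportions if p > 0)

print(entropy([1.0]))       # basket 1: all apples                -> 0.0
print(entropy([0.5, 0.5]))  # basket 2: half apples, half oranges -> 1.0 (max for two classes)
print(entropy([0.25] * 4))  # four fruit types in equal parts     -> 2.0 (log2 of 4)
```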
- Gini Index:
The Gini index is a measure of inequality in a sample. Like entropy, a Gini value of 0 indicates that the sample is pure, i.e. all of its data points belong to the same class, while higher values indicate that the classes are more mixed; for a two-class sample the maximum is 0.5, and the maximum approaches 1 as the number of classes grows. Mathematically, the Gini index is given by:
Gini = 1 - Σ pᵢ²
where pᵢ is again the proportion of class i in the sample.
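And the same three baskets scored with the Gini index (again, just an illustrative sketch):

```python
def gini_index(proportions):
    """Gini index of a sample, given the class proportions (which sum to 1)."""
    return 1 - sum(p ** 2 for p in proportions)

print(gini_index([1.0]))       # all apples                  -> 0.0
print(gini_index([0.5, 0.5]))  # half apples, half oranges   -> 0.5 (max for two classes)
print(gini_index([0.25] * 4))  # four fruit types, equal mix -> 0.75
```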
Now that we know what impurity, entropy and the Gini index are, we can dive into how they help us build a decision tree.
The algorithm behind the decision tree is fairly straightforward. We first calculate the weighted Gini impurity of every feature, and the feature with the lowest value becomes the root of the tree. We then split the data points according to the attributes (values) of that feature and, within each branch, again calculate the Gini impurity of the remaining features; the feature with the lowest value becomes the node at the top of that sub-tree, below the previous one. This is repeated until the nodes are pure or no features are left. There are a few complexities involved, so we will work through the algorithm more concretely with an example.
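Before the example, here is a minimal Python sketch of that recursion for categorical features. The names (build_tree, weighted_gini, the dict-based rows) are illustrative, not a reference implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, label_key="decision"):
    """Weighted Gini impurity of splitting `rows` on a categorical `feature`."""
    total = len(rows)
    impurity = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[label_key] for row in rows if row[feature] == value]
        impurity += (len(subset) / total) * gini(subset)
    return impurity

def build_tree(rows, features, label_key="decision"):
    labels = [row[label_key] for row in rows]
    if len(set(labels)) == 1 or not features:        # pure node, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]  # leaf: the majority decision
    best = min(features, key=lambda f: weighted_gini(rows, f, label_key))
    children = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        children[value] = build_tree(subset, [f for f in features if f != best], label_key)
    return {best: children}

# Usage: build_tree(rows, ["outlook", "temperature", "humidity", "wind"]) where
# rows is a list of dicts like {"outlook": "sunny", ..., "decision": "no"}.
```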
We will use the classic weather data set for this. Here is what this small data set looks like.
Table 1: Data Points (the weather data set)

Day | Outlook  | Temperature | Humidity | Wind   | Play
1   | Sunny    | Hot         | High     | Weak   | No
2   | Sunny    | Hot         | High     | Strong | No
3   | Overcast | Hot         | High     | Weak   | Yes
4   | Rain     | Mild        | High     | Weak   | Yes
5   | Rain     | Cool        | Normal   | Weak   | Yes
6   | Rain     | Cool        | Normal   | Strong | No
7   | Overcast | Cool        | Normal   | Strong | Yes
8   | Sunny    | Mild        | High     | Weak   | No
9   | Sunny    | Cool        | Normal   | Weak   | Yes
10  | Rain     | Mild        | Normal   | Weak   | Yes
11  | Sunny    | Mild        | Normal   | Strong | Yes
12  | Overcast | Mild        | High     | Strong | Yes
13  | Overcast | Hot         | Normal   | Strong | Yes
14  | Rain     | Mild        | High     | Strong | No
As seen, we have 4 features here: outlook, temperature, humidity and wind. The decision here is a simple Boolean value, i.e. whether to play, Yes or No. Note that the decision will not always be this simple; it could be a broader classification task or a regression task too.
Let us start with the outlook. There are two possible outcomes of the decision, Yes or No, and the feature outlook has 3 attributes: sunny, overcast and rain. Counting the days for each attribute gives the table below:

Outlook  | Yes | No | Total
Sunny    |  2  |  3 |  5
Overcast |  4  |  0 |  4
Rain     |  3  |  2 |  5
Now we calculate the Gini impurity for each attribute of the feature outlook.
For Sunny, we get:
Gini impurity = 1 - (2/5)² - (3/5)² = 0.48
For Overcast, we get:
Gini impurity = 1 - (4/4)² - (0/4)² = 0
For Rain, we get:
Gini impurity = 1 - (3/5)² - (2/5)² = 0.48
Now, since we have the Gini Impurity for each of the attributes of the feature outlook, we calculate the overall Gini impurity of the feature outlook. Basically we are calculating the weighted impurity.
Gini impurity (outlook) = g1*w1 + g2*w2 + g3*w3
where w1 + w2 + w3 = 1 and w1, w2, w3 are the weights of the attributes, i.e.
wi = (number of instances of the ith attribute) / (total number of instances, here 14)
Gini impurity (outlook) = (5/14)*0.48 + (4/14)*0 + (5/14)*0.48 ≈ 0.343
So the Gini impurity of outlook is approximately 0.343.
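Here is a quick check of that arithmetic, using only the yes/no counts for each outlook value (a small illustrative sketch):

```python
def gini_from_counts(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

# (yes, no) counts per outlook value, 14 days in total
counts = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
total = sum(yes + no for yes, no in counts.values())

weighted = sum((yes + no) / total * gini_from_counts(yes, no)
               for yes, no in counts.values())
print(round(weighted, 3))  # 0.343
```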
Similarly, we calculate the Gini impurity of all the other features. Since these are just the same calculations again, we only state the results here (the short sketch after this list reproduces them from the raw data):
Gini impurity of temperature ≈ 0.440
Gini impurity of humidity ≈ 0.367
Gini impurity of wind ≈ 0.429
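The skipped calculations are easy to reproduce. The sketch below loads Table 1 into a pandas DataFrame and computes the weighted Gini impurity of every feature (the column and variable names are simply how we chose to encode the table):

```python
import pandas as pd

# The 14 rows of Table 1 (days 1-14, in order)
data = pd.DataFrame({
    "outlook":     ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
                    "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity":    ["high", "high", "high", "high", "normal", "normal", "normal",
                    "high", "normal", "normal", "normal", "high", "normal", "high"],
    "wind":        ["weak", "strong", "weak", "weak", "weak", "strong", "strong",
                    "weak", "weak", "weak", "strong", "strong", "strong", "strong"],
    "play":        ["no", "no", "yes", "yes", "yes", "no", "yes",
                    "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

def gini(labels):
    p = labels.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

def weighted_gini(df, feature, target="play"):
    return sum(len(group) / len(df) * gini(group[target])
               for _, group in df.groupby(feature))

for feature in ["outlook", "temperature", "humidity", "wind"]:
    print(feature, round(weighted_gini(data, feature), 3))
# outlook 0.343, temperature 0.44, humidity 0.367, wind 0.429
```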
Now, the Gini impurity of outlook is the lowest, hence outlook becomes our root node. Since outlook has 3 attributes, three branches come out of the root node. This is shown in the tree below.
Next, for each of the branches, we have 3 features left, and we take one branch at a time. For instance, let us take the sunny branch: we consider only those data points that have a sunny outlook, which, as per Table 1, are days 1, 2, 8, 9 and 11.
Now we calculate the Gini impurity of each of the three remaining features on these 5 data points only.
Gini impurity for temperature (hot) = 1 - (0/2)² - (2/2)² = 0
Gini impurity for temperature (mild) = 1 - (1/2)² - (1/2)² = 0.50
Gini impurity for temperature (cool) = 1 - (1/1)² - (0/1)² = 0
Gini impurity for temperature = 0*(2/5) + 0.5*(2/5) + 0*(1/5) = 0.2
Similarly, we calculate the Gini impurity of the other two features too.
Gini impurity for humidity = 0
Gini impurity for wind = 0.466
The lowest Gini value is that of humidity, so humidity becomes the next node on the sunny branch of our decision tree. We then proceed in the same way with the other two branches of the root node (overcast and rain) and continue downwards until the whole tree is built from the data points. After all the calculations, the decision tree looks something like the one shown below.
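If you would rather not do all the arithmetic by hand, a library implementation arrives at the same splits. The sketch below reuses the DataFrame data from the earlier sketch and scikit-learn's DecisionTreeClassifier with the Gini criterion; note that scikit-learn builds binary trees over the one-hot encoded columns, so the printed tree has the same logic but a slightly different shape than the multiway tree above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text
import pandas as pd

# One-hot encode the categorical features so scikit-learn can split on them
X = pd.get_dummies(data[["outlook", "temperature", "humidity", "wind"]])
y = data["play"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))
```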
Note that in the above example we only had categorical data, i.e. every feature took one of a fixed set of attribute values. If a feature is numerical, a few things change. For example, had the wind been measured in km/h instead of just strong or weak, we would have to find a critical value (or a set of critical values) of wind speed at which the decision changes, so that we can still split the data. For this we follow a simple procedure: we sort the sample points in ascending (or descending) order of the numeric feature (if more than one feature is numeric, we do this separately for each), compute the mean of every pair of adjacent values, and calculate the Gini impurity of the split at each of these candidate values. We then follow the same rule as before and split the data at the candidate value with the lowest impurity.
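As an illustration of this procedure, the sketch below picks the best threshold for a hypothetical numeric wind-speed column (the speeds and decisions are made-up example values):

```python
def gini(labels):
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Try the midpoints of adjacent sorted values and return the threshold
    with the lowest weighted Gini impurity, together with that impurity."""
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2 for i in range(len(pairs) - 1)]
    best = None
    for t in candidates:
        left = [label for value, label in pairs if value <= t]
        right = [label for value, label in pairs if value > t]
        score = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Hypothetical wind speeds in km/h and the corresponding play decisions
speeds = [5, 8, 12, 20, 25, 30]
play   = ["yes", "yes", "yes", "no", "no", "no"]
print(best_numeric_split(speeds, play))  # (16.0, 0.0): a perfect split at 16 km/h
```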
Significance of Decision Tree:
Decision trees essentially reveal the relative importance of the various features of the data points and segregate the data on the basis of those features. On a large data set with many features, however, this way of segregating data can become cumbersome and is not always efficient, and the tree may overfit if the data set contains many unnecessary features or features of very little importance. This problem can be reduced by pruning. The simplest form of pruning works upwards from the leaf nodes, removing the nodes with the least popular classification as long as the accuracy of the tree does not suffer. Hence, many complex data-segregation problems can be solved with decision trees or their more advanced relatives, random forests.
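The pruning described above is reduced-error pruning; a closely related and widely available alternative is cost-complexity pruning, which scikit-learn exposes through the ccp_alpha parameter. A rough sketch, assuming the encoded X and y from the earlier sketch (in practice you would tune the alpha value, e.g. by cross-validation):

```python
from sklearn.tree import DecisionTreeClassifier

# Grow a full tree, inspect the pruning path, then refit with a non-zero
# ccp_alpha; larger alphas prune away more of the least useful splits.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)  # candidate alphas, from no pruning up to a single node

alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # just picking a middle value here
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
print(pruned.get_n_leaves(), "leaves after pruning vs", full_tree.get_n_leaves())
```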
Besides this, decision trees have one major disadvantage: they work very well on the data used to create them, i.e. the training data, but when new data sets are used in practice, their accuracy often drops and their predictions can be misleading. This is quite natural, since the tree is built entirely from the data provided; if the input data has even slightly different characteristics, the tree can make poor predictions.
Random Forests:
Random forests combine the simplicity of decision trees with added flexibility, and hence improve the accuracy to a significant extent. What random forests do is create a forest of many decision trees: from the given data set, a subset of data points is chosen at random (with replacement), and a decision tree that uses only some randomly selected features is grown on it; one data point can therefore appear in more than one tree. Once the forest is built, we take a new example, run it through all the trees in the forest and keep track of the decision each tree makes. After the data point has gone through every tree, we aggregate the results, by majority vote for classification (or by averaging for regression), to get the final answer. This gives us both more flexibility and better accuracy. Randomly sampling the data with replacement is called bootstrapping, and combining bootstrapping with aggregating the results is called bagging; choosing a random subset of features is the extra ingredient that turns a bagged ensemble of trees into a random forest.
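A minimal sketch of this idea, building a small forest by hand on top of scikit-learn decision trees and reusing the encoded X and y from earlier (in practice you would simply use sklearn.ensemble.RandomForestClassifier, which does all of this internally):

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_trees, n_rows = 25, len(X)
forest = []

for _ in range(n_trees):
    # Bootstrap: sample rows with replacement, so a row can appear in several trees
    idx = rng.integers(0, n_rows, size=n_rows)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X.iloc[idx], y.iloc[idx])
    forest.append(tree)

def forest_predict(row):
    # Aggregate: majority vote over the predictions of all the trees
    votes = [tree.predict(row)[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]

print(forest_predict(X.iloc[[0]]))  # prediction for the first day in the data set
```

Here max_features="sqrt" makes each tree consider only a random subset of the features at every split, which is the usual random-forest flavour of the "randomly selected features" idea.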
Stay tuned for new articles in the ML-DL 101 series!