Tree Based Algorithms - All that you need to know!
"Learn how the tree based algorithms make decision and compare against the linear counterparts. "

Photo by Pranit Bhujel
Intuition behind trees!
Generally, when we look at tree-based algorithms, we start with Decision Trees. Decision Trees are one of the most widely used algorithms for solving supervised machine learning problems. From a layman's perspective, we can say that decision trees follow a process very similar to how a human makes a decision.
In this technique, the given sample or population is divided into two or more homogeneous sets based on the most significant attributes. One good thing about tree-based algorithms is that they perform well for both classification and regression problems.
These algorithms follow a top-down greedy approach called Recursive Binary Splitting. It is greedy because it only looks for the best split at the current step, without considering future splits!
The most relevant attributes are kept at the top as root nodes, and as we go deeper, the relevance of the attributes reduces. This mirrors how we shop: we don't usually buy something and then figure out its purpose; we first identify the item we need and then go shopping.
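To make that top-down, greedy process concrete, here is a minimal from-scratch sketch of recursive binary splitting. Everything in it (the `gini`, `best_split` and `build_tree` names, the Gini scoring, the dictionary representation of nodes) is my own illustration, not a reference implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of labels (0 means the node is pure)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """The greedy step: try every feature/threshold pair and keep the split
    whose two children have the lowest weighted impurity."""
    best, best_score = None, float("inf")
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            if left.all() or (~left).all():
                continue  # this threshold doesn't actually separate anything
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursive binary splitting: split greedily, then recurse on each half."""
    split = None if depth == max_depth or gini(y) == 0 else best_split(X, y)
    if split is None:
        # Leaf node: predict the majority class of the samples that reached it.
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    feature, threshold = split
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

# Tiny toy dataset with two features.
X = np.array([[2.0, 1.0], [3.0, 1.0], [10.0, 2.0], [11.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y))
```

Real libraries do the same thing with a much smarter threshold search and stopping rules, but the greedy, top-down structure is essentially this.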
. . .
Important Terminology in Trees
Root Node: The attribute that divides the entire sample into groups which are homogeneous within themselves but heterogeneous from one another is called the root node.
Decision Node: The internal nodes that help in decision making, routing new test data down the tree towards its label.
Splitting: The process of dividing the node into two or more sub-nodes.
Pruning: The opposite of splitting; it involves removing sub-nodes and merging them back into their parent node.
Leaf/Terminal Node: The final label node, which cannot be split any further, is called a Leaf Node.
Parent/Child Node: The node which has sub-nodes below it is called a parent node, and the nodes below a parent node are called Child Nodes.
Branch/Sub-Tree: A section of the entire tree is called a branch or sub-tree.
Pure Node: If all samples in a node belong to just one class, we say that it's a pure node: it has high homogeneity and low entropy.
Impure Node: If a node has samples mixed from several classes, we say that the node is impure: low homogeneity and high entropy.
. . .
How does a tree-based model work?
Decision Trees measure the significance of each attribute in the dataset w.r.t. the label to be predicted. There are a few ways to find the significance of an attribute. Once the most significant attribute is found, it becomes the root node of the tree, dividing the sample into sub-samples, which are then split again using the remaining attributes. This recursive process continues until no node can be divided any further!
Once that happens, we're left with a root node, decision nodes and leaf nodes. When the resulting tree is fed new data, it can decide the final result for us.
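As a quick sketch of that end-to-end flow with scikit-learn (assuming it is installed; the iris dataset is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Fit a small tree: the most significant attribute ends up as the root node,
# and splitting continues until the stopping criteria are met.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned structure: root node at the top, decision nodes below it,
# and leaf nodes holding the final class labels.
print(export_text(tree, feature_names=iris.feature_names))

# Feeding it new data routes each sample down the tree to a leaf.
print(tree.predict(X[:5]))
```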
. . .
How are tree-based algorithms better than Linear Regression?
Tree-based algorithms perform really well compared to linear models when it comes to mapping non-linear relationships in the dataset. If the relationship in the data is linear, linear regression / linear models will outperform tree-based models.
It's only when the data is non-linear that the trees really shine!
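Here's a small sketch of that difference, assuming scikit-learn and NumPy; the noisy sine curve simply stands in for a non-linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)   # a non-linear target

# The straight line can't follow the curve; the tree carves it into steps.
print("linear R^2:", LinearRegression().fit(X, y).score(X, y))                   # typically low
print("tree   R^2:", DecisionTreeRegressor(max_depth=4).fit(X, y).score(X, y))   # typically close to 1
```

(Scoring on the training data flatters the tree a little; the point here is only the shape of the fit, not a proper evaluation.)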
. . .
Working of Trees - Example
What better way to understand than with an automotive example!
Say that someone's asking for your help with buying a bike. As someone who's knowledgeable in that area, the way you'd shape your decisions and proceed from the start is very similar to the way Decision Trees or tree-based algorithms work.
The first question - perhaps the purpose of buying the bike: Touring, Commuting, Cruising, Sports, Naked, Street Fighter. This would act as a solid root node as it divides the data into homogeneous sets, each set heterogeneous from the others and serving a different purpose.
The second question - the price. The bikes could now be divided across multiple price points, so the price would further divide the existing nodes into multiple sub-nodes.
By this point we have segregated the bikes of each category by price range.
The third question - the engine. Now we may want to ask what kind of power we want from the engine; the engine volume would again divide the nodes into sub-nodes.
And this way, when all the questions are answered - the mileage, the warranty, the service cost, the resale value, the weight and so on - we get the final decision, which in this case would be the bike we desire!
. . .
And that's a Decision Tree in a nutshell. But we're left with many more questions, aren't we?
- So how does the algorithm determine which attribute to keep as root node?
- On what basis does the algorithm split the nodes?
- How do we know which node is a leaf node and which should be split further?
- How do we know when to stop - the depth of the tree?
. . .
The way we split the node heavily affects a tree's accuracy
The reason is that each split adds to the final decision. If we split a node without much thought, the decision that comes out of it won't be very accurate. So, it's very important for us to decide which node to split and when.
How does the algorithm figure out which node to split?
The algorithm uses a few statistical techniques (Gini, Information Gain, Chi-Square, Reduction in Variance) to find the significance of an attribute and decide where to split. Different methods are used for splitting a node depending on the target variable - continuous (Regression) or discrete (Classification).
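In scikit-learn, for example, the criterion is a plain hyperparameter; a quick sketch (the exact set of accepted values can vary a little between versions):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification (discrete target): impurity-based criteria.
clf_gini    = DecisionTreeClassifier(criterion="gini")      # Gini impurity (the default)
clf_entropy = DecisionTreeClassifier(criterion="entropy")   # entropy, i.e. information gain

# Regression (continuous target): variance-reduction style criteria.
reg = DecisionTreeRegressor(criterion="squared_error")      # reduction in variance / MSE
```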
. . .
The creation of sub-nodes increases the homogeneity of resultant sub-nodes
Homogeneity: Data points being in the same class. Increase in homogeneity means that most of the data in a node belongs to the same class.
Entropy: The measure of randomness in the data. More entropy means that the sample contains data elements from multiple classes.
'The main goal of splitting the nodes in a tree based model is to reduce the entropy and increase the homogeneity!'
So, whichever method of splitting we follow, it increases the homogeneity with each split and reduces the entropy!
. . .
Perhaps homogeneity and entropy are better understood using an example:
Say, in a class there are 30 students, and we have their data (gender, grades, height, weight...) along with whether they play sports or not. Now, if there's a new entry in the class, we'd want to predict whether that student plays sports or not, based on the data that we have.
In this case, just take the gender attribute: if we split the data based on gender, then we automatically have males on one side and females on the other. Thus homogeneity is increased and entropy in the data is reduced, all at once!
Hence, this continues until the user-specified criteria are met or there are no nodes left to split.
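In scikit-learn those user-specified stopping criteria show up as hyperparameters; a sketch with arbitrary values:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # never grow the tree deeper than this
    min_samples_split=20,  # a node needs at least this many samples to be split
    min_samples_leaf=5,    # every leaf must keep at least this many samples
)
# Even without these limits, splitting stops on its own once a node is pure
# or no split improves it; there is simply nothing left to split.
```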
. . .
Ways of building the decision tree
Any version of the decision tree algorithm will result in a tree; the only difference between them is the way in which the tree is constructed. The method is chosen based on the type of the target variable (discrete or continuous). The core step in constructing a tree is choosing the attribute to split on. There are four widely used methods for the split:
- Information Gain
- Gini
- Chi-Square
- Reduction in Variance
Information Gain
Perhaps we can start with a question - 'How relevant is an attribute for the target class? How much information can we gain about the target class/attribute using the given attribute?' - that is Information Gain.
Information Gain is the reduction in entropy when a node is split into sub-nodes!
Entropy for each sub-node (written $H(s)$ or $E(s)$) -
$$E(s) = -\sum_{x=1}^{n} P(x)\ \log_2 P(x)$$
. . .
Weighted Entropy for the sub-nodes -
$$E(split) = \sum_{x=1}^{n} P(x) * E(x)$$
- Here, $E(x)$ is the entropy of each sub-node.
- $P(x)$ is the weight of each child node, i.e. the fraction of the parent's samples that fall into it.
. . .
Information Gain by the target variable after the node is split -
$$IG(Y, X) = E(Y) - E(Y|X)$$
- This can be read as - the Information Gain on Y from X.
. . .
Information Gain by the parent node from the sub-nodes -
$$IG(X) = E(parent) - E(split)$$
- Here, $E(split)$ is the weighted entropy of the sub-nodes. In the worked example below, the parent node has a perfect 50/50 class split, so its entropy is 1.
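Here's a small sketch of these three formulas in plain Python; the function names are my own, and `counts` is just the number of samples of each class inside a node:

```python
from math import log2

def entropy(counts):
    """E(s) = -sum(P(x) * log2 P(x)), where P(x) is each class's share of the node."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def weighted_entropy(children):
    """E(split) = sum(P(x) * E(x)) over the child nodes.
    `children` is a list of per-child class counts, e.g. [[13, 7], [2, 8]]."""
    total = sum(sum(child) for child in children)
    return sum((sum(child) / total) * entropy(child) for child in children)

def information_gain(parent_counts, children):
    """IG = E(parent) - weighted entropy of the children."""
    return entropy(parent_counts) - weighted_entropy(children)

# A pure node has entropy 0, a perfect 50/50 node has entropy 1.
print(entropy([10, 0]), entropy([15, 15]))   # 0.0 1.0
```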
. . .
Tip 1: An alternative way of calculating $\log_2 x$ on a scientific calculator is to divide $\log_{10} x$ by $\log_{10} 2$.
Tip 2: Both ways of evaluating the Information Gain are valid and will lead us to the most relevant attributes. So yeah, enjoy!
Information Gain - Example
Question
There are 30 students from Class 9 and Class 10. Out of those 30 students, 14 are from Class 9, among which 6 play cricket, and 16 are from Class 10, among which 9 play cricket. If we look at the gender count, 10 of them are female, out of which 2 play cricket, and 20 of them are male, out of which 13 play cricket.
Find the Entropy and Information Gain for each candidate split (gender and class).
Solution
Gender will have two sub-nodes - males and females. Total students are 30.
m = 20, $P(m)$ = 20/30 = 0.66
f = 10, $P(f)$ = 10/30 = 0.33
--
Entropy of Males
p = 13, $P(play)$ = 13/20 = 0.65
np = 7, $P(np)$ = 7/20 = 0.35
$$E(Males) = -P(p)\log_2{P(p)} - P(np)\log_2{P(np)}$$
$$E(Males) = 0.40 + 0.53$$
$$E(Males) = 0.93$$
--
Entropy of Females
p = 2, $P(play)$ = 2/10 = 0.2
np = 8, $P(np)$ = 8/10 = 0.8
$$E(Females) = -P(p)\log_2{P(p)} - P(np)\log_2{P(np)}$$
$$E(Females) = 0.463 + 0.257$$
$$E(Females) = 0.72$$
--
Weighted Entropy (Gender)
$$E(gender) = P(males)E(males) + P(females)E(females)$$
$$E(gender) = 0.66 * 0.93 + 0.33 * 0.72$$
$$E(gender) = 0.6138 + 0.2376$$
$$E(gender) = 0.8514$$
--
Similarly, when we calculate the weighted entropy for the class split, we get the following value.
$$E(class) = 0.99$$
--
Now, let's find out the Information Gain w.r.t. the parent node. The entropy of the parent node is 1, since exactly 15 of the 30 students play cricket (a perfect 50/50 split).
$$IG(parent) = 1 - E(subnodes)$$
$$IG(gender) = 1 - 0.85 = 0.15$$
$$IG(class) = 1 - 0.99 = 0.01$$
--
We can see that the information gain of the gender attribute is much higher compared to the class attribute, so it's more relevant. Hence, gender makes a good root node.
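As a sanity check, here's the same arithmetic in plain Python (a quick sketch; with unrounded probabilities IG(gender) comes out closer to 0.14 than 0.15, the small gap being the rounding in the hand calculation above):

```python
from math import log2

def entropy(p):
    """Binary entropy of a node where a fraction p of samples play cricket."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

# Gender split: 20 males (13 play), 10 females (2 play); parent: 15 of 30 play.
e_males, e_females = entropy(13 / 20), entropy(2 / 10)   # ~0.93 and ~0.72
e_gender = (20 / 30) * e_males + (10 / 30) * e_females   # ~0.86 (0.85 above with rounded values)
ig_gender = entropy(15 / 30) - e_gender                  # ~0.14 (0.15 above)

# Class split: Class 9 has 14 students (6 play), Class 10 has 16 (9 play).
e_class = (14 / 30) * entropy(6 / 14) + (16 / 30) * entropy(9 / 16)   # ~0.99
ig_class = entropy(15 / 30) - e_class                                 # ~0.01

print(ig_gender, ig_class)   # gender is clearly the better split
```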
We can also find the IG w.r.t. the entropy of the target variable, but that's a story for another post!