# Building a Decision Tree Classifier

By Chris Jakuc | Jul, 2020 | Noteworthy


Each node can be thought of as an if statement, evaluated in turn until a prediction, or “leaf,” is reached at the bottom of the tree. In the root node (the box at the top of the tree), the algorithm behind the hypothetical tree determined that the ideal feature is weight and the ideal threshold is 300 pounds. In other words, whether a person weighs at least 300 pounds or less than 300 pounds was the best feature value to split on for predicting their diabetes status. Among those who weigh at least 300 pounds, the tree then splits on an age of 60 years old and finds no other worthwhile splits. Most people in the training data who weighed at least 300 pounds AND were at least 60 years old had diabetes, so the model would predict that someone who weighs 350 pounds and is 65 years old has diabetes.

Going down another route of the tree, someone who weighs 200 pounds and exercises 1 hour a week would be predicted not to have diabetes. The idea is that we can now input the characteristics of any person (weight, age, exercise, family history) and the trained classification tree will output their predicted diabetes status.
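To make the “each node is an if statement” idea concrete, here is a minimal hand-written sketch of the hypothetical tree. Only the 300-pound and 60-year thresholds and the two example predictions come from the description above. The function name, the exercise threshold, and the two remaining leaves are assumptions added purely for illustration.

```python
def predict_diabetes(weight_lb, age_years, exercise_hours_per_week):
    """Hand-coded stand-in for the hypothetical tree; nothing here is learned from data."""
    if weight_lb >= 300:                  # root split described in the text
        if age_years >= 60:               # second split described in the text
            return "diabetes"
        return "no diabetes"              # ASSUMPTION: this leaf is not given in the text
    # ASSUMPTION: the text gives no exercise threshold; 1 hour/week is illustrative only.
    if exercise_hours_per_week >= 1:
        return "no diabetes"
    return "diabetes"                     # ASSUMPTION: this leaf is not given in the text


print(predict_diabetes(350, 65, 0))
print(predict_diabetes(200, 40, 1))
```

A trained tree has the same shape, except that the features, thresholds, and leaf predictions are chosen by the algorithm rather than written by hand.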

Now that you understand a bit of what to expect, let’s jump into how everything is calculated and what’s actually happening under the hood.

The root node pertains to all of the training data. To create splits and therefore subsets in the data, there needs to be a metric to measure the quality of potential splits. This metric generally measures the homogeneity of the class values in the data. I used Gini impurity as my metric. Wikipedia’s Decision Tree article describes Gini impurity as “a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.”

$$G = 1 - \sum_{k=1}^{n} p_k^2$$

Here n is the number of classes and p_k is the proportion of observations belonging to class k.
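As a minimal sketch, Gini impurity (one minus the sum of the squared class proportions) could be computed in Python as follows. The function name `gini_impurity` and the choice to return 0.0 for an empty list are assumptions made for this example, not necessarily how the original implementation handles it.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the classes present in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumed convention for an empty subset
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


print(gini_impurity([1, 1, 1]))  # a pure node has impurity 0.0
```

Labels can be any hashable values (integers, strings), since only their relative frequencies matter.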

To see how that works in practice, let’s think about Gini impurity in terms of the hypothetical example tree:

• The Gini impurity of all the training data, assuming a 50/50 split of people with and without diabetes, will be 0.5. In other words, if we predict according to the measured distribution of diabetes status, we will be wrong 50% of the time.
• If it were instead a 75/25 split, the Gini impurity would be 0.375 and we would expect to be wrong 37.5% of the time.
• Since Gini impurity measures how often a prediction would be mislabeled, we want to minimize it to find the best split
• Calculate the Gini impurity of all the data coming into the parent node, like we did in the bullet points above
• To judge the quality of a potential split, calculate the Gini impurity of the 2 sub-datasets created by the split, combine them (typically as an average weighted by the size of each sub-dataset), and then compare that to the Gini impurity of the parent node
• To find the ideal split, do this for all potential splits and choose the one with the smallest resulting impurity (as long as it’s smaller than the Gini impurity of the parent node)
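The search described in these bullet points can be sketched as an exhaustive loop over every feature and every observed value. This is an illustration under stated assumptions rather than the article's actual code: the function names are hypothetical, the child impurities are combined as a size-weighted average, candidate thresholds are simply the distinct values seen in the data, and samples with a value of at least the threshold go to the right-hand subset.

```python
from collections import Counter


def gini_impurity(labels):
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini), or None if no split beats the parent.

    rows: list of equal-length lists of numeric feature values.
    """
    best = None
    best_gini = gini_impurity(labels)  # a split must be strictly better than the parent
    n_samples = len(labels)
    n_features = len(rows[0]) if rows else 0
    for feature_index in range(n_features):
        # Candidate thresholds: every distinct value observed for this feature.
        for threshold in sorted({row[feature_index] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature_index] < threshold]
            right = [y for row, y in zip(rows, labels) if row[feature_index] >= threshold]
            if not left or not right:
                continue  # not a real split
            weighted = (len(left) * gini_impurity(left)
                        + len(right) * gini_impurity(right)) / n_samples
            if weighted < best_gini:
                best_gini = weighted
                best = (feature_index, threshold, weighted)
    return best


# Toy data: columns are [weight in pounds, age in years]; 1 = diabetes, 0 = no diabetes.
rows = [[350, 65], [320, 70], [200, 30], [180, 45]]
labels = [1, 1, 0, 0]
print(best_split(rows, labels))
```

Checking every distinct value keeps the sketch short, though it is slow on large datasets; sorting each feature once and updating class counts incrementally is a common optimization.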

This is calculated for the root node (all the data) and then for each subsequent subset of data created by a split. Splitting continues until the maximum depth is reached or the Gini impurity of the child nodes is no smaller than that of their parent node.
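The recursion and its two stopping rules might look like the sketch below. To keep it short, the tree is stored as nested dictionaries rather than a dedicated class. All names, the `max_depth=3` default, and the size-weighted average of child impurities are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def find_split(rows, labels):
    """Return the (feature, threshold) whose weighted child Gini beats the parent, else None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] < t]
            right = [y for row, y in zip(rows, labels) if row[f] >= t]
            if left and right:
                score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
                if score < best_score:
                    best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow the tree recursively; every node stores the majority class as its prediction."""
    node = {"prediction": Counter(labels).most_common(1)[0][0], "depth": depth}
    if depth >= max_depth:          # stopping rule 1: maximum depth reached
        return node
    split = find_split(rows, labels)
    if split is None:               # stopping rule 2: no split improves on the parent
        return node
    f, t = split
    left_idx = [i for i, row in enumerate(rows) if row[f] < t]
    right_idx = [i for i, row in enumerate(rows) if row[f] >= t]
    node["feature"], node["threshold"] = f, t
    node["left"] = build_tree([rows[i] for i in left_idx],
                              [labels[i] for i in left_idx], depth + 1, max_depth)
    node["right"] = build_tree([rows[i] for i in right_idx],
                               [labels[i] for i in right_idx], depth + 1, max_depth)
    return node


def predict(node, row):
    """Walk from the root to a leaf, going right when the value is at least the threshold."""
    while "feature" in node:
        node = node["left"] if row[node["feature"]] < node["threshold"] else node["right"]
    return node["prediction"]
```

Calling `build_tree(rows, labels)` on a list of numeric feature rows and a matching list of class labels returns the root dictionary, which can then be passed to `predict` along with a new row.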

There are myriad situations where a Decision Tree Classifier could be useful. Classification trees apply to almost any case where you want to predict what something is, whether that’s if someone has diabetes, the breed of a dog, or the weather. Still, this method has distinct advantages and disadvantages.

Advantages:

• Highly interpretable
• Little need for data pre-processing or feature engineering
• Can be made more accurate (over-fit less) by using an ensemble technique to create a random forest

Disadvantages:

• Susceptible to time and memory constraints
• Sensitive to changes in the training data
• Can easily over-fit without proper pruning

First, we need to create a `Node` class so we can store information about any splits:

• Predicted class
• Feature index
• Threshold

We also want to know if a particular node has child nodes and what those child nodes are:

To facilitate printing and testing, I also decided to save details about the position of the node in relation to its parent, as well as how deep the node is within the tree:

• Left branch
• Right branch
• Depth
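Putting the attributes listed above together, a `Node` class could be sketched roughly as follows. Only the class name `Node` comes from the text. The attribute names, constructor signature, and `is_leaf` helper are hypothetical, and the original implementation may differ.

```python
class Node:
    """One node of a classification tree (attribute names are illustrative)."""

    def __init__(self, predicted_class, depth=0, left_branch=False, right_branch=False):
        # Information about the split made at this node
        self.predicted_class = predicted_class
        self.feature_index = None   # stays None for a leaf
        self.threshold = None       # stays None for a leaf
        # Child nodes, if any
        self.left = None
        self.right = None
        # Position details, kept for printing and testing
        self.left_branch = left_branch      # True if this node is its parent's left child
        self.right_branch = right_branch    # True if this node is its parent's right child
        self.depth = depth

    def is_leaf(self):
        """A node without children is a leaf and only carries a prediction."""
        return self.left is None and self.right is None


# Usage: a root that splits on feature 0 at a threshold of 300, with two leaf children.
root = Node(predicted_class=0)
root.feature_index, root.threshold = 0, 300
root.left = Node(predicted_class=0, depth=1, left_branch=True)
root.right = Node(predicted_class=1, depth=1, right_branch=True)
print(root.is_leaf(), root.left.is_leaf())
```

Storing the predicted class on every node, not only on leaves, means a node can still give a prediction if splitting stops there.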