Each node can be thought of as an if statement that routes data down the tree until a prediction, or "leaf," is reached at the bottom. In the root node (the box at the top of the tree), the algorithm of the hypothetical tree determined that the ideal feature is weight and the ideal threshold is 300 pounds. In other words, whether a person weighs at least 300 pounds, or under, was the most informative feature value to split on when predicting their diabetes status. Among those who weigh at least 300 pounds, the tree then splits on an age of 60 years, after which it found no further worthwhile splits. In the hypothetical tree, most of the people who were at least 300 pounds AND at least 60 years old had diabetes, so the model would predict that someone who weighs 350 pounds and is 65 years old has diabetes. Going down another route of the tree, someone who weighs 200 pounds and exercises 1 hour a week would be predicted not to have diabetes. The idea is that we could now input the characteristics of any person (weight, age, exercise, family history) and the trained classification tree would output their predicted diabetes status.
Now that you understand a bit of what to expect, let’s jump into how everything is calculated and what’s actually happening under the hood.
Calculating Ideal Splits
The root node pertains to all of the training data. To create splits and therefore subsets in the data, there needs to be a metric to measure the quality of potential splits. This metric generally measures the homogeneity of the class values in the data. I used Gini impurity as my metric. Wikipedia’s Decision Tree article describes Gini impurity as “a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.”
Gini impurity = 1 − Σ pₖ² (summing over the n classes), where pₖ is the proportion of observations of class k
To see how that works in practice, let’s think about Gini impurity in terms of the hypothetical example tree:
- The Gini impurity of all the training data, assuming a 50/50 split of people with and without diabetes, will be 0.5. In other words, if we predict according to that hypothetically measured distribution of diabetes status, we will be wrong 50% of the time.
- If it were instead a 75/25 split, the Gini impurity would be 0.375 and we would expect to be wrong 37.5% of the time.
- Since Gini impurity measures how often a prediction would be mislabeled, we want to minimize it to find the best split
- Calculate the Gini impurity of all the data coming into the parent node, like we did in the bullet points above
- To judge the quality of a potential split, calculate the Gini impurity of the 2 sub-datasets created by the split, and then compare that to the Gini impurity of the parent node
- To find the ideal split, calculate the Gini impurity of all potential splits and choose the smallest one (as long as it’s smaller than the Gini impurity of the parent node)
This is calculated for the root node (all the data) and then each subsequent subset of data created by each split. The splitting continues until the maximum depth is reached or if the Gini impurity of the child nodes is not smaller than that of their parent node.
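To make those numbers concrete, here is a small sketch of the Gini impurity calculation (the function name and NumPy usage are my own, not necessarily the article's exact code):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum(p_k^2) over the classes k."""
    _, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A 50/50 split of diabetes status gives the worst-case impurity
print(gini(np.array([0, 0, 1, 1])))  # 0.5

# A 75/25 split is purer: 1 - (0.75^2 + 0.25^2) = 0.375
print(gini(np.array([0, 0, 0, 1])))  # 0.375
```

A perfectly pure node (all one class) scores 0, which is why the splitting stops once no split can reduce the impurity further.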
There are a myriad of potential situations where a Decision Tree Classifier could be useful. Classification trees apply to almost any case where you want to predict what something is, whether that’s someone’s diabetes status, the breed of a dog, or the weather. Still, this method has distinct advantages and disadvantages.
Advantages:
- Highly interpretable
- Little need for data pre-processing or feature engineering
- Can be made more accurate (and over-fit less) by using an ensemble technique to create a random forest

Disadvantages:
- Susceptible to time and memory constraints
- Sensitive to changes in the training data
- Can easily over-fit without proper pruning
Node Class
First, we need to create a Node class so we can store information about any splits:
- Predicted class
- Feature index
We also want to know if a particular node has child nodes and what those child nodes are:
To facilitate printing and testing, I also decided to save details about the position of the node in relation to its parent, as well as how deep the node is within the tree:
- Left branch
- Right branch
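Putting those pieces together, the Node class might look something like this (a sketch: attribute names such as side and depth are my guesses at the bookkeeping described above, not the article's exact code):

```python
class Node:
    """A single node in the classification tree."""

    def __init__(self, predicted_class):
        self.predicted_class = predicted_class  # majority class of this node's data
        self.feature_index = None  # column index used to split, if any
        self.threshold = None      # feature value to split on
        self.left = None           # child for samples below the threshold
        self.right = None          # child for samples at or above the threshold
        self.depth = 0             # how deep the node sits in the tree
        self.side = 'root'         # 'left' or 'right' relative to its parent
```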
Decision Tree Classifier Class
Next, we create a DecisionTreeClassifier class. For now it only needs an attribute for the max depth of the tree and an attribute for the tree itself (the root node). Later, we will need to add methods so we can fit on training data and predict on new data.
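As a starting point, the skeleton could look like this (a sketch with just the two attributes described so far; the methods come later):

```python
class DecisionTreeClassifier:
    """Skeleton of the classifier: attributes only, no methods yet."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # how deep the tree is allowed to grow
        self.tree = None            # root Node, set once we fit on data
```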
Fit Method
The fit() method of the Decision Tree Classifier class is fairly complex, so I broke it down with several helper methods. The method itself needs to save attributes for the number of classes and features in the whole dataset, as well as save the tree after creating it. It also expects an X and y for the training data features and classes.
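The shape of fit() can be sketched like this (a standalone toy version: the placeholder _grow_tree just returns a single majority-class leaf so the example runs; the real recursion is described in the Grow Tree Method section):

```python
import numpy as np

class Node:
    """Minimal leaf node so this sketch runs on its own."""
    def __init__(self, predicted_class):
        self.predicted_class = predicted_class
        self.left = None
        self.right = None

class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        # Save dataset-wide attributes used by the helper methods
        self.n_classes = len(np.unique(y))
        self.n_features = X.shape[1]
        # Build and save the tree, starting from all of the data
        self.tree = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
        # Placeholder: a single leaf predicting the majority class
        values, counts = np.unique(y, return_counts=True)
        return Node(values[np.argmax(counts)])
```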
Find Split Method
Before going into the grow_tree() method, it’s important to look at another helper method that it uses: the find_split() method. It is used to find the ideal feature and associated feature value to split on (the ideal column and ideal threshold), in order to minimize the Gini impurity of the child nodes. It is also the most complex method in this post, so I will break it up into two parts.
Find Split Part 1
- Instantiate the ideal column and ideal threshold variables as None. These are the variables returned by find_split()
- Gather number of observations and count of each unique class value to calculate Gini impurity of current node
- Check that there are at least 2 observations so a split is possible, else return
- Calculate current node’s Gini impurity
*All calls of the reshape method are used to facilitate future operations using NumPy*
Find Split Part 2
- Loop through the X feature columns
- Combine the X (a single column) and y data into a single matrix and sort them by the X value
- Separate them again. The X values are now all the possible thresholds and the y values are the observed class of each row of data
- Keep track of how many observations of each class are in the left and right nodes
- Loop through the index of every observation in the data
- Find the Gini impurity of the child nodes for each potential split and check if it’s smaller than the Gini impurity of the parent node. If it is, save that split’s column and threshold as the new ideal column and ideal threshold
This method of finding the Gini impurity of potential splits is very efficient, and you can read more about it from Joachim Valente on Medium. The key is that as we loop through the sorted feature values, we keep track of the changing number of observations per class on the left and the right.
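A standalone sketch of that efficient approach is below (function and variable names are mine; the article's implementation differs in details). Sorting each column once lets us update the left/right class counts in constant time per candidate threshold:

```python
import numpy as np

def gini_from_counts(counts, n):
    """Gini impurity from per-class counts of a node with n observations."""
    return 1.0 - sum((c / n) ** 2 for c in counts)

def find_split(X, y, n_classes):
    """Return (ideal_col, ideal_threshold) minimizing the weighted child Gini."""
    m = len(y)
    if m < 2:
        return None, None  # not enough observations to split

    # Gini impurity of the parent node
    parent_counts = [np.sum(y == c) for c in range(n_classes)]
    best_gini = gini_from_counts(parent_counts, m)
    ideal_col, ideal_threshold = None, None

    for col in range(X.shape[1]):
        # Sort this feature column together with the class labels
        order = np.argsort(X[:, col])
        thresholds, classes = X[order, col], y[order]

        # Running class counts: everything starts in the right child
        num_left = [0] * n_classes
        num_right = parent_counts.copy()

        for i in range(1, m):
            # Move one observation from the right child to the left
            c = classes[i - 1]
            num_left[c] += 1
            num_right[c] -= 1
            if thresholds[i] == thresholds[i - 1]:
                continue  # can't split between identical feature values
            # Weighted average of the two children's impurities
            g = (i * gini_from_counts(num_left, i)
                 + (m - i) * gini_from_counts(num_right, m - i)) / m
            if g < best_gini:
                best_gini = g
                ideal_col = col
                ideal_threshold = (thresholds[i] + thresholds[i - 1]) / 2
    return ideal_col, ideal_threshold
```

On a toy column like [1, 2, 10, 11] with classes [0, 0, 1, 1], this finds the split between 2 and 10, since both children then have an impurity of 0.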
Grow Tree Method
The grow tree method takes in the X and y data and builds the entire decision tree. It does this by recursively creating each node, finding each ideal split (using find_split()), and assigning the node attributes:
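The recursion can be sketched like this (a standalone toy version: for brevity it uses a simple brute-force split search instead of the efficient one, and the names are my own rather than the article's exact code):

```python
import numpy as np

class Node:
    def __init__(self, predicted_class):
        self.predicted_class = predicted_class
        self.feature_index = None
        self.threshold = None
        self.left = None
        self.right = None

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def find_split(X, y):
    """Brute-force split search: try every observed value as a threshold."""
    best = gini(y)
    ideal_col, ideal_threshold = None, None
    for col in range(X.shape[1]):
        for t in np.unique(X[:, col])[1:]:
            mask = X[:, col] < t
            g = (mask.sum() * gini(y[mask])
                 + (~mask).sum() * gini(y[~mask])) / len(y)
            if g < best:
                best, ideal_col, ideal_threshold = g, col, t
    return ideal_col, ideal_threshold

def grow_tree(X, y, depth=0, max_depth=3):
    """Make a node predicting the majority class, then split and recurse
    while a split exists that lowers the Gini impurity."""
    values, counts = np.unique(y, return_counts=True)
    node = Node(predicted_class=values[np.argmax(counts)])
    if depth < max_depth:
        col, threshold = find_split(X, y)
        if col is not None:
            mask = X[:, col] < threshold
            node.feature_index, node.threshold = col, threshold
            node.left = grow_tree(X[mask], y[mask], depth + 1, max_depth)
            node.right = grow_tree(X[~mask], y[~mask], depth + 1, max_depth)
    return node
```

Note that the recursion stops on its own when find_split() returns None, which happens exactly when no split improves on the parent's impurity.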
Predict Method
Last of all is the predict method. After using the fit method, you can pass new data in with this method and see what the Decision Tree Classifier predicts the target class value(s) would be. The method looks at each row of input data and navigates through the tree using while node.left:. We only need to check node.left because a classification tree will never have a node with only one child node. If there are no child nodes (when node.left == None), we are at the correct leaf and we can append the predicted class to the list of predictions.
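The traversal can be sketched as a standalone function (the hand-built tree below mirrors the weight example from earlier; node attribute names are my assumptions):

```python
class Node:
    def __init__(self, predicted_class):
        self.predicted_class = predicted_class
        self.feature_index = None
        self.threshold = None
        self.left = None
        self.right = None

def predict(X, tree):
    """Walk each row down the tree until a leaf is reached."""
    predictions = []
    for row in X:
        node = tree
        while node.left:  # internal nodes always have both children
            if row[node.feature_index] < node.threshold:
                node = node.left
            else:
                node = node.right
        predictions.append(node.predicted_class)
    return predictions

# Tiny hand-built tree: split on feature 0 (weight) at 300 pounds
root = Node(predicted_class=0)
root.feature_index, root.threshold = 0, 300
root.left = Node(predicted_class=0)   # under 300 pounds -> no diabetes
root.right = Node(predicted_class=1)  # 300+ pounds -> diabetes

print(predict([[350], [200]], root))  # [1, 0]
```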
To compare my Decision Tree Classifier class to scikit-learn’s, I used scikit-learn’s built-in wine toy dataset. I built trees with my class and with scikit-learn’s, then compared the tree structures and test accuracies.
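A sketch of that comparison setup for the scikit-learn side might look like this (the max_depth, test_size, and random_state values are illustrative choices, not necessarily the ones used in the article):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the wine toy dataset: 178 samples, 13 features, 3 classes
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit scikit-learn's tree and measure its test accuracy
sk_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
sk_tree.fit(X_train, y_train)
print(accuracy_score(y_test, sk_tree.predict(X_test)))
```

The same train/test split can then be fed to the from-scratch class so the two trees and accuracies are directly comparable.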
Here is how my tree looked (you can see my print function in my code repository):
Here is how scikit-learn’s tree looked: