Decision Trees In Bytes
Photo by Fabrice Villard on Unsplash
This is the first post in the Byte Series, where we go over a concept briefly so that it can serve as a quick refresher.
Decision Trees:
As the name suggests, a decision tree uses a tree-like model to make predictions; it resembles an upside-down tree.
- Decision trees are easy to interpret; you can always go back and identify the various factors leading to a decision.
- A decision tree requires you to perform tests on attributes in order to split the data into multiple partitions.
Types:
- Binary and multiway splits: a test at a node can partition the data into exactly two subsets or into more than two.
- Classification and regression trees: the target can be a categorical class label or a continuous numeric value.
Homogeneity:
We always try to generate partitions that result in homogeneous data points. For classification tasks, a data set is completely homogeneous if it contains only a single class label.
Gini Index:
Gini Index, also known as Gini impurity, measures how homogeneous a data set is, based on the probability of each label in the data.
It is the probability that a randomly chosen element would be classified incorrectly if it were labelled at random according to the distribution of classes in the set. If all the elements belong to a single class, the set is called pure.
The Gini Index varies between 0 and 1. A value of 0 expresses purity, i.e. all the elements belong to a single class, while values approach 1 only when the elements are spread evenly across a large number of classes. For a binary target, the maximum is 0.5, reached when the two classes are equally represented.
Gini Index for a binary target variable:
Gini = 1 - P(Target=0)^2 - P(Target=1)^2
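As a quick, minimal sketch of this formula (the binary label array y below is made up purely for illustration), the Gini impurity can be computed directly from the class counts:

import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# Made-up binary target: 6 positives, 4 negatives
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(gini_impurity(y))          # 1 - 0.6^2 - 0.4^2 = 0.48
print(gini_impurity([1, 1, 1]))  # 0.0, a completely pure set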
Entropy:
If a data set is completely homogeneous, then its entropy is 0, i.e. there is no disorder.
The Gini index is a measure of homogeneity, while entropy is a measure of disorder (chaos).
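For completeness, the entropy of a set with class probabilities p_i is -sum(p_i * log2(p_i)). A minimal sketch, mirroring the Gini example above with the same made-up labels:

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(entropy(y))             # ~0.971 bits: a mixed set, high disorder
print(entropy([0, 0, 0, 0]))  # zero: a homogeneous set has no disorder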
Advantages:
- Predictions made by a decision tree are easily interpretable.
- A decision tree does not assume anything specific about the nature of the attributes in a data set. It can seamlessly handle all kinds of data — numeric, categorical, strings, Boolean, and so on.
- It does not require normalisation, since it only has to compare values within an attribute.
- Decision trees often give us an idea of the relative importance of the explanatory attributes used for prediction, as the short sketch below illustrates.
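A minimal sketch of that last point, using the toy iris data set purely for illustration; a fitted classifier exposes the learned importances through its feature_importances_ attribute:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(iris.data, iris.target)

# feature_importances_ sums to 1; a higher value means that attribute drove more splits
for name, importance in zip(iris.feature_names, dt.feature_importances_):
    print(f"{name}: {importance:.3f}")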
Disadvantages:
- Decision trees tend to overfit the data. If allowed to grow with no check on its complexity, a tree will keep splitting till it has correctly classified (or rather, mugged up) all the data points in the training set.
- Decision trees tend to be very unstable, which is an implication of overfitting. A few changes in the data can change a tree considerably.
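To make the overfitting point concrete, here is a minimal sketch on made-up, noisy data, comparing an unconstrained tree with a depth-limited one:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data so that the train/test gap is visible
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
# The unconstrained tree typically scores ~1.0 on the training set but noticeably lower on the test set.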
There are various ways to truncate or prune trees. The DecisionTreeClassifier class in sklearn provides the following hyperparameters, which you can control (a short illustrative instantiation follows the list):
- criterion ("gini" or "entropy"): It defines the function used to measure the quality of a split. Sklearn supports "gini" for the Gini Index and "entropy" for Information Gain. By default, it takes the value "gini".
- max_features: It defines the number of features to consider when looking for the best split. We can pass an integer, a float, a string or None.
- If an integer is given, that many features are considered at each split.
- If a float is given, it is treated as a fraction of the features to consider at each split.
- If "auto" or "sqrt" is given, then max_features=sqrt(n_features).
- If "log2" is given, then max_features=log2(n_features).
- If None, then max_features=n_features. By default, it takes the value None.
- max_depth: The max_depth parameter denotes the maximum depth of the tree. It can take any integer value or None. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.
- min_samples_split: This tells us about the minimum number of samples required to split an internal node. If an integer is given, it is used as the minimum number directly. If a float is given, it is treated as a fraction of the total number of samples. By default, it takes the value 2.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer is given, it is used as the minimum number directly. If a float is given, it is treated as a fraction of the total number of samples. By default, it takes the value 1.
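As a quick illustration of how these hyperparameters are passed (the values below are arbitrary and chosen only for illustration):

from sklearn.tree import DecisionTreeClassifier

# Arbitrary illustrative values; tune them for your own data
clf = DecisionTreeClassifier(
    criterion="entropy",    # or "gini" (the default)
    max_features="sqrt",    # consider sqrt(n_features) features at each split
    max_depth=8,            # cap the depth of the tree
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
)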
# Importing decision tree classifier from sklearn library
from sklearn.tree import DecisionTreeClassifier

# Fitting the decision tree with default hyperparameters, apart from
# max_depth which is 5 so that we can plot and read the tree.
dt_default = DecisionTreeClassifier(max_depth=5)
dt_default.fit(X_train, y_train)
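Since max_depth is capped above so that we can plot and read the tree, here is one way to actually render it. This sketch assumes dt_default has been fitted as in the previous snippet and uses sklearn's plot_tree and export_text utilities:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

# Visualising the fitted tree (assumes dt_default from the snippet above)
plt.figure(figsize=(20, 10))
plot_tree(dt_default, filled=True, fontsize=8)
plt.show()

# Or as plain text, which makes the attribute test at each split explicit
print(export_text(dt_default))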
# GridSearchCV to find optimal max_depth
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'max_depth': range(1, 40)}

# instantiate the model
dtree = DecisionTreeClassifier(criterion="gini", random_state=100)

# fit tree on training data
# return_train_score=True is needed so that mean_train_score is available for plotting below
tree = GridSearchCV(dtree, parameters, cv=n_folds, scoring="accuracy",
                    return_train_score=True)
tree.fit(X_train, y_train)
scores = tree.cv_results_

# plotting accuracies with max_depth
import matplotlib.pyplot as plt

plt.figure()
plt.plot(scores["param_max_depth"], scores["mean_train_score"], label="training accuracy")
plt.plot(scores["param_max_depth"], scores["mean_test_score"], label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
# Create the parameter grid
param_grid = {
    'max_depth': range(5, 15, 5),
    'min_samples_leaf': range(50, 150, 50),
    'min_samples_split': range(50, 150, 50),
    'criterion': ["entropy", "gini"]
}

n_folds = 5

# Instantiate the grid search model
dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=dtree, param_grid=param_grid, cv=n_folds, verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)
# printing the optimal accuracy score and hyperparameters
print("best accuracy", grid_search.best_score_)
print(grid_search.best_params_)

# model with optimal hyperparameters
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=10, min_samples_leaf=50,
                                  min_samples_split=50)
clf_gini.fit(X_train, y_train)

# accuracy score
clf_gini.score(X_test, y_test)