Advanced Machine Learning with R
上QQ阅读APP看书,第一时间看更新

Classification tree

This exercise will be an excellent introduction to tree-based methods. I recommend applying this method to any supervised learning method because, at a minimum, you'll get a better understanding of the data and establish a good baseline of predictive performance. It may also be the only thing you need to do to solve a problem for your business partners. An example I can share was where the marketing team tasked me to try and reverse-engineer a customer segmentation done by an external vendor nearly two years in the past. We had the original survey data and the customer segment labels, but no understanding of how the data drove the segmentation.

Well, I just used the methods described in this section, and we could predict a segment with almost 100% accuracy. Plus, as you'll see, it was easy to explain why:

  1. Let's get the packages installed if needed:
library(magrittr)
install.packages("Boruta")
install.packages("caret")
install.packages("classifierplots")
install.packages("InformationValue")
install.packages("MLmetrics")
install.packages("randomForest")
install.packages("ROCR")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("tidyverse")
install.packages("xgboost")
options(scipen=999)
  1. As for the simulated data, I discuss how I created it in Chapter 4, Advanced Feature Selection in Linear Models. You can find it on GitHub at this link: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/sim_df.csv.
  2. We now load it into R:
> sim_df <- read.csv("~/sim_df.csv", stringsAsFactors = FALSE)
  1. The response we'll try and predict is called y. It's a numeric value of either 0 or 1 with 1 being the outcome of interest. It's slightly unbalanced, with about 30% of the responses labeled a 1. Let's confirm that and turn it into a factor, which will tell our tree function we're interested in classification:
 > table(sim_df$y)

0 1
7072 2928

> sim_df$y <- as.factor(sim_df$y)
  1. Create the train/test split using the same random seed as in Chapter 4, Advanced Feature Selection in Linear Models:
> set.seed(1066)

> index <- caret::createDataPartition(sim_df$y, p = 0.7, list = F)

> train <- sim_df[index, ]

> test <- sim_df[-index, ]
  1. To create our classification tree, we'll be using the rpart() function, using the common formula syntax:
> tree_fit <- rpart::rpart(y ~ ., data = train)
  1. The next thing I like to do is look at the cptable from our model object:
> tree_fit$cptable
CP nsplit rel error xerror xstd
1 0.20878049 0 1.0000000 1.0000000 0.01857332
2 0.19609756 1 0.7912195 0.7595122 0.01697342
3 0.01585366 2 0.5951220 0.6029268 0.01556234
4 0.01219512 6 0.5297561 0.5775610 0.01530000
5 0.01000000 8 0.5053659 0.5395122 0.01488626

This is an interesting table to analyze. The first column labeled CP is the cost complexity parameter. The second column, nsplit, is the number of splits in the tree. The rel error column stands for relative error. Both xerror and xstd are based on ten-fold cross-validation, with xerror being the average error and xstd the standard deviation of the cross-validation process. We can see that eight splits produced the lowest error on the full dataset and on cross-validation.

  1. You can examine this using plotcp():
> rpart::plotcp(tree_fit)

The output of the preceding actions is as follows:

The plot shows us the cross-validated relative error by tree size with the corresponding error bars. The horizontal line on the plot is the upper limit of the lowest standard error. Selecting a different tree size, say seven, you would create an object of your desired cp and prune the tree simply by specifying that object in the prune() function, or you can just give the function the cp number. It would look as follows:

> cp = tree_fit$cptable[4, 1]

> cp
[1] 0.01219512

> cp <- min(tree_fit$cptable[, 3])
# not run
# rpart::prune(tree_fit, cp = cp)
# Or
# rpart::prune(tree_fit, cp = 0.01219512)

You can plot and explore the tree in a number of different ways. I prefer the version from the rpart.plot package. There's an excellent vignette on how to use it at the following website:

http://www.milbo.org/rpart-plot/prp.pdf.

Here's the first one, type = 3 with extra = 2 (see the vignette for more options):

> rpart.plot::rpart.plot(
tree_fit,
type = 3,
extra = 2,
branch = .75,
under = TRUE
)

The output of the preceding command is as follows:

The preceding plot shows the feature at each split and the value related to that split. The first split in the tree is the feature TwoFactor1. If the value is less than -1.8 then those observations end up in a terminal node. In this version of the tree, 772 observations are in that node (because the feature value is less than -1.8, and 600 of those observations are labeled 1. So, you can say that the node probability is 78% (600/772) that an observation is a 1. Now, if the value is equal to or greater than -1.8, then it goes to the next feature to split, which is TwoFactor2, and so forth until all observations are in a terminal node. 

If you want to see all of those terminal node probabilities, a simple change to the syntax will suffice:

> rpart.plot::rpart.plot(
tree_fit,
type = 1,
extra = 6,
branch = .75,
under = TRUE
)

The output of the preceding command is as follows:

This different look shows the percentage of 1 in each terminal node and complements the preceding plot. If you want to see all of the rules leading for the nodes, you can run this:

> rpart.plot::rpart.rules(tree_fit)
y
0.10 when TwoFactor1 >= -1.75 & TwoFactor2 < 1.69 & Linear2 >= -0.70
0.17 when TwoFactor1 >= -0.85 & TwoFactor2 < 0.92 & Linear2 < -0.70
0.33 when TwoFactor1 >= -1.75 & TwoFactor2 is 1.69 to 2.20 & Linear2 >= 0.31
0.34 when TwoFactor1 is -1.75 to -0.85 & TwoFactor2 < 0.92 & Linear2 < -0.70 &
Linear3 < -0.14
0.63 when TwoFactor1 >= -1.75 & TwoFactor2 is 0.92 to 1.69 & Linear2 < -0.70
0.72 when TwoFactor1 is -1.75 to -0.85 & TwoFactor2 < 0.92 & Linear2 < -0.70 &
Linear3 >= -0.14
0.78 when TwoFactor1 < -1.75
0.80 when TwoFactor1 >= -1.75 & TwoFactor2 >= 2.20 & Linear2 >= 0.31
0.88 when TwoFactor1 >= -1.75 & TwoFactor2 >= 1.69 & Linear2 < 0.31

We shall now see how this simple model performs on the test set. You may recall that with elastic net, we had an area under the curve (AUC) of over 0.87 and a log-loss of 0.37:

> rparty.test <- predict(tree_fit, newdata = test)

> rparty.test <- rparty.test[, 2]

> classifierplots::density_plot(test$y, rparty.test)

The output of the preceding code is as follows:

Notice the spikes in the density. The plot is capturing the probabilities from those terminal nodes. The real test will be our two favorite metrics, AUC and log-loss:

> ynum <- as.numeric(ifelse(test$y == "1", 1, 0))

> MLmetrics::AUC(rparty.test, ynum)
[1] 0.8201691

> MLmetrics::LogLoss(rparty.test, ynum)
[1] 0.4140015

OK, the performance isn't as good as using elastic net and so on, but overall I don't believe that's too bad for such a simple model that even someone in marketing can understand. We'll see if complicating things with a random forest can surpass elastic net when we look at using it for feature selection. Our next task is to use random forest and boosted trees in a classification problem.