
The random forest model
Now, let's run a random forest model. We will grow 2,000 trees and, just for illustration, include Age, Pclass (passenger class), and Fare as the independent variables. Random forest randomizes both the observations used to grow each tree and the variables considered at each split, so you are never certain which specific trees you will get. You might even get one of the trees generated in the previous example!
library(randomForest)
set.seed(123)  # fix the random seed so the forest is reproducible
fit <- randomForest(as.factor(Survived) ~ Age + Pclass + Fare,
                    data = titanic,
                    importance = TRUE,
                    ntree = 2000)
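Each tree in a random forest is grown on a bootstrap sample of the rows, and at each split only a random subset of the predictors is considered. The following base R sketch imitates those two random draws for a single tree. It is illustrative only and is not how randomForest() is implemented internally. The names n_obs, boot_rows, oob_rows, and split_candidates are made up for this example, and the row count of 714 is taken from the totals in the table shown later in this section.

```r
# Illustrative sketch only: mimic the two random draws made for one tree.
set.seed(123)
n_obs <- 714                               # assumed number of rows in titanic
predictors <- c("Age", "Pclass", "Fare")

# 1. Bootstrap sample: draw n_obs row indices with replacement
boot_rows <- sample(n_obs, size = n_obs, replace = TRUE)
oob_rows <- setdiff(seq_len(n_obs), boot_rows)  # rows never drawn ("out-of-bag")

# 2. At a split, consider only a random subset of the predictors.
#    floor(sqrt(p)) is the usual default subset size for classification.
mtry <- floor(sqrt(length(predictors)))
split_candidates <- sample(predictors, size = mtry)

length(unique(boot_rows)) / n_obs  # share of distinct rows in the sample
split_candidates                   # predictor(s) eligible for this split
```

Because both draws change from tree to tree, two forests grown with different seeds will generally contain different trees.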
Random forest also has a predict function, with a similar syntax to the predict function we used in Chapter 1, Getting Started with Predictive Analytics. We will use this function to generate the predictions:
prediction.rf <- predict(fit, titanic)
Once the predictions are generated, we can construct a dataframe consisting of the predictions along with the actual survival outcomes obtained from the raw data:
x <- data.frame(predict.rf = as.factor(prediction.rf), survived = titanic$Survived)
Now we can call the table() function, which counts the actual outcomes cross-classified by their predicted values. The rows hold the predictions and the columns hold the actual outcomes:
table(x$predict.rf,x$survived)
This is the output:
      0   1
  0 384 118
  1  40 172
The numbers in the table reflect the following predictions:
(Row 1, Column 1) Passenger predicted NOT to survive and DID NOT survive
(Row 1, Column 2) Passenger predicted NOT to survive and DID survive
(Row 2, Column 1) Passenger predicted to survive and DID NOT survive
(Row 2, Column 2) Passenger predicted to survive and DID survive
Based on that, we can see that we have made correct predictions for the counts contained in (Row 1,Column 1) and (Row 2,Column 2), since our predictions agree with the outcomes.
To get the proportion of correct predictions, we add up the correct counts and divide the sum by the number of rows. We can do the math at the console:
(384+172) / nrow(titanic)
This is the output:
[1] 0.7787115
Using a random forest, the rate of correct predictions has been raised to about 78%.
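The same rate can be computed from the table itself rather than by typing the counts by hand, which avoids transcription mistakes. The sketch below rebuilds the table from the counts printed above so that it runs on its own. The names conf and accuracy are chosen for this example. With the real data, the object returned by table(x$predict.rf, x$survived) could be used in place of conf.

```r
# Rebuild the confusion table from the counts shown above.
# matrix() fills column by column: column "0" first, then column "1".
conf <- matrix(c(384, 40, 118, 172), nrow = 2,
               dimnames = list(predicted = c("0", "1"),
                               actual    = c("0", "1")))

# Correct predictions lie on the diagonal; divide by the total count.
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

This gives the same value as the console calculation, since the four cells of the table sum to the number of rows in titanic.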