
SVM for churn prediction
SVM is also widely used for large-scale classification tasks (that is, binary as well as multinomial). Moreover, it is a linear ML method, as described in Chapter 1, Analyzing Insurance Severity Claims. The linear SVM algorithm outputs an SVM model, whose loss function can be defined using the hinge loss, as follows:
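L(w; x, y) := max{0, 1 - y wᵀx}

Here, w is the weight vector, x is the feature vector, and y ∈ {+1, -1} is the true label; the loss is zero whenever a point is classified correctly with a margin of at least 1.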
The linear SVMs in Spark are trained with L2 regularization by default. However, L1 regularization is also supported, in which case the optimization problem becomes a linear program.
Now, suppose we have a set of new data points x; the model makes predictions based on the value of wᵀx. By default, if wᵀx ≥ 0, the outcome is positive; otherwise, it is negative.
Now that we know the working principle of SVMs, let's start using the Spark-based implementation of SVM. We start by importing the required packages and libraries:
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.{LinearSVC, LinearSVCModel}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
Now let's create a Spark session and import the implicits:
val spark: SparkSession = SparkSessionCreate.createSession("ChurnPredictionSVM")
import spark.implicits._
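SparkSessionCreate is a small helper object used throughout this chapter to avoid repeating the session setup. If you do not have it at hand, a minimal sketch of an equivalent helper could look like the following (the local master setting is an assumption; adjust it for your cluster):

import org.apache.spark.sql.SparkSession

object SparkSessionCreate {
  // Builds (or reuses) a SparkSession with the given application name
  def createSession(appName: String): SparkSession = {
    SparkSession
      .builder()
      .master("local[*]") // assumption: run locally using all available cores
      .appName(appName)
      .getOrCreate()
  }
}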
We now need to define some hyperparameters to train an SVM-based pipeline:
val numFolds = 10
val MaxIter: Seq[Int] = Seq(100)
val RegParam: Seq[Double] = Seq(1.0) // L2 regularization parameter
val Tol: Seq[Double] = Seq(1e-8)
Now, once we have the hyperparameters defined and initialized, the next task is to instantiate an SVM estimator, as follows:
val svm = new LinearSVC()
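Because each of the hyperparameter sequences above holds a single value, you could equivalently set them directly on the estimator instead of going through the grid; a minimal sketch:

// Optional: set the hyperparameters directly on the estimator
svm.setMaxIter(100)  // maximum number of iterations
  .setRegParam(1.0)  // L2 regularization parameter
  .setTol(1e-8)      // convergence tolerance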
Now that we have three transformers and an estimator ready, the next task is to chain them into a single pipeline, in which each of them acts as a stage:
val pipeline = new Pipeline()
.setStages(Array(PipelineConstruction.ipindexer,
PipelineConstruction.labelindexer,
PipelineConstruction.assembler, svm))
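PipelineConstruction is another helper object from this chapter's project; it exposes the three feature-preprocessing transformers used above. A minimal sketch of what it could look like follows (the input column names and the feature list are assumptions; use the actual columns of your churn training DataFrame):

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object PipelineConstruction {
  // Index the categorical international-plan column (column name is an assumption)
  val ipindexer = new StringIndexer()
    .setInputCol("international_plan")
    .setOutputCol("iplanIndex")

  // Index the string churn column into the numeric "label" column expected by the classifier
  val labelindexer = new StringIndexer()
    .setInputCol("churn")
    .setOutputCol("label")

  // Assemble the numeric feature columns into the single "features" vector column
  // (hypothetical feature list; replace it with the columns of your dataset)
  val assembler = new VectorAssembler()
    .setInputCols(Array("account_length", "iplanIndex", "total_day_mins",
      "total_day_calls", "total_night_mins", "total_intl_calls"))
    .setOutputCol("features")
}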
Let's define a paramGrid to perform a grid search over the hyperparameter space. This searches through the SVM's maximum number of iterations, regularization parameter, and convergence tolerance for the best model:
val paramGrid = new ParamGridBuilder()
.addGrid(svm.maxIter, MaxIter)
.addGrid(svm.regParam, RegParam)
.addGrid(svm.tol, Tol)
.build()
Let's define a BinaryClassificationEvaluator to evaluate the model:
val evaluator = new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("prediction")
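By default, BinaryClassificationEvaluator reports the area under the ROC curve. If you would rather select models by the area under the precision-recall curve, you can switch the metric on a separate evaluator, for example (prEvaluator is just an illustrative name):

// Optional: an evaluator that uses the area under the precision-recall curve instead
val prEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")
  .setMetricName("areaUnderPR")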
We use a CrossValidator to perform 10-fold cross-validation for best-model selection:
val crossval = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(numFolds)
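Ten-fold cross-validation can be slow even for a small grid. If you are on Spark 2.3 or later, you can train several candidate models in parallel by setting the parallelism, for example:

// Optional (Spark 2.3+): evaluate up to 4 parameter settings in parallel
crossval.setParallelism(4)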
Let's now call the fit method so that the complete predefined pipeline, including all feature preprocessing and the SVM classifier, is executed multiple times, each time with a different hyperparameter vector:
val cvModel = crossval.fit(Preprocessing.trainDF)
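The returned CrossValidatorModel also keeps the winning pipeline, so we can inspect which hyperparameters were actually selected; a minimal sketch:

import org.apache.spark.ml.PipelineModel

// The best pipeline found during cross-validation
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]

// The fitted SVM is the last stage of that pipeline
val bestSvm = bestPipeline.stages.last.asInstanceOf[LinearSVCModel]
println("Best regParam: " + bestSvm.getRegParam)
println("Best maxIter: " + bestSvm.getMaxIter)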
Now it's time to evaluate the predictive power of the SVM model on the test dataset. As a first step, we need to transform the test set with the model pipeline, which will map the features according to the same mechanism we described in the preceding feature engineering step:
val predictions = cvModel.transform(Preprocessing.testSet)
predictions.show(10)
>>>

However, from the preceding prediction DataFrame, it is difficult to judge the classification accuracy. In the second step, we evaluate the predictions using the BinaryClassificationEvaluator, as follows:
val accuracy = evaluator.evaluate(predictions)
println("Classification accuracy: " + accuracy)
>>>
Classification accuracy: 0.7530180345969819
So we get about 75% classification accuracy from our binary classification model. However, accuracy alone is not a sufficiently informative metric for a binary classifier.
Hence, researchers often recommend other performance metrics, such as the area under the precision-recall curve and the area under the ROC curve. However, for this we need to construct an RDD containing the predictions and labels on the test set:
val predictionAndLabels = predictions
.select("prediction", "label")
.rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
Now the preceding RDD can be used to compute the two previously-mentioned performance metrics:
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
println("Area under the precision-recall curve: " + metrics.areaUnderPR)
println("Area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
>>>
Area under the precision-recall curve: 0.5595712265324828
Area under the receiver operating characteristic (ROC) curve: 0.7530180345969819
In this case, the evaluation returns an area under the ROC curve of about 75%, but an area under the precision-recall curve of only about 56%. In the following, we calculate some more metrics; for example, the ratios of true and false positive and negative predictions are also useful for evaluating the model's performance:
val lp = predictions.select("label", "prediction")
val counttotal = predictions.count()
val correct = lp.filter($"label" === $"prediction").count()
val wrong = lp.filter(not($"label" === $"prediction")).count()
val ratioWrong = wrong.toDouble / counttotal.toDouble
val ratioCorrect = correct.toDouble / counttotal.toDouble
// Here, label 1.0 denotes a churner (the positive class) and 0.0 a non-churner
val truep = lp.filter($"prediction" === 1.0).filter($"label" === $"prediction").count() / counttotal.toDouble
val truen = lp.filter($"prediction" === 0.0).filter($"label" === $"prediction").count() / counttotal.toDouble
val falsep = lp.filter($"prediction" === 1.0).filter(not($"label" === $"prediction")).count() / counttotal.toDouble
val falsen = lp.filter($"prediction" === 0.0).filter(not($"label" === $"prediction")).count() / counttotal.toDouble
println("Total Count : " + counttotal)
println("Correct : " + correct)
println("Wrong: " + wrong)
println("Ratio wrong: " + ratioWrong)
println("Ratio correct: " + ratioCorrect)
println("Ratio true positive : " + truep)
println("Ratio false positive : " + falsep)
println("Ratio true negative : " + truen)
println("Ratio false negative : " + falsen)
>>>

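Alternatively, the same counts can be read off the confusion matrix computed by MulticlassMetrics from the spark.mllib package, reusing the predictionAndLabels RDD we built earlier; a short sketch:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Rows correspond to actual labels and columns to predicted labels (ordered 0.0, 1.0)
val mcMetrics = new MulticlassMetrics(predictionAndLabels)
println("Confusion matrix:\n" + mcMetrics.confusionMatrix)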
Yet, we have not achieved good accuracy using SVM. Moreover, there is no option to select the most suitable features, which would help us train the model better. This time, we will use a more robust classifier: the decision tree (DT) implementation from the Apache Spark ML package.