Statistics for Machine Learning

Grid search

Grid search is a popular way to tune the hyperparameters of a machine learning model: every combination of candidate values from a predefined grid is evaluated, and the combination that gives the best fit is kept.

The following code predicts whether a particular user will click on an ad. Grid search is applied to a decision tree classifier. The tuning parameters are the maximum depth of the tree, the minimum number of observations in a terminal node, and the minimum number of observations required to split a node:

# Grid search 
>>> import pandas as pd 
>>> from sklearn.tree import DecisionTreeClassifier 
>>> from sklearn.model_selection import train_test_split 
>>> from sklearn.metrics import classification_report,confusion_matrix,accuracy_score 
>>> from sklearn.pipeline import Pipeline 
>>> from sklearn.model_selection import GridSearchCV 
 
>>> input_data = pd.read_csv("ad.csv",header=None)                        
 
>>> # The last column holds the label; all other columns are features 
>>> y = input_data.iloc[:, -1] 
>>> X = input_data.iloc[:, :-1] 

Split the data into training and test sets:

>>> X_train, X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7,random_state=33) 
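The effect of train_size and random_state can be checked on toy data. The following is a minimal sketch, assuming a hypothetical 100-row dataset in place of ad.csv; it only illustrates the split sizes and the reproducibility of the split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a single feature and a binary label.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# Same arguments as in the text: 70% for training, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=33)
print(len(X_train), len(X_test))   # 70 30

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, train_size=0.7, random_state=33)
print(X_train == X_train2)         # True
```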

Wrap the classifier in a pipeline; the step name clf becomes the prefix used to address the classifier's parameters in the grid search:

>>> pipeline = Pipeline([ 
...      ('clf', DecisionTreeClassifier(criterion='entropy')) ]) 

The values to explore are supplied as a Python dictionary that maps each parameter name to its candidate values:

>>> parameters = { 
...      'clf__max_depth': (50,100,150), 
...      'clf__min_samples_split': (2, 3), 
...      'clf__min_samples_leaf': (1, 2, 3)} 
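To see how many candidate models this dictionary implies, the combinations can be enumerated by hand. The following is a minimal sketch using only the standard library; it mirrors the grid above but is not necessarily how scikit-learn builds the grid internally, and the names names and combinations are illustrative.

```python
from itertools import product

# Same grid as in the text: 3 depths x 2 split sizes x 3 leaf sizes.
parameters = {
    'clf__max_depth': (50, 100, 150),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

# Build one dict per combination (Cartesian product of all value tuples).
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[n] for n in names))]

print(len(combinations))   # 3 * 2 * 3 = 18 candidate settings
print(combinations[0])
```

Each candidate is fitted once per cross-validation fold, so the total number of model fits is 18 times the number of folds.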

The n_jobs argument sets the number of CPU cores used to run the search in parallel; -1 uses all available cores. The scoring metric here is accuracy; other options, such as precision, recall, and f1, can be chosen instead:

>>> grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy') 
>>> grid_search.fit(X_train, y_train)  
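Switching the scoring metric only requires changing the scoring argument. The following is a minimal sketch, assuming synthetic data from make_classification as a stand-in for ad.csv, smaller depth values than in the text, and an explicit cv=3; these choices are illustrative and are not taken from the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ad data (assumption: ad.csv is not used here).
X, y = make_classification(n_samples=300, n_features=10, random_state=33)

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy', random_state=33)),
])
parameters = {
    'clf__max_depth': (2, 4, 8),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

# scoring='f1' replaces accuracy; cv=3 requests 3-fold cross-validation.
grid_search = GridSearchCV(pipeline, parameters, scoring='f1', cv=3, n_jobs=1)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
# cv_results_ holds one entry per candidate combination.
print(len(grid_search.cv_results_['params']))
```

Here best_score_ is the mean cross-validated f1 of the best candidate, so it is not directly comparable with an accuracy figure from another run.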

Predict on the test data using the best model found by the grid search:

>>> y_pred = grid_search.predict(X_test) 

Print the best cross-validation score, the best parameter set, and the test-set metrics:

>>> print('\n Best score: \n', grid_search.best_score_) 
>>> print('\n Best parameters set: \n') 
>>> best_parameters = grid_search.best_estimator_.get_params() 
>>> for param_name in sorted(parameters.keys()): 
...     print('\t%s: %r' % (param_name, best_parameters[param_name])) 
>>> print("\n Confusion Matrix on Test data \n", confusion_matrix(y_test, y_pred)) 
>>> print("\n Test Accuracy \n", accuracy_score(y_test, y_pred)) 
>>> print("\nPrecision Recall f1 table \n", classification_report(y_test, y_pred)) 
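The quantities reported above can be reproduced by hand from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python; the label lists are hypothetical values made up for illustration and are not results from the ad data.

```python
# Hypothetical labels for illustration only (1 = ad, 0 = non-ad).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four cells of the binary confusion matrix.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

accuracy = (tp + tn) / len(y_true)     # share of correct predictions
precision = tp / (tp + fp)             # correct among predicted positives
recall = tp / (tp + fn)                # correct among actual positives
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                  # 3 3 1 1
print(accuracy, precision, recall, f1) # 0.75 0.75 0.75 0.75
```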

The R code for grid searches on decision trees is as follows:

# Grid Search on Decision Trees 
library(rpart) 
input_data = read.csv("ad.csv",header=FALSE) 
input_data$V1559 = as.factor(input_data$V1559) 
set.seed(123) 
numrow = nrow(input_data) 
trnind = sample(1:numrow,size = as.integer(0.7*numrow)) 
 
train_data = input_data[trnind,];test_data = input_data[-trnind,] 
minspset = c(2,3);minobset = c(1,2,3) 
initacc = 0 
 
# Try every (minsplit, minbucket) pair; candidates are ranked by training accuracy 
for (minsp in minspset){ 
  for (minob in minobset){ 
    tr_fit = rpart(V1559 ~.,data = train_data,method = "class",minsplit = minsp, minbucket = minob) 
    tr_predt = predict(tr_fit,newdata = train_data,type = "class") 
    tble = table(tr_predt,train_data$V1559) 
    acc = (tble[1,1]+tble[2,2])/sum(tble) 
    # Report test metrics only when this pair beats the best training accuracy so far 
    if (acc > initacc){ 
      tr_predtst = predict(tr_fit,newdata = test_data,type = "class") 
      # Rows are actual classes, columns are predicted classes 
      tblet = table(test_data$V1559,tr_predtst) 
      acct = (tblet[1,1]+tblet[2,2])/sum(tblet) 
      print("Best Score") 
      print(paste("Train Accuracy ",round(acc,3),"Test Accuracy",round(acct,3))) 
      print(paste(" Min split ",minsp," Min obs per node ",minob)) 
      print("Confusion matrix on test data") 
      print(tblet) 
      # Precision: correct predictions divided by the column (predicted) total 
      precsn_0 = (tblet[1,1])/(tblet[1,1]+tblet[2,1]) 
      precsn_1 = (tblet[2,2])/(tblet[1,2]+tblet[2,2]) 
      print(paste("Precision_0: ",round(precsn_0,3),"Precision_1: ",round(precsn_1,3))) 
      # Recall: correct predictions divided by the row (actual) total 
      rcall_0 = (tblet[1,1])/(tblet[1,1]+tblet[1,2]) 
      rcall_1 = (tblet[2,2])/(tblet[2,1]+tblet[2,2]) 
      print(paste("Recall_0: ",round(rcall_0,3),"Recall_1: ",round(rcall_1,3))) 
      initacc = acc 
    } 
  } 
}
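For comparison with the Python section, the same nested-loop search can be sketched in Python. This is a minimal sketch under stated assumptions: synthetic data from make_classification stands in for ad.csv, and candidates are ranked on a held-out validation split, a swapped-in choice that differs from the training accuracy used in the R loop above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for ad.csv (assumption for this sketch).
X, y = make_classification(n_samples=400, n_features=10, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.7, random_state=123)

best_acc, best_params = -1.0, None
for min_split in (2, 3):          # plays the role of minsplit
    for min_leaf in (1, 2, 3):    # plays the role of minbucket
        tree = DecisionTreeClassifier(min_samples_split=min_split,
                                      min_samples_leaf=min_leaf,
                                      random_state=123)
        tree.fit(X_train, y_train)
        acc = accuracy_score(y_val, tree.predict(X_val))
        # Keep the first combination that strictly improves the score.
        if acc > best_acc:
            best_acc, best_params = acc, (min_split, min_leaf)

print(best_params, round(best_acc, 3))
```

Ranking on data the tree has not seen avoids rewarding combinations that merely memorize the training set, which is also what the cross-validation inside GridSearchCV is for.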