Hands-On Automated Machine Learning
上QQ阅读APP看书,第一时间看更新

MLBox

MLBox (http://mlbox.readthedocs.io/en/latest/) is another AutoML library and it supports distributed data processing, cleaning, formatting, and state-of-the-art algorithms such as LightGBM and XGBoost. It also supports model stacking, which allows you to combine an information ensemble of models to generate a new model aiming to have better performance than the individual models.

Here's an example of its usage:

# Necessary Imports
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import wget

file_link = 'https://apsportal.ibm.com/exchange-api/v1/entries/8044492073eb964f46597b4be06ff5ea/data?accessKey=9561295fa407698694b1e254d0099600'
file_name = wget.download(file_link)

print(file_name)
# GoSales_Tx_NaiveBayes.csv

The GoSales dataset contains information for customers and their product preferences:

import pandas as pd
df = pd.read_csv('GoSales_Tx_NaiveBayes.csv')
df.head()

You get the following output from the preceding code: 

Let's create a test set from the same dataset by dropping a target column:

test_df = df.drop(['PRODUCT_LINE'], axis = 1)

# First 300 records saved as test dataset
test_df[:300].to_csv('test_data.csv')

paths = ["GoSales_Tx_NaiveBayes.csv", "test_data.csv"]
target_name = "PRODUCT_LINE"

rd = Reader(sep = ',')
df = rd.train_test_split(paths, target_name)

The output will be similar to the following: 

Drift_thresholder will help you to drop IDs and drifting variables between train and test datasets:

dft = Drift_thresholder()
df = dft.fit_transform(df)

You get the following output:

Optimiser will optimize the hyperparameters:

opt = Optimiser(scoring = 'accuracy', n_folds = 3)
opt.evaluate(None, df)

You get the following output by running the preceding code:

The following code defines the parameters of the ML pipeline:

space = {  
'ne__numerical_strategy':{"search":"choice", "space":[0]},
'ce__strategy':{"search":"choice",
"space":["label_encoding","random_projection", "entity_embedding"]},
'fs__threshold':{"search":"uniform", "space":[0.01,0.3]},
'est__max_depth':{"search":"choice", "space":[3,4,5,6,7]}
}

best = opt.optimise(space, df,15)

The following output shows you the selected methods that are being tested by being given the ML algorithms, which is LightGBM in this output:

You can also see various measures such as accuracy, variance, and CPU time:

Using Predictor, you can use the best model to make predictions:

predictor = Predictor()
predictor.fit_predict(best, df)

You get the following output: