MLBox
MLBox (http://mlbox.readthedocs.io/en/latest/) is another AutoML library. It supports distributed data processing, cleaning, and formatting, along with state-of-the-art algorithms such as LightGBM and XGBoost. It also supports model stacking, which combines an ensemble of models into a new model that aims to perform better than any of the individual models.
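To make the idea of stacking concrete, here is a minimal sketch that uses scikit-learn's StackingClassifier rather than MLBox itself. The synthetic dataset, the choice of base models, and the logistic regression meta-model are all assumptions made for illustration; MLBox's own stacking may be configured differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base models whose out-of-fold predictions become features for the meta-model
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
]

# The meta-model (final_estimator) learns how to combine the base predictions
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked accuracy: {stack_accuracy:.3f}")
```

The meta-model is trained on cross-validated predictions of the base models, which keeps it from simply memorizing their training-set outputs.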
Here's an example of how to use MLBox, starting with the imports and the dataset download:
# Necessary Imports
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import wget
file_link = 'https://apsportal.ibm.com/exchange-api/v1/entries/8044492073eb964f46597b4be06ff5ea/data?accessKey=9561295fa407698694b1e254d0099600'
file_name = wget.download(file_link)
print(file_name)
# GoSales_Tx_NaiveBayes.csv
The GoSales dataset contains information about customers and their product preferences:
import pandas as pd
df = pd.read_csv('GoSales_Tx_NaiveBayes.csv')
df.head()
You get the following output from the preceding code:
data:image/s3,"s3://crabby-images/a9d38/a9d38a856f49aa1fd931c745ef9512481e14ba38" alt=""
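Let's create a test set from the same dataset by dropping the target column:
# Drop the target so that the test set contains features only
test_df = df.drop(['PRODUCT_LINE'], axis=1)
# Save the first 300 records as the test dataset
test_df[:300].to_csv('test_data.csv')
# Paths to the train and test files, and the name of the target column
paths = ["GoSales_Tx_NaiveBayes.csv", "test_data.csv"]
target_name = "PRODUCT_LINE"
# Read both files and split them into train and test sets
rd = Reader(sep=',')
df = rd.train_test_split(paths, target_name)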
The output will be similar to the following:
data:image/s3,"s3://crabby-images/bd64c/bd64c9f110fc243e9510f4cfef0d1df998151858" alt=""
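The test-set construction above can be tried on a toy frame to see what ends up in the CSV file. The sketch below uses only pandas; the AGE column and all values are hypothetical stand-ins, and only PRODUCT_LINE comes from the text. It also shows that to_csv writes the row index by default, which appears as an extra "Unnamed: 0" column when the file is read back.

```python
import io

import pandas as pd

# Toy stand-in for the GoSales data (AGE and all values are hypothetical)
toy_df = pd.DataFrame(
    {"AGE": [27, 39, 45], "PRODUCT_LINE": ["Golf", "Camping", "Golf"]}
)

# Drop the target column, as in the walkthrough
toy_test = toy_df.drop(["PRODUCT_LINE"], axis=1)

# Default behaviour: the row index is written as the first, unnamed column
with_index = io.StringIO()
toy_test.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes the feature columns only
without_index = io.StringIO()
toy_test.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'AGE']
print(cols_without_index)  # ['AGE']
```

Because the test rows here are taken from the training file, this setup is useful for demonstrating the API but not for estimating how the model generalizes.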
Drift_thresholder helps you drop ID columns and variables whose distributions drift between the train and test datasets:
dft = Drift_thresholder()
df = dft.fit_transform(df)
You get the following output:
data:image/s3,"s3://crabby-images/b43b8/b43b81a5d788d67c2f6387e26290c84722621634" alt=""
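One common way to quantify drift is to ask how well a single feature separates test rows from train rows. The following is a simplified sketch of that idea in plain Python, not MLBox's actual implementation: the drift_score helper, the feature names, and the 0.6 cutoff are all hypothetical choices made for illustration.

```python
def drift_score(train_values, test_values):
    """Return a drift score in [0, 1] for one numeric feature.

    The score is 2 * |AUC - 0.5|, where AUC measures how well the raw
    value ranks test rows above train rows (ties count as half).
    """
    wins = 0.0
    for test_value in test_values:
        for train_value in train_values:
            if test_value > train_value:
                wins += 1.0
            elif test_value == train_value:
                wins += 0.5
    auc = wins / (len(train_values) * len(test_values))
    return abs(auc - 0.5) * 2


# Hypothetical features given as (train values, test values)
features = {
    "customer_id": ([0, 1, 2, 3], [4, 5, 6, 7]),  # ID-like: never overlaps
    "age": ([25, 32, 41, 58], [25, 32, 41, 58]),  # same distribution
}

threshold = 0.6  # illustrative cutoff, chosen for this sketch
scores = {name: drift_score(tr, te) for name, (tr, te) in features.items()}
dropped = [name for name, score in scores.items() if score > threshold]

print(scores)   # {'customer_id': 1.0, 'age': 0.0}
print(dropped)  # ['customer_id']
```

An ID column separates the two sets perfectly and gets the maximum score, while a feature with the same distribution in both sets scores zero and is kept.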
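Optimiser optimizes the hyperparameters of the pipeline. Before running a search, you can score the default configuration by passing None to evaluate: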
opt = Optimiser(scoring = 'accuracy', n_folds = 3)
opt.evaluate(None, df)
You get the following output by running the preceding code:
data:image/s3,"s3://crabby-images/57f9e/57f9ec3c460451c561875d59798f242ed293a4ec" alt=""
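The scoring = 'accuracy' and n_folds = 3 settings correspond to 3-fold cross-validated accuracy. As a rough illustration of what such a score means, the sketch below computes it with scikit-learn's cross_val_score; the synthetic data and the decision tree are assumptions made for this example and are not the pipeline that MLBox evaluates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a simple model stand in for the real pipeline
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = DecisionTreeClassifier(max_depth=4, random_state=0)

# Train on two folds and score accuracy on the held-out fold, three times
fold_scores = cross_val_score(model, X, y, scoring="accuracy", cv=3)
mean_accuracy = fold_scores.mean()

print("Accuracy per fold:", fold_scores)
print(f"Mean accuracy: {mean_accuracy:.3f}")
```

The spread of the per-fold scores gives a sense of how stable the estimate is, which is why a variance is usually reported next to the mean.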
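The following code defines the search space for the parameters of the ML pipeline and then runs the optimization:
space = {
    # ne: missing-value encoder; fill numerical gaps with 0
    'ne__numerical_strategy': {"search": "choice", "space": [0]},
    # ce: categorical encoder; try three encoding strategies
    'ce__strategy': {"search": "choice",
                     "space": ["label_encoding", "random_projection", "entity_embedding"]},
    # fs: feature selector; sample the threshold between 0.01 and 0.3
    'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
    # est: estimator; try tree depths from 3 to 7
    'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
}
# Run the search for 15 evaluations
best = opt.optimise(space, df, 15)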
The following output shows the methods selected for each trial, along with the ML algorithm being tested, which is LightGBM in this case:
data:image/s3,"s3://crabby-images/41721/41721c4b681e29a823c9e20d62d9db4f86e5cfb0" alt=""
You can also see various measures such as accuracy, variance, and CPU time:
data:image/s3,"s3://crabby-images/683de/683de5b27d31f2b5c4ebbb2c07239535e887d069" alt=""
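To clarify how the "choice" and "uniform" entries of the search space are read, here is a small sketch that draws random configurations from the same dictionary using only the standard library. The sample_configuration helper is hypothetical, and plain random sampling is an assumption made for illustration; MLBox's own search strategy may pick candidates differently.

```python
import random

space = {
    'ne__numerical_strategy': {"search": "choice", "space": [0]},
    'ce__strategy': {"search": "choice",
                     "space": ["label_encoding", "random_projection",
                               "entity_embedding"]},
    'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
    'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]},
}


def sample_configuration(space, rng):
    """Draw one candidate configuration from the search space (hypothetical helper)."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            # Pick one of the listed values
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            # Draw a float between the lower and upper bounds
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"Unsupported search type: {spec['search']}")
    return config


rng = random.Random(0)
trials = [sample_configuration(space, rng) for _ in range(15)]
for trial in trials[:3]:
    print(trial)
```

Each trial is one full pipeline configuration; an optimizer scores every trial with cross-validation and keeps the best one.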
With Predictor, you can fit the pipeline with the best hyperparameters and make predictions on the test set:
predictor = Predictor()
predictor.fit_predict(best, df)
You get the following output:
data:image/s3,"s3://crabby-images/464e9/464e93f1d64f0415820bfa79206f6bab05757df5" alt=""