Hands-On Automated Machine Learning

How can linear regression be implemented?

We can create a linear regression model in Python using scikit-learn's LinearRegression class. As this is the first instance where we are going to implement a model in Python, we will take a short detour from our discussion of the algorithm and learn some essential packages that are required to create a model in Python (a brief usage sketch of each follows the list):

  • numpy: This is the numerical Python module, used for mathematical functions. It provides robust data structures for efficient computation on multi-dimensional arrays and matrices.
  • pandas: This provides the DataFrame object for data manipulation. A DataFrame can hold columns of different types. It is used to read, write, and manipulate data in Python.
  • scikit-learn: This is an ML library for Python. It includes various ML algorithms and is a widely used library for creating ML models in Python. Apart from ML algorithms, it also provides various other functions required to develop models, such as train_test_split, model evaluation metrics, and hyperparameter optimization utilities.
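
To make these roles concrete, here is a minimal, self-contained sketch of each package in action (the values are arbitrary illustrations, not part of the Boston dataset):

import numpy as np
import pandas as pd

# numpy: efficient multi-dimensional arrays and vectorized math
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
print(arr.mean())

# pandas: a DataFrame holds labeled columns, possibly of different types
df = pd.DataFrame({'rooms': [6, 7], 'price': [24.0, 21.6]})
print(df.describe())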

We need to first import these required libraries into the Python environment before creating a model. If you are running your code in a Jupyter notebook, it is necessary to declare %matplotlib inline to view plots inline in the interface. We need to import the numpy and pandas packages for easy data manipulation and numerical calculations. The plan for this exercise is to create a linear regression model, so we also need to import the LinearRegression class from the scikit-learn package. We will use scikit-learn's example Boston housing dataset for the task (note that load_boston was deprecated and later removed from scikit-learn, so this example assumes an older version of the library):

%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

Next, we need to load the Boston dataset using the following command. It is a dictionary-like object, and we can examine its keys to view its content:

boston_data = load_boston()
boston_data.keys()

The output of the preceding code is as follows:

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

The boston_data object has four keys that are self-explanatory about the kinds of values they point to. We can retrieve the data and the target values from the keys data and target. The feature_names key holds the names of the attributes, and DESCR holds a description of each attribute.

It is always good practice to look at the size of the data before processing it. This helps us decide whether to work with the full dataset or a sample of it, and to estimate how long processing might take.

The data.shape attribute in Python is an excellent way to view the data's dimensions (rows and columns):

print(" Number of rows and columns in the data set ", boston_data.data.shape)
print(boston_data.feature_names)

The output of the preceding code is as follows:

 Number of rows and columns in the data set  (506, 13)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

Next, we need to convert the dictionary to a DataFrame. This can be accomplished by calling the DataFrame function of the pandas library. We use head() to display a subset of records to validate the data:

boston_df = pd.DataFrame(boston_data.data)
boston_df.head()

The output of the preceding code is as follows:

A DataFrame is a collection of vectors and can be treated as a two-dimensional table. We can think of each row as an observation and each column as an attribute of that observation, which makes DataFrames extremely useful for ML modeling tasks.
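
To make this row/column view concrete, here is a quick sketch (iloc selects by row position; plain indexing selects a column, which works here because the columns are still numeric):

boston_df.iloc[0]   # the first observation (row)
boston_df[0]        # the first attribute (column), before we rename the columns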

The column names are just numeric indexes and don't give a sense of what the DataFrame implies. So, let us assign the feature_names as the column names to the boston_df DataFrame to have meaningful names:

boston_df.columns = boston_data.feature_names

Once again, we check a sample of the Boston housing data, and now the columns are described better than previously:

boston_df.head()

The output of the preceding code is as follows:

For linear regression, we need a target variable and a DataFrame of features to act as predictors. The objective of this exercise is to predict house prices, so we assign PRICE as the target attribute (Y) and all of the rest as predictors (X). PRICE is dropped from the predictor list using the drop function.

Next, we print the intercept and the coefficient of each variable. The coefficients determine the weight and contribution each predictor has in predicting the house price (target Y). The intercept provides a constant value, which we can consider to be the predicted house price when all of the predictors are zero:

boston_df['PRICE'] = boston_data.target
X = boston_df.drop('PRICE', axis=1)
lm = LinearRegression()
lm.fit(X, boston_df.PRICE)
print("Intercept: ", lm.intercept_)
print("Coefficient: ", lm.coef_)

The output of the preceding code is as follows:
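
To see how the intercept and coefficients combine into a prediction, here is a small check (a sketch, not part of the original listing) that recomputes the model's first prediction by hand:

# y_hat = intercept + sum(coef_i * x_i)
manual_pred = lm.intercept_ + np.dot(X.iloc[0], lm.coef_)
print(manual_pred, lm.predict(X)[0])   # the two values should match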

It is not clear from the preceding output which coefficient belongs to which predictor. So, we tie the features and coefficients together using the following code:

pd.DataFrame(list(zip(X.columns, lm.coef_)), columns=['features', 'estimatedCoefficients'])

The output of the preceding code is as follows:

Next, we calculate and view the mean squared error (MSE) metric. For now, let us think of it as the average squared error the model makes in predicting the house price. Evaluation metrics are very important for understanding the dynamics of a model and how it is likely to perform in a production environment:

print(lm.predict(X)[0:5])  # preview the first five predictions
mseFull = np.mean((boston_df.PRICE - lm.predict(X)) ** 2)
print(mseFull)

The output of the preceding code is as follows:
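
scikit-learn also ships a ready-made implementation of this metric; the following one-liner should reproduce the value we computed manually:

from sklearn.metrics import mean_squared_error
print(mean_squared_error(boston_df.PRICE, lm.predict(X)))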

We created the model on the whole dataset, but it is essential to ensure that the model we developed works appropriately on different data when used in a real production environment. For this reason, the data used for modeling is split into two sets, typically in a ratio of 70:30. The larger split is used to train the model, and the other is used to test it. This independent test dataset stands in for a production environment, as it was hidden from the model during the training phase. The test dataset is used to generate predictions and to evaluate the accuracy of the model. Scikit-learn provides a train_test_split method that splits a dataset into two parts. The test_size parameter indicates the fraction of the data to be held out for testing. In the following code, we split the dataset into train and test sets, and then retrain the model:

#Train and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, boston_df.PRICE, test_size=0.3, random_state=42)
print(X_train)
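
A quick way to confirm the 70:30 split is to compare the shapes of the two sets (train_test_split rounds the test size up, so we expect 354 training rows and 152 test rows out of 506):

print(X_train.shape, X_test.shape)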

As we have used test_size=0.3, 70% of the dataset will be used to create the training set, and 30% will be reserved for the test set. We follow the same steps as earlier to create a linear regression model, but now we use only the training data (X_train and Y_train) to fit the model:

lm_tts = LinearRegression()
lm_tts.fit(X_train, Y_train)
print("Intercept: ", lm_tts.intercept_)
print("Coefficient: ", lm_tts.coef_)

The output of the preceding code is as follows:

We predict the target values for both the train and test datasets, and calculate their mean squared error (MSE):

pred_train = lm_tts.predict(X_train)
pred_test = lm_tts.predict(X_test)
print("MSE for Y_train:", np.mean((Y_train - pred_train) ** 2))
print("MSE with Y_test:", np.mean((Y_test - pred_test) ** 2))

The output of the preceding code is as follows:

We see that the MSE values for the train and test datasets are 22.86 and 19.65, respectively. This means the model's performance is similar in both the training and testing phases, so it can reasonably be deployed for predicting house prices on new, unseen data.
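
As an additional sanity check on generalization, LinearRegression's score method returns the R² (coefficient of determination) on any dataset, which should likewise be similar for the two sets:

print("R^2 on train:", lm_tts.score(X_train, Y_train))
print("R^2 on test:", lm_tts.score(X_test, Y_test))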

Next, let's draw a residual plot to check whether a linear model is appropriate for the data:

plt.scatter(pred_train, pred_train - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(pred_test, pred_test - Y_test, c='r', s=40, alpha=0.7)
plt.hlines(y=0, xmin=0, xmax=50, linestyles='dashed')
plt.title('Residual Plot - training data (blue) and test data (red)')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')

The output of the preceding code is as follows:

The residuals are scattered roughly symmetrically around the horizontal dashed line at zero, with no obvious pattern, which suggests that a linear model is a good fit for this data.

Developing a model is easy, but designing a useful model is difficult. Evaluating the performance of an ML model is a crucial step in an ML pipeline. Once a model is ready, we have to assess it to establish its correctness. In the following section, we will walk you through some of the widely used evaluation metrics employed to evaluate a regression model.