Deep Learning for Beginners

Real-valued data and univariate regression

Knowing how to deal with categorical data is very important when using classification models based on deep learning; however, knowing how to prepare data for regression is just as important. Data that contains continuous, real values, such as temperatures, prices, weights, or speeds, is suitable for regression; that is, if we have a dataset with columns of different types of values, and one of those columns contains real-valued data, we could perform regression on that column. This implies that we could use all the rest of the dataset to predict the values in that column. This is known as univariate regression, or regression on one variable.

Most machine learning methodologies work better if the data for regression is normalized. By that, we mean that the data will have particular statistical properties that make calculations more stable. This is critical for many deep learning algorithms that suffer from vanishing or exploding gradients (Hanin, B. (2018)). For example, when calculating a gradient in a neural network, an error needs to be propagated backward from the output layer to the input layer; but if the output layer has a large error and the range of the values (that is, their distribution) is also large, then the multiplications going backward can cause numeric overflow, which would ruin the training process.

To overcome these difficulties, it is desirable to normalize the distribution of variables that can be used for regression, or variables that are real-valued. The normalization process has many variants, but we will limit our discussion to two main methodologies, one that sets specific statistical properties of the data, and one that sets specific ranges on the data.

Scaling to a specific range of values

Let's go back to the heart disease dataset discussed earlier in this chapter. Many of its variables are real-valued and would be ideal for regression; for example, x5 and x10.

All variables are suitable for regression, which means that, technically, we can predict on any numeric data. However, the fact that some columns are real-valued makes them more appealing targets for regression, because their values have a meaning that goes beyond integers and natural numbers.

Let's focus on x5 and x10, which are the variables measuring the cholesterol level and the ST depression induced by exercise relative to rest, respectively. What if we want to change the original research question the doctors intended, which was to study heart disease based on different factors? What if we now want to use all the factors, including knowing whether patients have heart disease or not, to determine or predict their cholesterol level? We can do that with regression on x5.
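
Purely as an illustration of what such a regression could look like (this sketch is not part of the original workflow; the linear model, the train/test split, and the handling of missing '?' entries are assumptions introduced here), the remaining columns can be used to predict column 4:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical sketch: predict column 4 (cholesterol, x5) from every other
# column; missing entries recorded as '?' are read as NaN and dropped
data = pd.read_csv('processed.cleveland.data', header=None, na_values='?').dropna()

X = data.drop(columns=[4])   # all remaining factors, including the diagnosis
y = data[4]                  # cholesterol level, the regression target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # coefficient of determination (R^2)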

So, to prepare the data in x5 and x10, we will go ahead and scale it. For verification purposes, we will retrieve descriptive statistics before and after the scaling.

To reload the dataset and display descriptive statistics, we can do the following:

import pandas as pd

df = pd.read_csv('processed.cleveland.data', header=None)
df[[4,9]].describe()

In this case, column indices 4 and 9 correspond to x5 and x10, and the describe() method outputs the following information:

                4           9
count  303.000000  303.000000
mean   246.693069    1.039604
std     51.776918    1.161075
min    126.000000    0.000000
25%    211.000000    0.000000
50%    241.000000    0.800000
75%    275.000000    1.600000
max    564.000000    6.200000

The most notable properties are the mean, and maximum/minimum values contained in that column. These will change once we scale the data to a different range. If we visualize the data as a scatter plot with respective histograms, it looks like Figure 3.2:

Figure 3.2 – Scatter plot of the two columns x5 and x10 and their corresponding histograms
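
A figure like this can be reproduced, for instance, with seaborn's jointplot; the library choice and axis labels below are assumptions for illustration, not the book's original plotting code:

import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sketch: scatter plot of columns 4 and 9 with marginal histograms
g = sns.jointplot(x=df[4], y=df[9], kind='scatter')
g.set_axis_labels('x5 (cholesterol)', 'x10 (ST depression)')
plt.show()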

As can be seen from Figure 3.2, the ranges are quite different, and the distribution of the data is different as well. The new desired range here is a minimum of 0 and a maximum of 1. This range is typical when scaling data, and it can be achieved using scikit-learn's MinMaxScaler object as follows:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                    # default feature range is [0, 1]
scaler.fit(df[[4,9]])                      # learn each column's min and max
df[[4,9]] = scaler.transform(df[[4,9]])    # rescale both columns in place
df[[4,9]].describe()

This will output the following:

                4           9
count  303.000000  303.000000
mean     0.275555    0.167678
std      0.118212    0.187270
min      0.000000    0.000000
25%      0.194064    0.000000
50%      0.262557    0.129032
75%      0.340183    0.258065
max      1.000000    1.000000

What the fit() method does internally is determine the current minimum and maximum values of the data. Then, the transform() method uses that information to subtract the minimum and divide by the range (the maximum minus the minimum), which maps the data to the desired [0, 1] interval. As can be seen, the new descriptive statistics have changed, which can be confirmed by looking at the range in the axes of Figure 3.3:

Figure 3.3 – Scatter plot of the newly scaled columns x5 and x10 and their corresponding histograms

Notice, however, if you pay close attention, that the distribution of the data has not changed. That is, the histograms of the data in Figure 3.2 and Figure 3.3 are still the same. This is a very important fact because, usually, you do not want to change the distribution of the data.
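
Both observations can be checked numerically. The following minimal sketch (introduced here for illustration, not part of the original text) applies the min-max formula by hand on a freshly loaded copy, compares it with the scaler's output, and confirms that the shape of the distribution, measured here by its skewness, is unchanged:

import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical check: (x - min) / (max - min) reproduces MinMaxScaler's result
raw = pd.read_csv('processed.cleveland.data', header=None)[[4, 9]]
manual = (raw - raw.min()) / (raw.max() - raw.min())

print(np.allclose(manual, df[[4,9]]))               # True: same as the scaler
print(np.isclose(skew(raw[4]), skew(manual[4])))    # True: distribution shape preserved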

Standardizing to zero mean and unit variance

Another way of preprocessing real-valued data is by making it have zero mean and unit variance. This process is referred to by many names, such as normalizing, z-scoring, centering, or standardizing.

Let's say that x = [x5, x10], from our features above; then we can standardize x as follows:

x = (x − µ) / σ

Here, µ is a vector corresponding to the means of each column of x, and σ is a vector of the standard deviations of each column of x.

After the standardization of x, if we recompute the mean and standard deviation, we should get a mean of zero and a standard deviation of one. In Python, we do the following:

df[[4,9]] = (df[[4,9]]-df[[4,9]].mean())/df[[4,9]].std()
df[[4,9]].describe()

This will output the following:

                  4             9
count  3.030000e+02  3.030000e+02
mean   1.700144e-16 -1.003964e-16
std    1.000000e+00  1.000000e+00
min   -2.331021e+00 -8.953805e-01
25%   -6.893626e-01 -8.953805e-01
50%   -1.099538e-01 -2.063639e-01
75%    5.467095e-01  4.826527e-01
max    6.128347e+00  4.444498e+00

Notice that after normalization, the mean is, for numerical purposes, zero, and the standard deviation is one. The same thing can be done, of course, using the scikit-learn StandardScaler object as follows:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                  # zero mean, unit variance
scaler.fit(df[[4,9]])                      # learn each column's mean and std
df[[4,9]] = scaler.transform(df[[4,9]])    # standardize both columns in place

This will yield the same results up to negligible numerical differences (StandardScaler divides by the standard deviation computed with N in the denominator, whereas pandas uses N − 1, a tiny difference for 303 samples). For practical purposes, both methods achieve the same thing.
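
As a quick check of that claim, here is a sketch with variable names introduced purely for illustration:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical comparison on a fresh copy of the two columns
cols = pd.read_csv('processed.cleveland.data', header=None)[[4, 9]]
z_pandas = (cols - cols.mean()) / cols.std()          # sample std, N - 1 denominator
z_sklearn = StandardScaler().fit_transform(cols)      # population std, N denominator

print(np.abs(z_pandas.to_numpy() - z_sklearn).max())  # on the order of 0.01 at most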

Although both ways of normalizing are appropriate, working on the DataFrame directly or using a StandardScaler object, you should prefer the StandardScaler object if you are working on a production application. Once the StandardScaler object has been fitted with the fit() method, it can easily be applied to new, unseen data by re-invoking the transform() method; however, if we do it directly on the pandas DataFrame, we will have to manually store the mean and standard deviation somewhere and reload them every time we need to standardize new data.
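
For instance, a fitted scaler can be saved and applied later to unseen data; the use of joblib, the file name, and the sample values below are assumptions for illustration:

import joblib
import pandas as pd

joblib.dump(scaler, 'cholesterol_scaler.joblib')      # persist the fitted scaler

# ... later, possibly in a different process or application ...
scaler = joblib.load('cholesterol_scaler.joblib')
new_data = pd.DataFrame({4: [250.0], 9: [1.4]})       # hypothetical unseen sample
print(scaler.transform(new_data))                     # uses the stored mean and std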

Now, for comparison purposes, Figure 3.4 depicts the new ranges after the normalization of the data. If you look at the axes closely, you will notice that the zero value is positioned where most of the data is, that is, where the mean is. Therefore, the cluster of data is centered around a mean of zero:

Figure 3.4 – Scatter plot of the standardized columns x5 and x10 and their corresponding histograms

Notice, again, that in Figure 3.4, after applying the standardization process, the distribution of the data still does not change. But what if you actually want to change the distribution of the data? Keep reading on to the next section.