
Altering the distribution of data

It has been shown that changing the distribution of the targets, particularly in regression, can have a positive effect on the performance of a learning algorithm (Andrews, D. F., et al., 1971).

Here, we'll discuss one particularly useful transformation known as quantile transformation. This method manipulates the data so that its histogram follows either a normal or a uniform distribution, and it does so by using estimates of the data's quantiles to build the mapping.
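To build some intuition for how this works, the following is a minimal sketch of the idea on synthetic data, assuming NumPy and SciPy are available; the variable names and sample size are purely illustrative, and this is not the exact algorithm that scikit-learn implements:

import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # heavily skewed input data

# Step 1: estimate each value's empirical quantile (its rank divided by n+1)
quantiles = rankdata(x) / (len(x) + 1)

# Step 2: map the quantiles through the inverse CDF of the target
# distribution (here, the standard normal) to obtain bell-shaped data
x_normal = norm.ppf(quantiles)

print(round(x_normal.mean(), 2), round(x_normal.std(), 2))  # roughly 0 and 1

scikit-learn's QuantileTransformer follows the same principle, although it estimates the quantiles more carefully (for example, through its n_quantiles parameter).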

We can use the following commands to transform the same data as in the previous section:

from sklearn.preprocessing import QuantileTransformer

# Map the selected columns to a normal (Gaussian) output distribution
transformer = QuantileTransformer(output_distribution='normal')
df[[4, 9]] = transformer.fit_transform(df[[4, 9]])

This will effectively map the data into a new distribution, namely, a normal distribution. 

Here, the term normal distribution refers to a Gaussian-like probability density function (PDF). This is a classic distribution found in any statistics textbook. It is usually identified by its bell-like shape when plotted.

Note that we are using the fit_transform() method, which conveniently performs fit() and transform() in a single call.
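Equivalently, the two steps can be called separately; the following is a sketch under the assumption that df still holds the original, untransformed values in columns 4 and 9:

# Equivalent two-step version of the call above
transformer = QuantileTransformer(output_distribution='normal')
transformer.fit(df[[4, 9]])                      # learn the empirical quantiles
transformed = transformer.transform(df[[4, 9]])  # apply the learned mapping

# The fitted transformer can also undo the mapping later,
# should the original scale be needed again
recovered = transformer.inverse_transform(transformed)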

As can be seen in Figure 3.5, the variable related to cholesterol data, x5, was easily transformed into a normal distribution with a bell shape. However, for x10, the heavy concentration of data in one region produces a bell shape with a long tail, which is not ideal:

Figure 3.5 – Scatter plot of the normally transformed columns x5 and x10 and their corresponding Gaussian-like histograms
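To produce a quick check similar to Figure 3.5, we can plot histograms of the transformed columns. This is a minimal sketch assuming matplotlib is available and that df already holds the transformed values; the layout and labels are illustrative, not the book's exact figure:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df[4], bins=30)
axes[0].set_title('x5 (cholesterol) after normal transform')
axes[1].hist(df[9], bins=30)
axes[1].set_title('x10 after normal transform')
plt.tight_layout()
plt.show()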

The process of transforming the data to a uniform distribution is very similar. We simply need to change one argument in the QuantileTransformer() constructor, as follows:

transformer = QuantileTransformer(output_distribution='uniform')

Now, the data is transformed into a uniform distribution, as shown in Figure 3.6:

Figure 3.6 – Scatter plot of the uniformly transformed columns x5 and x10 and their corresponding uniform histograms

From the figure, we can see that the data is now uniformly distributed across each variable. Once again, the clustering of data in a particular region produces a large concentration of values in the same space, which is not ideal. This artifact also leaves a gap in the distribution that is usually difficult to handle unless we use techniques to augment the data, which we'll discuss next.