Altering the distribution of data
Changing the distribution of the targets, particularly in regression, has been shown to improve the performance of a learning algorithm (Andrews, D. F., et al. (1971)).
Here, we'll discuss one particularly useful transformation known as Quantile Transformation. This method reshapes the data so that its histogram follows either a normal distribution or a uniform distribution. It does so by estimating the quantiles of the data and mapping each value, through its estimated quantile, to the corresponding value of the target distribution.
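To make the idea of mapping through quantiles concrete, here is a minimal sketch of the principle on synthetic data. The function name quantile_to_normal and the exponential sample are assumptions made for illustration. This is not the library's implementation, which interpolates between a fixed number of estimated quantiles rather than ranking every sample.

```python
import numpy as np
from scipy.stats import norm, rankdata


def quantile_to_normal(values):
    """Map a 1-D sample to an approximately standard normal shape (illustrative sketch)."""
    values = np.asarray(values, dtype=float)
    # Rank each value (ties share an average rank) and convert the rank to an
    # empirical quantile; the 0.5 shift keeps every quantile strictly inside (0, 1).
    ranks = rankdata(values, method="average")
    quantiles = (ranks - 0.5) / len(values)
    # The inverse normal CDF turns each quantile into a normal score.
    return norm.ppf(quantiles)


# Synthetic, right-skewed sample (an assumption; not the data used in this chapter)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)
transformed = quantile_to_normal(skewed)

print("mean before/after:", skewed.mean(), transformed.mean())
print("std  before/after:", skewed.std(), transformed.std())
```

The ordering of the samples is preserved; only the spacing between them changes, which is what reshapes the histogram.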
We can use the following commands to transform the same data as in the previous section:
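from sklearn.preprocessing import QuantileTransformer
# Map each selected column onto a normal shape through its estimated quantiles
transformer = QuantileTransformer(output_distribution='normal')
# Fit on columns 4 and 9, then overwrite them with the transformed values
df[[4,9]] = transformer.fit_transform(df[[4,9]])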
This maps the data onto a new distribution, namely a normal distribution.
Note that we are also using the fit_transform() method, which conveniently performs fit() and transform() in a single call.
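As a quick check of that equivalence, the following sketch compares the one-step and two-step forms on synthetic data. The array X is a hypothetical stand-in for df[[4,9]], and n_quantiles is set to the sample count only to avoid the warning raised when the default of 1000 exceeds the number of samples.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the two columns (an assumption for illustration)
rng = np.random.default_rng(42)
X = rng.lognormal(size=(200, 2))

# One step: fit and transform in a single call
one_step = QuantileTransformer(
    n_quantiles=200, output_distribution='normal'
).fit_transform(X)

# Two steps: estimate the quantiles first, then apply the mapping
two_step_transformer = QuantileTransformer(
    n_quantiles=200, output_distribution='normal'
)
two_step_transformer.fit(X)
two_step = two_step_transformer.transform(X)

same = np.allclose(one_step, two_step)
print("one-step and two-step results match:", same)
```

Keeping the two steps separate is useful when the mapping learned from one dataset must later be applied, unchanged, to new data.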
As can be seen in Figure 3.5, the cholesterol variable, x5, was easily transformed into a bell-shaped normal distribution. For x10, however, the heavy concentration of data in one region yields a bell shape with a long tail, which is not ideal:
The process of transforming the data to a uniform distribution is very similar. We only need to change one line, the call to the QuantileTransformer() constructor, as follows:
transformer = QuantileTransformer(output_distribution='uniform')
Now, the data is transformed into a uniform distribution, as shown in Figure 3.6:
From the figure, we can see that the data has been spread uniformly for each variable. Once again, the clustering of data in a particular region causes a large concentration of values at the same location, which is not ideal. This artifact also creates a gap in the distribution that is usually difficult to handle unless we use data augmentation techniques, which we'll discuss next.
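The concentration-and-gap artifact can be reproduced on synthetic data. In the sketch below, the sample sizes and the clustered value 1.5 are assumptions chosen for illustration, with 60% of the samples sharing one value. It does not use the dataset from this chapter.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(7)
low = rng.uniform(0.0, 1.0, size=200)    # spread-out values below the cluster
cluster = np.full(600, 1.5)              # many samples sharing one value
high = rng.uniform(2.0, 3.0, size=200)   # spread-out values above the cluster

values = np.concatenate([low, cluster, high])
cluster_mask = values == 1.5

transformer = QuantileTransformer(n_quantiles=1000, output_distribution='uniform')
U = transformer.fit_transform(values.reshape(-1, 1)).ravel()

# Every clustered sample receives the same output value (the concentration) ...
print("distinct outputs for the cluster:", np.unique(U[cluster_mask]).size)

# ... while the remaining samples are pushed toward the two ends (the gap).
others = U[~cluster_mask]
print("non-cluster outputs in (0.25, 0.75):",
      int(np.sum((others > 0.25) & (others < 0.75))))

# A ten-bin histogram makes the spike and the empty bins visible.
counts, _ = np.histogram(U, bins=10, range=(0.0, 1.0))
print("histogram counts:", counts)
```

Because identical inputs must map to identical outputs, no monotonic transformation can spread a cluster of repeated values apart, which is why the gap remains.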