The Data Wrangling Workshop
Statistics and Visualization with NumPy and Pandas

One of the great advantages of using libraries such as NumPy and pandas is that they provide many built-in statistical and visualization methods, so we don't have to search for or write new code. Furthermore, most of these routines are implemented in pre-compiled C or Fortran code, which makes them fast to execute.

Refresher on Basic Descriptive Statistics

For any data wrangling task, it is useful to extract basic descriptive statistics, such as the mean, median, and mode, and to create some simple visualizations or plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated. Descriptive statistics informs us of the basic characteristics of the data, while inferential statistics helps us understand not only the data we are working with but also other data that we might be experimenting with.

Since inferential statistics is intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.

There are two broad approaches to descriptive statistical analysis:

  • Graphical techniques: Bar plots, scatter plots, line charts, box plots, histograms, and so on
  • The calculation of the central tendency and spread: Mean, median, mode, variance, standard deviation, range, and so on
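The second approach can be sketched in a few lines. The following is a minimal example, assuming a small made-up sample (the `values` array is hypothetical and not taken from the exercise data); it uses NumPy for most measures and the standard library's `statistics.mode` for the mode:

```python
import statistics

import numpy as np

# Hypothetical sample, invented purely for illustration
values = np.array([2, 4, 4, 5, 7, 9, 11])

# Central tendency
mean = values.mean()
median = np.median(values)
mode = statistics.mode(values.tolist())  # most frequent value

# Spread
variance = values.var()   # population variance (divides by n)
std_dev = values.std()    # square root of the variance above
value_range = values.max() - values.min()

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std={std_dev:.2f}, range={value_range}")
```

Note that `var()` and `std()` divide by n by default; passing `ddof=1` gives the sample (n - 1) versions instead.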

In this section, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package, matplotlib, which is a powerful, versatile, and widely used visualization library in Python.

Exercise 3.17: Introduction to Matplotlib through a Scatter Plot

In this exercise, we will demonstrate the power and simplicity of matplotlib by creating a simple scatter plot from self-created data about the age, weight, and height of a few people. To do so, let's go through the following steps:

  1. First, we will define simple lists of the names of people, along with their age, weight (in kgs), and height (in centimeters):

    people = ['Ann', 'Brandon', 'Chen', 'David', 'Emily',
              'Farook', 'Gagan', 'Hamish', 'Imran',
              'Joseph', 'Katherine', 'Lily']
    age = [21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15]
    weight = [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48]
    height = [160, 135, 170, 165, 173, 168, 175, 159, 105,
              171, 155, 158]

  2. Import the most important module from matplotlib, called pyplot:

    import matplotlib.pyplot as plt

  3. Create a simple scatter plot of weight against age:

    plt.scatter(age,weight)
    plt.show()

    The output is as follows:

    Figure 3.20: A screenshot of a scatter plot containing age and weight

    The preceding plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding x-axis and y-axis labels with a customized font size, adding grid lines, changing the y-axis limit to be between 0 and 100, adding x and y tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.

  4. The code for the improved plot is as follows:

    # Figure size is given as (width, height) in inches
    plt.figure(figsize=(8,6))
    plt.title("Plot of Age vs. Weight (in kgs)",
              fontsize=20)
    plt.xlabel("Age (years)",fontsize=16)
    plt.ylabel("Weight (kgs)",fontsize=16)
    plt.grid(True)
    plt.ylim(0,100)
    # Tick marks every 5 years: 0, 5, 10, ..., 55
    plt.xticks([i*5 for i in range(12)],fontsize=15)
    plt.yticks(fontsize=15)
    plt.scatter(x=age,y=weight,c='orange',s=150,
                edgecolors='k')
    # Annotation and dashed vertical marker at age 20
    plt.text(x=20,y=85,s="Weights after 18-20 years of age",
             fontsize=15)
    plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',
               color='blue',lw=3)
    plt.legend(['Weight in kgs'],loc=2,fontsize=12)
    plt.show()

    The output is as follows:

Figure 3.21: A screenshot of a scatter plot showing age versus weight

We can observe the following things:

  • A tuple (8,6) is passed as an argument for the figure size.
  • A list comprehension is used inside xticks to create a customized list of tick positions, 0-5-10-…-55.
  • The plt.text() function places the annotation at the data coordinates x=20, y=85. A newline (\n) character can be used inside the string to break a long annotation into two lines.
  • The plt.show() function is used at the very end. The idea is to keep on adding various graphics properties (font, color, axis limits, text, legend, grid, and so on) until you are satisfied and then show the plot with one function. In a plain Python script, the plot will not be displayed without this last function call (some notebook environments render the figure automatically).
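The first three observations can be tried in isolation. The following is a minimal sketch, assuming matplotlib's object-oriented interface (`plt.subplots`) rather than the pyplot calls used in the exercise; the two-line annotation string is made up for illustration:

```python
import matplotlib.pyplot as plt

# Same tick positions as in the exercise: 0, 5, 10, ..., 55
tick_positions = [i * 5 for i in range(12)]

# Hypothetical two-line annotation; "\n" starts the second line
annotation = "Weights after\n18-20 years of age"

# figsize is a (width, height) tuple in inches
fig, ax = plt.subplots(figsize=(8, 6))
ax.set_xticks(tick_positions)
ax.set_ylim(0, 100)
text_artist = ax.text(20, 85, annotation, fontsize=15)

# All properties are set first; the figure is shown once at the end
plt.show()
```

Holding explicit `fig` and `ax` objects is a design choice that becomes convenient once a figure contains more than one plot; for a single plot, the pyplot calls in the exercise are equivalent.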

The preceding plot is quite self-explanatory. We can observe that the variations in weight are reduced after 18-20 years of age.
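The visual impression can be checked numerically. The following is a rough sketch that reuses the age and weight lists from step 1 and compares the standard deviation of weight on either side of a cutoff; the cutoff of 20 years is an assumption chosen to match the dashed line in the plot, and with only 12 people the result is illustrative rather than a statistical test:

```python
import numpy as np

# Data from step 1 of the exercise
age = np.array([21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15])
weight = np.array([55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48])

# Assumption: split at 20 years, matching the dashed line in the plot
cutoff = 20
is_adult = age > cutoff

# Spread of weight (population standard deviation) in each group
young_std = weight[~is_adult].std()
adult_std = weight[is_adult].std()

print(f"Std of weight, age <= {cutoff}: {young_std:.2f} kg")
print(f"Std of weight, age >  {cutoff}: {adult_std:.2f} kg")
```

For this data, the spread in the older group comes out smaller than in the younger group, which agrees with what the plot suggests.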

Note

To access the source code for this specific section, please refer to https://packt.live/3hFzysK.

You can also run this example online at https://packt.live/3eauxWP.

In this exercise, we have gone through the basics of using matplotlib, a popular charting library. In the next section, we will look at the definition of statistical measures.