Python Data Analysis (Second Edition)

The Pandas Series

The Pandas Series data structure is a one-dimensional labeled array that can hold values of any type (mixed types are stored with the object dtype). We can create a Pandas Series from any of the following:

  • Using a Python dict
  • Using a NumPy array
  • Using a single scalar value

When creating a Series, we can pass the constructor a list of axis labels, commonly referred to as the index. The index is an optional parameter. By default, if we use a NumPy array as the input data, Pandas indexes the values with integers that increase from 0. If the data passed to the constructor is a Python dict, the sorted dict keys become the index (Pandas 0.23 and later, running on Python 3.6 or newer, keep the insertion order of the keys instead). If the input data is a scalar value, we are required to supply the index, and the scalar is repeated for each label in the index.

The Pandas Series and DataFrame interfaces have features and behaviors borrowed from NumPy arrays and Python dictionaries, such as slicing, lookup by key, and vectorized operations. Performing a lookup on a DataFrame column returns a Series. We will demonstrate this and other features of Series by going back to the previous section and loading the CSV file again:
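
As a minimal sketch of the three construction routes described above, consider the following. The values and labels are made up for illustration and are not part of the book's dataset:

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

# From a NumPy array: a default integer index starting at 0 is generated.
from_array = pd.Series(np.array([10, 20, 30]))

# From a scalar: an index must be supplied; the value is repeated per label.
from_scalar = pd.Series(7, index=["x", "y", "z"])

print(list(from_dict.index))    # index labels taken from the dict keys
print(list(from_array.index))   # default integer labels
print(from_scalar.tolist())     # the scalar repeated for every label

# Dictionary-like lookup by key, positional slicing, and a vectorized operation.
value_b = from_dict["b"]
first_two = from_dict.iloc[:2]
doubled = from_array * 2
print(value_b, first_two.tolist(), doubled.tolist())
```

The keys in this example are already in alphabetical order, so the resulting index is the same whether Pandas sorts the keys or keeps their insertion order.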

  1. We will start by selecting the Country column, which happens to be the first column in the datafile. Then, print the type of the DataFrame and the type of the selected column:
            country_col = df["Country"] 
            print("Type df", type(df)) 
            print("Type country col", type(country_col)) 
    

    We can now confirm that we get a Series when we select a column of a DataFrame:

              Type df <class 'pandas.core.frame.DataFrame'>
              Type country col  <class 'pandas.core.series.Series'>
    Note

    If you want, you can open a Python or IPython shell, import Pandas, and, using the dir() function, view a list of functions and attributes for the classes found in the previous printout. However, be aware that you will get a long list of functions in both cases.

  2. The Pandas Series data structure shares some of the attributes of DataFrame, and also has a name attribute. Explore these properties as follows:
            print("Series shape", country_col.shape) 
            print("Series index", country_col.index) 
            print("Series values", country_col.values) 
            print("Series name", country_col.name) 
    

    The output (truncated to save space) is given as follows:

              Series shape (202,)
              Series index Int64Index([0, 1, 2, 3, 4, 5, 
              6, 7, 8, 9, 10, 11, 12, ...], dtype='int64')
              Series values ['Afghanistan' ... 'Vietnam' 'West Bank and          
              Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
              Series name Country
  3. To demonstrate the slicing of a Series, select the last two countries of the Country Series and print the type:
            print("Last 2 countries", country_col[-2:]) 
            print("Last 2 countries type", type(country_col[-2:])) 
    

    Slicing yields another Series, as demonstrated here:

              Last 2 countries
              200      Zambia
              201    Zimbabwe
              Name: Country, dtype: object
              Last 2 countries type <class 'pandas.core.series.Series'>
  4. NumPy functions can operate on Pandas DataFrame and Series objects. We can, for instance, apply the NumPy sign() function, which yields the sign of a number: it returns 1 for positive numbers, -1 for negative numbers, and 0 for zeros. Apply the function to the DataFrame's last column, which happens to be the population for each country in the dataset:
            last_col = df.columns[-1] 
            print("Last df column signs:\n", last_col, 
            np.sign(df[last_col]), "\n") 
    

    The output is truncated here to save space, and is as follows:

              Last df column signs Population (in thousands) total 0     1
              1     1
              [TRUNCATED]
              198   NaN
              199     1
              200     1
              201     1
              Name: Population (in thousands) total, Length: 202, dtype: 
              float64
Note

Please note that the population value at index 198 is NaN. The matching record is given as follows: West Bank and Gaza,199,1,,,,,,
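
The behavior seen in step 4 can be reproduced without the CSV file. The following is a small sketch on a hypothetical Series (the name and numbers are invented for illustration). It suggests that np.sign() returns a Series again, keeps the name, and passes a missing value through as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a positive value, a missing value, a negative value, a zero.
population = pd.Series([28.0, np.nan, -3.0, 0.0], name="Population")

# Applying a NumPy ufunc to a Series yields another Series with the same index.
signs = np.sign(population)

print(type(signs))
print(signs)
```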

We can perform all sorts of numerical operations between DataFrames, Series, and NumPy arrays. If we get the underlying NumPy array of a Pandas Series, subtract this array from the Series, and sum the result, we can reasonably expect one of the following two outcomes:

  • NaN, because the difference contains zeros and at least one NaN (we saw one NaN in the previous step)
  • Zero, because only the zeros are summed

Most NumPy operations that involve NaN values produce NaN, as illustrated by the following IPython session:

In: np.sum([0, np.nan])
Out: nan

Write the following code to perform the subtraction:

print(np.sum(df[last_col] - df[last_col].values))

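The snippet yields the result predicted by the second option, because np.sum() applied to a Series delegates to the Pandas sum() method, which skips NaN values by default: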

0.0
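
To make the difference between the two outcomes concrete, here is a minimal sketch on a hypothetical three-element Series (it does not use the book's dataset). It compares summing the Series with summing its underlying NumPy array:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
values = pd.Series([1.0, np.nan, 3.0])

# Subtracting the underlying array gives zeros, plus NaN where the input was NaN.
diff = values - values.values
nan_count = int(diff.isna().sum())

# np.sum on a Series uses the Pandas reduction, which skips NaN by default.
series_total = np.sum(diff)

# np.sum on the raw NumPy array follows the NumPy rule and propagates NaN.
array_total = np.sum(diff.values)

# Asking Pandas not to skip NaN gives the NumPy-style result as well.
strict_total = diff.sum(skipna=False)

print(nan_count, series_total, array_total, strict_total)
```

If the NaN should show up in the total, passing skipna=False or summing the array from .values are two ways to get it.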

Please refer to the ch-03.ipynb file in this book's code bundle.