Python Data Analysis(Second Edition)
上QQ阅读APP看书,第一时间看更新

Data aggregation with Pandas DataFrames

Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The Pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a Pandas DataFrame. We will then practice the Pandas aggregation features:

  1. Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns:
    • Weather (a string)
    • Food (also a string)
    • Price (a random float)
    • Number (a random integer between one and nine)

    The use case is that we have the results of some sort of consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters:

                import pandas as pd
                from numpy.random import seed
                from numpy.random import rand
                from numpy.random import rand_int
                import numpy as np
    
                seed(42)
    
                df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold',   
                'hot', 'cold', 'hot', 'cold'],
                'Food' : ['soup', 'soup', 'icecream', 'chocolate',
                'icecream', 'icecream', 'soup'],
                'Price' : 10 * rand(7), 'Number' : rand_int(1, 9,)})
                 print(df)

    You should get an output similar to the following:

                      Food  Number     Price Weather
               0       soup       8  3.745401    cold
               1       soup       5  9.507143     hot
               2   icecream       4  7.319939    cold
               3  chocolate       8  5.986585     hot
               4   icecream       8  1.560186    cold
               5   icecream       3  1.559945     hot
               6       soup       6  0.580836    cold
    
               [7 rows x 4 columns]
    Note

    Please note that the column labels come from the lexically ordered keys of the Python dict. Lexical or lexicographical order is based on the alphabetic order of characters in a string.

  2. Group the data by the Weather column and then iterate through the groups as follows:
            weather_group = df.groupby('Weather') 
     
            i = 0 
     
            for name, group in weather_group: 
               i = i + 1 
               print("Group", i, name) 
               print(group) 
    

    We have two types of weather, hot and cold, so we get two groups:

               Group 1 cold
                      Food  Number     Price Weather
               0      soup       8  3.745401    cold
               2  icecream       4  7.319939    cold
               4  icecream       8  1.560186    cold
               6      soup       6  0.580836    cold
    
               [4 rows x 4 columns]
               Group 2 hot
                       Food  Number     Price Weather
               1       soup       5  9.507143     hot
               3  chocolate       8  5.986585     hot
               5   icecream       3  1.559945     hot
    
               [3 rows x 4 columns]
  3. The weather_group variable is a special Pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows:
            print("Weather group first\n", weather_group.first()) 
            print("Weather group last\n", weather_group.last()) 
            print("Weather group mean\n", weather_group.mean()) 
    

    The preceding code snippet prints the first row, last row, and mean of each group:

               Weather group first
                        Food  Number     Price
               Weather                        
               cold     soup       8  3.745401
               hot      soup       5  9.507143
    
               [2 rows x 3 columns]
               Weather group last
                        Food  Number     Price
               Weather                            
               cold         soup       6  0.580836
               hot      icecream       3  1.559945
    
               [2 rows x 3 columns]
               Weather group mean           Number     Price
               Weather                    
               cold     6.500000  3.301591
               hot      5.333333  5.684558
    
               [2 rows x 2 columns]
  4. Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed, as well as the rows in each group:
            wf_group = df.groupby(['Weather', 'Food']) 
            print("WF Groups", wf_group.groups) 
    

    For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:

            WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'):
            [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1],  
            ('cold', 'soup'): [0, 6]} 
    
  5. Apply a list of NumPy functions on groups with the agg() method:
            print("WF Aggregated\n", wf_group.agg([np.mean, np.median])) 
    

    Obviously, we could apply even more functions, but it would look messier than the following output:

    WF Aggregated                    Number             Price                                            
                                       mean  median      mean    median
    Weather Food
    cold    icecream                      6       6  4.440063  4.440063        
            soup                          7       7  2.163119  2.163119
    hot     chocolate                     8       8  5.986585  5.986585              
            icecream                      3       3  1.559945  1.559945           
            soup                          5       5  9.507143  9.507143
    
    [5 rows x 4 columns]

The full data aggregation example code is in the ch-03.ipynb file, which can be found in this book's code bundle.