Data aggregation with Pandas DataFrames
Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The Pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a Pandas DataFrame. We will then practice the Pandas aggregation features:
- Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns:
Weather
(a string)Food
(also a string)Price
(a random float)Number
(a random integer between one and nine)
The use case is that we have the results of some sort of consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters:
import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import rand_int import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : rand_int(1, 9,)}) print(df)
You should get an output similar to the following:
Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 3 chocolate 8 5.986585 hot 4 icecream 8 1.560186 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [7 rows x 4 columns]
Note
Please note that the column labels come from the lexically ordered keys of the Python dict. Lexical or lexicographical order is based on the alphabetic order of characters in a string.
- Group the data by the
Weather
column and then iterate through the groups as follows:weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print("Group", i, name) print(group)
We have two types of weather, hot and cold, so we get two groups:
Group 1 cold Food Number Price Weather 0 soup 8 3.745401 cold 2 icecream 4 7.319939 cold 4 icecream 8 1.560186 cold 6 soup 6 0.580836 cold [4 rows x 4 columns] Group 2 hot Food Number Price Weather 1 soup 5 9.507143 hot 3 chocolate 8 5.986585 hot 5 icecream 3 1.559945 hot [3 rows x 4 columns]
- The
weather_group
variable is a special Pandas object that we get as a result of thegroupby()
method. This object has aggregation methods, which are demonstrated as follows:print("Weather group first\n", weather_group.first()) print("Weather group last\n", weather_group.last()) print("Weather group mean\n", weather_group.mean())
The preceding code snippet prints the first row, last row, and mean of each group:
Weather group first Food Number Price Weather cold soup 8 3.745401 hot soup 5 9.507143 [2 rows x 3 columns] Weather group last Food Number Price Weather cold soup 6 0.580836 hot icecream 3 1.559945 [2 rows x 3 columns] Weather group mean Number Price Weather cold 6.500000 3.301591 hot 5.333333 5.684558 [2 rows x 2 columns]
- Just as in a database query, we are allowed to group on multiple columns. The
groups
attribute will then tell us the groups that are formed, as well as the rows in each group:wf_group = df.groupby(['Weather', 'Food']) print("WF Groups", wf_group.groups)
For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:
WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]}
- Apply a list of NumPy functions on groups with the
agg()
method:print("WF Aggregated\n", wf_group.agg([np.mean, np.median]))
Obviously, we could apply even more functions, but it would look messier than the following output:
WF Aggregated Number Price mean median mean median Weather Food cold icecream 6 6 4.440063 4.440063 soup 7 7 2.163119 2.163119 hot chocolate 8 8 5.986585 5.986585 icecream 3 3 1.559945 1.559945 soup 5 5 9.507143 9.507143 [5 rows x 4 columns]
The full data aggregation example code is in the ch-03.ipynb
file, which can be found in this book's code bundle.