Python Data Analysis(Second Edition)
上QQ阅读APP看书,第一时间看更新

Querying data in Pandas

Since a Pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command:

$ pip3 install Quandl
Note

Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, refer to the Where to find help and references section in Chapter 1, Getting Started with Python Libraries, or search through the Python docs at https://docs.python.org/2/.

Without further preamble, let's take a look at how to query data in a Pandas DataFrame:

  1. As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows:
            import quandl 
     
            # Data from 
            http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual 
            # PyPi url https://pypi.python.org/pypi/Quandl 
            sunspots = quandl.get("SIDC/SUNSPOTS_A") 
    
  2. The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter:
            print("Head 2", sunspots.head(2) ) 
            print("Tail 2", sunspots.tail(2)) 
    

    This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns form the dataset):

              Head 2             Number
              Year              
              1700-12-31         5
              1701-12-31        11
    
              [2 rows x 1 columns]
              Tail 2             Number
              Year              
              2012-12-31    57.7
              2013-12-31    64.9
              
              [2 rows x 1 columns]

    Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index.

  3. The following is the query for the last value using the last date:
            last_date = sunspots.index[-1] 
            print("Last value", sunspots.loc[last_date]) 
    

    You can check the following output with the result from the previous step:

              Last value Number    64.9
              Name: 2013-12-31 00:00:00, dtype: float64
  4. Query the date with date strings in the YYYYMMDD format as follows:
            print("Values slice by date:\n", sunspots["20020101": 
            "20131231"]) 
    

    This gives the records from 2002 through to 2013:

              Values slice by date             Number
              Year              
              2002-12-31   104.0
              [TRUNCATED]
              2013-12-31    64.9
              
              [12 rows x 1 columns]
  5. A list of indices can be used to query as well:
            print("Slice from a list of indices:\n", sunspots.iloc[[2, 4, 
            -4, -2]]) 
    

    The preceding code selects the following rows:

              Slice from a list of indices             Number
              Year              
              1702-12-31    16.0
              1704-12-31    36.0
              2010-12-31    16.0
              2012-12-31    57.7
    
              [4 rows x 1 columns]
  6. To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column:
            print("Scalar with Iloc:", sunspots.iloc[0, 0]) 
            print("Scalar with iat", sunspots.iat[1, 0]) 
    

    This gives us the first and second values of the dataset as scalars:

              Scalar with Iloc 5.0
              Scalar with iat 11.0
  7. Querying with Booleans works much like the Where clause of SQL. The following code queries for values larger than the arithmetic mean. Note that there is a difference between when we perform the query on the whole DataFrame and when we perform it on a single column:
            print("Boolean selection", sunspots[sunspots > 
            sunspots.mean()]) 
            print("Boolean selection with column label:\n",          
            sunspots[sunspots['Number of Observations'] > sunspots['Number 
            of Observations'].mean()]) 
    

    The notable difference is that the first query yields all the rows, with some rows not conforming to the condition that has a value of NaN. The second query returns only the rows where the value is larger than the mean:

                Boolean selection             Number
                Year                      
                1700-12-31     NaN
                [TRUNCATED]
                1759-12-31    54.0              
                               ...
    
               [314 rows x 1 columns]
               Boolean selection with column label             Number
               Year              
               1705-12-31    58.0
               [TRUNCATED]
               1870-12-31   139.1               
                              ...
    
               [127 rows x 1 columns]

The preceding example code is in the ch_03.ipynb file of this book's code bundle.