Querying data in Pandas_Python Data Analysis（Second Edition）-短篇小说

上QQ阅读APP看书，第一时间看更新

Querying data in Pandas

Since a Pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command:

$ pip3 install Quandl

Note

Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, refer to the Where to find help and references section in Chapter 1, Getting Started with Python Libraries, or search through the Python docs at https://docs.python.org/2/.

Without further preamble, let's take a look at how to query data in a Pandas DataFrame:

As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows:

        import quandl 
 
        # Data from 
        http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual 
        # PyPi url https://pypi.python.org/pypi/Quandl 
        sunspots = quandl.get("SIDC/SUNSPOTS_A")

The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter:
```
        print("Head 2", sunspots.head(2) ) 
        print("Tail 2", sunspots.tail(2)) 
```
This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns form the dataset):
```
          Head 2             Number
          Year              
          1700-12-31         5
          1701-12-31        11

          [2 rows x 1 columns]
          Tail 2             Number
          Year              
          2012-12-31    57.7
          2013-12-31    64.9
          
          [2 rows x 1 columns]
```
Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index.

The following is the query for the last value using the last date:

        last_date = sunspots.index[-1] 
        print("Last value", sunspots.loc[last_date])

You can check the following output with the result from the previous step:

          Last value Number    64.9
          Name: 2013-12-31 00:00:00, dtype: float64

Query the date with date strings in the YYYYMMDD format as follows:

        print("Values slice by date:\n", sunspots["20020101": 
        "20131231"])

This gives the records from 2002 through to 2013:

          Values slice by date             Number
          Year              
          2002-12-31   104.0
          [TRUNCATED]
          2013-12-31    64.9
          
          [12 rows x 1 columns]

A list of indices can be used to query as well:

        print("Slice from a list of indices:\n", sunspots.iloc[[2, 4, 
        -4, -2]])

The preceding code selects the following rows:

          Slice from a list of indices             Number
          Year              
          1702-12-31    16.0
          1704-12-31    36.0
          2010-12-31    16.0
          2012-12-31    57.7

          [4 rows x 1 columns]

To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column:
```
        print("Scalar with Iloc:", sunspots.iloc[0, 0]) 
        print("Scalar with iat", sunspots.iat[1, 0]) 
```
This gives us the first and second values of the dataset as scalars:
```
          Scalar with Iloc 5.0
          Scalar with iat 11.0
```

Querying with Booleans works much like the Where clause of SQL. The following code queries for values larger than the arithmetic mean. Note that there is a difference between when we perform the query on the whole DataFrame and when we perform it on a single column:

        print("Boolean selection", sunspots[sunspots > 
        sunspots.mean()]) 
        print("Boolean selection with column label:\n",          
        sunspots[sunspots['Number of Observations'] > sunspots['Number 
        of Observations'].mean()])

The notable difference is that the first query yields all the rows, with some rows not conforming to the condition that has a value of NaN. The second query returns only the rows where the value is larger than the mean:

            Boolean selection             Number
            Year                      
            1700-12-31     NaN
            [TRUNCATED]
            1759-12-31    54.0              
                           ...

           [314 rows x 1 columns]
           Boolean selection with column label             Number
           Year              
           1705-12-31    58.0
           [TRUNCATED]
           1870-12-31   139.1               
                          ...

           [127 rows x 1 columns]

The preceding example code is in the ch_03.ipynb file of this book's code bundle.