Practical Data Analysis Using Jupyter Notebook

Practical use cases of NumPy and arrays

Let's walk through a practical use case for working with a one-dimensional array in data analysis. Here's the scenario: you are a data analyst who wants to know the highest daily closing price for a stock ticker for the current Year To Date (YTD). To do this, you can use an array to store each value as an element, sort the price elements from high to low, and then print the first element, which displays the highest price as the output value.

Before loading the file into Jupyter, it is best to inspect the file contents, which supports our Know Your Data (KYD) concept discussed in Chapter 1, Fundamentals of Data Analysis. The following screenshot shows a comma-delimited, structured dataset with two columns. The file includes a header row with a Date field in the format of YYYY-MM-DD and a field labeled Close, which represents the closing price of the stock at the end of the trading day for this stock ticker. This data was downloaded from Yahoo Business, manually changed to exclude some columns, and then stored as a file in the comma-delimited format. The file name represents the ticker of the stock, so AAPL represents the Apple Company, which is a publicly traded company on the National Association of Securities Dealers Automated Quotations (NASDAQ) stock exchange:
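
If you prefer to inspect the file from code rather than in a text editor, a minimal sketch looks like the following. It writes a small stand-in file first so that it runs anywhere. The three closing prices are taken from the sample values used later in this section, but the dates are placeholders and the temporary folder is an assumption of this sketch, not part of the book's setup.

```python
import tempfile
from pathlib import Path

# Stand-in for AAPL_stock_price_example.csv so the sketch is self-contained.
# The dates below are placeholders, NOT the real trading dates in the book's file.
sample_rows = [
    "Date,Close",
    "2000-01-03,142.19",
    "2000-01-04,148.26",
    "2000-01-05,147.93",
]
sample_path = Path(tempfile.mkdtemp()) / "AAPL_stock_price_example.csv"
sample_path.write_text("\n".join(sample_rows) + "\n")

# Read only the first three lines: the header plus two records.
with open(sample_path, "r") as input_file:
    preview = [next(input_file).rstrip("\n") for _ in range(3)]

for line in preview:
    print(line)
```

With the real file, you would point the open() call at the uploaded CSV and check that the header and delimiter match what you expect before loading it into an array.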

The first step would be to load the file that contains the data. I have placed this file in this book's GitHub repository for convenience, so go ahead and set up a new project folder using the best practices covered in Chapter 2, Overview of Python and Installing Jupyter Notebook, by launching a new Jupyter Notebook.

Python syntax is explicit and case sensitive, so don't be discouraged if the output is not what you expected or you receive an error message. In most cases, a simple change in the code will resolve the issue and you can re-run the command.

For this scenario, there are a few options to load data in an array using NumPy.

Assigning values to arrays manually

The first option would be to explicitly assign the values to an array manually, as shown in the following screenshot:

This option is fine for small datasets, testing syntax, or other specific use cases but will be impractical when working with big data or multiple data files. We took a few shortcuts using this option by only typing in a sampling of ten values from the source file. Since all of the stock prices are numeric and have a consistent data type, we can use a one-dimensional array that has a default dtype of float.
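
As a quick check of the data type claim, the following sketch builds the same ten-value array and inspects its dimensions, size, and dtype. It uses only the sample values from this section. Note that 150 is typed as an integer, yet NumPy still stores every element as a float.

```python
import numpy as np

# The ten sample closing prices used in this section; 150 is written as an
# integer while the other values are decimals.
input_stock_price_array = np.array(
    [142.19, 148.26, 147.93, 150.75, 153.31, 153.8, 152.28, 150, 153.07, 154.94]
)

print(input_stock_price_array.ndim)   # number of dimensions
print(input_stock_price_array.size)   # number of elements
print(input_stock_price_array.dtype)  # single data type shared by all elements
```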

The steps to reproduce this option are as follows:

  1. Launch Jupyter and create a new Python notebook.
  2. To stay consistent with best practices, be sure to rename the notebook highest_daily_closing_stock_price_option_1 before moving forward.
  3. Type in the following command to import the numpy library in the notebook input, In []:, and run the cell:
In []: import numpy as np
  4. In the next input cell, add the following command to create a NumPy array of values using the np shortcut and assign it to a variable named input_stock_price_array. Proceed by running the cell, which will not produce an output, Out []:
input_stock_price_array = np.array([142.19,148.26,147.93,150.75,153.31,153.8,152.28,150,153.07,154.94])
  5. In the next input In []: cell, add the following command to assign a sorted copy of the array to a variable named sorted_stock_price_array and run the cell. Similar to before, the result will not produce an output, Out []:
sorted_stock_price_array = np.sort(input_stock_price_array)[::-1]
  6. Type in the following commands, which use the print() function to display the contents of each array variable:
print('Closing stock price in order of day traded: ', input_stock_price_array)
print('Closing stock price in order from high to low: ', sorted_stock_price_array)
Press the Enter key to create a new line so you can add the second command before running the cell.
  7. Verify that the output cell displays Out []:
  • There will be two rows of output with the first as the original array of values.
  • The second output row is a sorted list of the values from the array.
  8. Type in the following command to use the print() function to display the result:
print('Highest closing stock price: ', sorted_stock_price_array[0])
  9. Verify that the output cell displays Out []. The output should state Highest closing stock price: 154.94.

The key concepts to remember from these steps are that you load an initial array of stock price values and name it input_stock_price_array. This step was done after importing the NumPy library and assigning it to the np shortcut, which is a best practice. Next, you create a new array from the original, name it sorted_stock_price_array, and use the sort() function from NumPy. The benefit of the sort() function is that it returns a copy of the original array with the elements ordered from low to high. Since the goal of this scenario is to get the highest value, we apply the [::-1] slice to the sorted result, which reverses it so that the elements are in descending order.
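
To see the two operations separately, here is a small sketch using the first five sample prices. The intermediate names ascending and descending are introduced only for illustration.

```python
import numpy as np

input_stock_price_array = np.array([142.19, 148.26, 147.93, 150.75, 153.31])

# np.sort returns a new array ordered from low to high.
ascending = np.sort(input_stock_price_array)

# The [::-1] slice walks the sorted array backward, giving high to low.
descending = ascending[::-1]

print(ascending)
print(descending)
print(input_stock_price_array)  # the original keeps its day-traded order
```

Because the original array is left untouched, you can still compare the day-traded order with the sorted order later in the notebook.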

Creating a new array from the original array helps to make your analysis easier to repeat and reuse. The order of operation becomes critical in the process so you must walk through the steps in sequence to get the correct results.

To verify the results, we add an extra step to print both arrays together to visually compare the elements and confirm that the new array is sorted in descending order. Since the original task was to get the highest stock price, the final step is to print the first element in the sorted array, which has an index value of 0. If the steps are performed without any errors, you'll see the highest closing stock price from the sampling of data, that is, 154.94.

Assigning values to arrays directly

A more scalable option than manually assigning values to the array is to use the genfromtxt() function, which is available in the numpy library. Using this function, we can populate the array elements directly by reading records from the file by row and column. The genfromtxt() function has a few parameters to support handling the structure of the data by isolating the specific column needed and its data type.

There are multiple required and optional parameters for the genfromtxt() function, which you can find in the Further reading section. For our example, let's walk through the ones required to answer our business question:

  • The first parameter is the filename, which is assigned to the file we upload, named AAPL_stock_price_example.csv.
  • The second parameter is the delimiter, which is a comma since that is how the input file is structured.
  • The next parameter is to inform the function that our input data file has a header by assigning the names= parameter to True.
  • The last parameter is usecols=, which defines the specific column to read the data from.

According to the genfromtxt() function help, column positions passed to the usecols= parameter are counted from 0, so the first column in the file has an index of 0. Since we need the Close column, which is the second column in our input file, we set the parameter value to 1.
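
The following sketch shows the delimiter and usecols parameters at work on a small stand-in file. Two things here are assumptions rather than the book's method. First, the dates in the stand-in file are placeholders. Second, the sketch uses skip_header=1 in place of names=True, because names=True may return a structured array with named fields, while skipping the header line returns a plain one-dimensional float array that is simpler to inspect.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in file with placeholder dates; the prices come from this section's sample.
sample_path = Path(tempfile.mkdtemp()) / "AAPL_stock_price_example.csv"
sample_path.write_text(
    "Date,Close\n"
    "2000-01-03,142.19\n"
    "2000-01-04,148.26\n"
    "2000-01-05,147.93\n"
)

# delimiter=',' splits each record on commas.
# skip_header=1 ignores the header row (swapped in for names=True).
# usecols=1 keeps only the second column, Close, because counting starts at 0.
close_prices = np.genfromtxt(
    str(sample_path), delimiter=",", skip_header=1, usecols=1
)

print(close_prices)
print(close_prices.dtype)
```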

Once the input_stock_price_array is loaded using the genfromtxt() function, a quick size check will validate that the number of elements matches the number of rows in the source file. Note that the header row is excluded from the size. In the following screenshot, you see a few modifications to the manual array option, but once the array is populated with values, the remaining steps are very similar. I added [:5] to the array in each print() call to display only the first five elements and make it easier to compare the source input array and the new sorted array:


The steps to reproduce this option are as follows:

  1. Launch Jupyter and create a new Python notebook.
  2. To stay consistent with best practices, be sure to rename the notebook highest_daily_closing_stock_price_option_2 before moving forward.
  3. Upload the AAPL_stock_price_example.csv file to the Jupyter notebook.
  4. Type in import numpy as np in the In []: cell.
  5. Run the cell.
  6. Type in input_stock_price_array = np.genfromtxt('AAPL_stock_price_example.csv', delimiter=',', names=True, usecols = (1)) in the next In []: cell.
  7. Run the cell.
  8. Type in input_stock_price_array.size in the next In []: cell.
  9. Verify that the output cell displays Out []:. The number of rows is 229 when excluding the header row.
  10. Type in sorted_stock_price_array = np.sort(input_stock_price_array)[::-1] in the next In []: cell.
  11. Run the cell.
  12. Type in print('Closing stock price in order of day traded: ', input_stock_price_array[:5])
    print('Closing stock price in order from high to low: ', sorted_stock_price_array[:5])
    in the next In []: cell.
  13. Run the cell.
  14. Verify that the output cell displays Out []:
  • There will be two rows of output with the first as the original array of values.
  • The second output row is a sorted list of the values from the array.
  15. Type in print('Highest closing stock price: ', sorted_stock_price_array[0]) in the next In []: cell.
  16. Run the cell.
  17. Verify that the output cell displays Out []:. The output should state Highest closing stock price: 267.100006.
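
The size check in these steps can also be automated by counting the data rows in the file and comparing the count with the array size. The sketch below is one way to do that, under stated assumptions: the stand-in file holds only the ten sample prices from the first option, the dates are placeholders, and skip_header=1 is used so that the result is a plain float array. With the book's full file, you would expect 229 data rows instead of 10.

```python
import tempfile
from pathlib import Path

import numpy as np

sample_prices = [142.19, 148.26, 147.93, 150.75, 153.31,
                 153.8, 152.28, 150, 153.07, 154.94]

# Build a stand-in CSV; the dates are placeholders, not real trading dates.
lines = ["Date,Close"]
for day, price in enumerate(sample_prices, start=1):
    lines.append(f"2000-01-{day:02d},{price}")
sample_path = Path(tempfile.mkdtemp()) / "AAPL_stock_price_example.csv"
sample_path.write_text("\n".join(lines) + "\n")

input_stock_price_array = np.genfromtxt(
    str(sample_path), delimiter=",", skip_header=1, usecols=1
)

# Count the non-empty lines in the file and subtract one for the header row.
with open(sample_path, "r") as input_file:
    data_row_count = sum(1 for line in input_file if line.strip()) - 1

print("Rows in file:", data_row_count)
print("Elements in array:", input_stock_price_array.size)

sorted_stock_price_array = np.sort(input_stock_price_array)[::-1]
print("Highest closing stock price: ", sorted_stock_price_array[0])
```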

Assigning values to an array using a loop

Another approach, which uses more code but gives you more flexibility to control data quality while populating the array, is to use a loop. There are a few concepts to walk through with this approach, but I think it is useful to understand and applies to further learning exercises.

A summary of the process is as follows:

  1. Read the file into memory
  2. Loop through each individual record
  3. Strip out a value from each record
  4. Assign each value to a temporary array
  5. Clean up the array
  6. Sort the array in descending order
  7. Print the first element in the array to display the highest price
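
The process above can be sketched compactly as follows. This is a variation for illustration, not the step-by-step code used in this section: it converts each value to a float inside the loop and records the line numbers of values that fail to convert, which is one way a loop lets you control data quality. The stand-in file, its placeholder dates, the deliberately bad N/A record, and the skipped_line_numbers name are all hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in file: placeholder dates and one hypothetical bad record (N/A).
sample_path = Path(tempfile.mkdtemp()) / "AAPL_stock_price_example.csv"
sample_path.write_text(
    "Date,Close\n"
    "2000-01-03,142.19\n"
    "2000-01-04,148.26\n"
    "2000-01-05,N/A\n"
    "2000-01-06,150.75\n"
)

temp_values = []
skipped_line_numbers = []

with open(sample_path, "r") as input_file:
    next(input_file)  # skip the header row
    # Line 1 is the header, so the first data record is line 2.
    for line_number, each_line in enumerate(input_file, start=2):
        try:
            # Take the second field (Close) and convert it to a number.
            temp_values.append(float(each_line.strip().split(",")[1]))
        except (ValueError, IndexError):
            # Keep track of records that could not be used.
            skipped_line_numbers.append(line_number)

input_stock_price_array = np.array(temp_values)
sorted_stock_price_array = np.sort(input_stock_price_array)[::-1]

print("Skipped line numbers:", skipped_line_numbers)
print("Highest closing stock price: ", sorted_stock_price_array[0])
```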

The last few steps in this process should look familiar since they are a repeat from the previous option where we clean the array, sort it, and then print the first element. The complete steps to reproduce this option are as follows:

  1. Launch Jupyter and create a new Python notebook.
  2. To stay consistent with best practices, be sure to rename the notebook highest_daily_closing_stock_price_option_3 before moving forward.
  3. Upload the AAPL_stock_price_example.csv file to the Jupyter notebook.
Be sure to upload the source CSV file in the correct file location so you can reference it in your Jupyter notebook.
  4. Type in the following command to import the numpy library in the notebook input, In []:, and run the cell. There will be no output after running this command:
In []: import numpy as np
  5. Initialize an empty list named temp_array so that it holds no values before we populate it. There will be no output after running this command:
In []: temp_array = []
  6. In the following block of code, we execute multiple consecutive commands in a loop. The sequence is important, and Jupyter will auto-indent as you type in the In []: cell. I included comments to make the code easier to understand. There will be no output after running this command:
#A. Read the file into memory
with open('AAPL_stock_price_example.csv', 'r') as input_file:

    #B. load all the data into a variable
    all_lines_from_input_file = input_file.readlines()

    #C. Loop through each individual record
    for each_individual_line in all_lines_from_input_file:

        #D. Strip out a value from each record
        for value_from_line in \
                each_individual_line.rsplit(',')[1:]:

            #E. Remove the newline character from each value
            clean_value_from_line = \
                value_from_line.replace("\n", "")

            #F. Assign each value to the new array by element
            temp_array.append(clean_value_from_line)
  7. After temp_array is populated with elements, a quick print() call identifies another data cleanup step that is required to move forward. Type in the following command in the next In []: cell and run the cell:
print(temp_array[:5])
  8. Verify that the output cell displays Out [], which will look similar to the following screenshot. The array includes a header row value of Close and has single quotes around the price values:
  9. The header row from the source file has been included in our array. It is easy to remove by using the delete() function to delete the first element and assigning the result back to temp_array. There will be no output after running this command:
temp_array = np.delete(temp_array,0)
  10. Use the size attribute to confirm that the size of the array matches the original source input file by adding the following command and running the cell:
temp_array.size
  11. Verify that the output cell displays Out [], which will look similar to the following screenshot. The number of rows is 229 when excluding the header row:
  12. The elements of the array are stored as strings, which is why each one is displayed with single quotes. This can be remedied with the astype() method by converting the dtype of the array into float, since the stock prices are decimal numeric values. There will be no output after running this command:
input_stock_price_array = temp_array.astype(float)
  13. Print the first few elements in the new array to verify the array has cleaned elements:
print(input_stock_price_array[:5])
  14. Verify the array now has only numeric values in decimal format and the quotes have been removed, similar to the following screenshot:
  15. The last few steps are a repeat from the prior exercise. We start with sorting the array using the sort() function and then apply the [::-1] slice to reverse the result so that it runs from high to low. Type in the following command in the next In []: cell and run the cell. There will be no output after running this command:
sorted_stock_price_array = np.sort(input_stock_price_array)[::-1]
  16. Print the first few elements of both arrays using the print() function to compare the original order with the sorted order by typing in the following commands and running the cell:
print('Closing stock price in order of day traded: ', input_stock_price_array[:5])
print('Closing stock price in order from high to low: ', sorted_stock_price_array[:5])
  17. Verify that the output cell displays Out []:
  • There will be two rows of output with the first as the original array of values.
  • The second output row is a sorted list of the values from the array.

This will look similar to the following screenshot:

  18. To see the highest price, use the print() function with the [0] index against the sorted array to display the first value:
print('Highest closing stock price: ', sorted_stock_price_array[0])
  19. Verify that the output cell displays Out [], which will look similar to the following screenshot. The output should state Highest closing stock price: 267.100006:
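
The two cleanup steps in this option, removing the header with delete() and converting the strings with astype(), can be seen in isolation in the following sketch. It assumes a short list that mimics what the loop produces, using three of the sample prices from earlier in this section rather than the full file.

```python
import numpy as np

# Mimics the loop output: the header value followed by prices stored as strings.
temp_array = ['Close', '142.19', '148.26', '147.93']

# np.delete returns a new NumPy array without the element at index 0.
temp_array = np.delete(temp_array, 0)
print(temp_array)             # elements are still strings, shown with quotes
print(temp_array.dtype.kind)  # 'U' indicates a Unicode string dtype

# astype(float) returns a new array with each string parsed as a number.
input_stock_price_array = temp_array.astype(float)
print(input_stock_price_array)
print(input_stock_price_array.dtype)
```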