The Data Wrangling Workshop

NumPy Arrays

A NumPy array is similar to a list but differs in some ways. In the life of a data scientist, reading and manipulating an array is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list, a multi-dimensional table, or a matrix full of numbers and can be used for a variety of mathematical calculations.

An array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant. Some example scenarios where you will need to handle numeric arrays are as follows:

  • To read a list of phone numbers and postal codes and extract a certain pattern
  • To create a matrix with random numbers to run a Monte Carlo simulation on a statistical process
  • To scale and normalize a sales figure table, with lots of financial and transactional data
  • To create a smaller table of key descriptive statistics (for example, mean, median, min/max range, variance, and inter-quartile ranges) from a large raw data table
  • To read in and analyze daily time series data in a one-dimensional array, such as the stock price of an organization over a year or daily temperature data from a weather station

In short, arrays and numeric data tables are everywhere. As a data wrangling professional, the importance of the ability to read and process numeric arrays cannot be overstated. It is very common to work with data and need to modify it with a mathematical function. In this regard, NumPy arrays are the most important objects in Python that you need to know about.
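
As a rough illustration of two of the scenarios listed above (scaling a set of values and summarizing them with descriptive statistics), the following sketch may help. The temperature values are made up for illustration, and the variable names are hypothetical rather than taken from the workshop's datasets:

```python
import numpy as np

# Hypothetical daily temperatures (made-up values for illustration only)
temps = np.array([21.5, 23.0, 19.5, 22.0, 24.0, 20.0])

# Scale the values to the 0-1 range (min-max normalization)
scaled = (temps - temps.min()) / (temps.max() - temps.min())

# A few key descriptive statistics collected into a small table
q1, q3 = np.percentile(temps, [25, 75])
summary = {
    "mean": temps.mean(),
    "median": np.median(temps),
    "min": temps.min(),
    "max": temps.max(),
    "variance": temps.var(),
    "iqr": q3 - q1,
}

print(scaled)
print(summary)
```

Each statistic here is computed over the whole array in a single call, which is the pattern the rest of this section builds on.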

NumPy Arrays and Features

NumPy and SciPy are open source add-on modules for Python that provide common mathematical and numerical routines as fast, pre-compiled functions. Over the years, these have grown into highly mature libraries whose functionality meets, or perhaps exceeds, what is associated with common commercial software such as MATLAB or Mathematica.

One of the main advantages of the NumPy module is that it can be used to handle or create one-dimensional or multi-dimensional arrays. This advanced data structure/class is at the heart of the NumPy package and it serves as the fundamental building block of more advanced concepts, such as the pandas library and specifically, the pandas DataFrame, which we will cover shortly in this chapter.

NumPy arrays differ from common Python lists, even though a Python list can be thought of as a simple array. NumPy arrays are built for vectorized mathematical operations that process large amounts of numerical data in a single line of code. Many of NumPy's built-in mathematical functions are written in low-level languages such as C or Fortran and are pre-compiled for fast execution.
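
To make the idea of a vectorized operation concrete, here is a minimal sketch that doubles a few made-up values, first with a plain list and then with a NumPy array. The names prices and prices_arr are hypothetical:

```python
import numpy as np

prices = [10.0, 20.0, 30.0]   # plain Python list (made-up values)

# With a list, the arithmetic is spelled out element by element
doubled_list = [p * 2 for p in prices]

# Note that multiplying the list itself repeats it instead of doubling values
repeated = prices * 2

# With a NumPy array, one expression applies to every element
prices_arr = np.array(prices)
doubled_arr = prices_arr * 2

print(doubled_list)
print(repeated)
print(doubled_arr)
```

The array expression avoids the explicit Python loop, and the arithmetic runs inside NumPy's compiled code.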

Note

NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.

Let's go through the first exercise in this chapter, where we will learn how to create a NumPy array from a list.

Exercise 3.01: Creating a NumPy Array (from a List)

In this exercise, we will create a NumPy array from a list. We're going to define a list first and use the array function of the NumPy library to convert the list into an array. Next, we'll read from a .csv file and store the data in a NumPy array using the genfromtxt function of the NumPy library. To do so, let's go through the following steps:

  1. To work with NumPy, we must import it. By convention, we give it a short name, np, while importing it. This keeps references to the objects in the NumPy package short and organized:

    import numpy as np

  2. Create a list with three elements: 1, 2, and 3:

    list_1 = [1,2,3]

    list_1

    The output is as follows:

    [1, 2, 3]

  3. Use the array function to convert it into an array:

    array_1 = np.array(list_1)

    array_1

    The output is as follows:

    array([1, 2, 3])

    We just created a NumPy array object called array_1 from the regular Python list object, list_1.

  4. Create an array of floating type elements, that is, 1.2, 3.4, and 5.6, using the array function directly:

    a = np.array([1.2, 3.4, 5.6])

    a

    The output is as follows:

    array([1.2, 3.4, 5.6])

  5. Let's check the type of the newly created object, a, using the type function:

    type(a)

    The output is as follows:

    numpy.ndarray

  6. Use the type function to check the type of array_1:

    type(array_1)

    The output is as follows:

    numpy.ndarray

    As we can see, both a and array_1 are NumPy arrays.

  7. Now, use type on list_1:

    type(list_1)

    The output is as follows:

    list

    As we can see, list_1 is essentially a Python list and we have used the array function of the NumPy library to create a NumPy array from that list.

  8. Now, let's read a .csv file as a NumPy array using the genfromtxt function of the NumPy library:

    data = np.genfromtxt('../datasets/stock.csv', \

                         delimiter=',',names=True,dtype=None, \

                         encoding='ascii')

    data

    Note

    The path passed to genfromtxt should be specified based on the location of the file on your system. The stock.csv file can be found here: https://packt.live/2YK0XB2.

    The partial output is as follows:

    array([('MMM', 100), ('AOS', 101), ('ABT', 102), ('ABBV', 103),

           ('ACN', 104), ('ATVI', 105), ('AYI', 106), ('ADBE', 107),

           ('AAP', 108), ('AMD', 109), ('AES', 110), ('AET', 111),

           ('AMG', 112), ('AFL', 113), ('A', 114), ('APD', 115),

           ('AKAM', 116), ('ALK', 117), ('ALB', 118), ('ARE', 119),

           ('ALXN', 120), ('ALGN', 121), ('ALLE', 122), ('AGN', 123),

           ('ADS', 124), ('LNT', 125), ('ALL', 126), ('GOOGL', 127),

           ('GOOG', 128), ('MO', 129), ('AMZN', 130), ('AEE', 131),

           ('AAL', 132), ('AEP', 133), ('AXP', 134), ('AIG', 135),

           ('AMT', 136), ('AWK', 137), ('AMP', 138), ('ABC', 139),

           ('AME', 140), ('AMGN', 141), ('APH', 142), ('APC', 143),

           ('ADI', 144), ('ANDV', 145), ('ANSS', 146), ('ANTM', 147),

           ('AON', 148)], dtype=[('Symbol', '<U5'), ('Price', '<i8')])

  9. Use the type function to check the type of data:

    type(data)

    The output is as follows:

    numpy.ndarray

As we can see, the data variable is also a NumPy array.
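
Because names=True was passed, the array returned by genfromtxt is a structured array whose columns can be accessed by name. The following sketch shows this with a small in-memory stand-in for a .csv file. The text in csv_text is hypothetical and is not the contents of stock.csv:

```python
import io
import numpy as np

# Hypothetical stand-in for a small .csv file (not the book's stock.csv)
csv_text = "Symbol,Price\nAAA,100\nBBB,101\nCCC,102\n"

# genfromtxt also accepts file-like objects, so no file on disk is needed
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=None, encoding='ascii')

print(data.dtype.names)   # column names taken from the header row
print(data['Symbol'])     # all values of the Symbol column
print(data['Price'])      # all values of the Price column
print(data[0])            # the first row as a single record
```

Indexing with a column name returns an ordinary one-dimensional array for that column, which can then be used in numerical operations.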

Note

To access the source code for this specific section, please refer to https://packt.live/2Y9pTTx.

You can also run this example online at https://packt.live/2URNcPz.

From this exercise, we can observe that the NumPy array is different from the regular list object. The most important point to keep in mind is that NumPy arrays do not have the same methods as lists and that they are essentially designed for mathematical functions.
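
The difference in methods can be checked directly. The sketch below reuses the names list_1 and array_1 from the exercise; array_2 is introduced here only for illustration:

```python
import numpy as np

list_1 = [1, 2, 3]
array_1 = np.array(list_1)

# Lists grow in place with append; ndarray has no such method
list_1.append(4)
print(hasattr(array_1, 'append'))   # False

# np.append returns a new array and leaves the original unchanged
array_2 = np.append(array_1, 4)

# Arrays offer numerical methods that lists lack
print(array_1.sum(), array_1.mean())
print(hasattr(list_1, 'mean'))      # False
```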

NumPy arrays behave like mathematical vectors. They are built for element-wise operations: when we add two NumPy arrays, the first element of the first array is added to the first element of the second array, the second element to the second, and so on, so there is an element-to-element correspondence in the operation. This is in contrast to Python lists, where the + operator simply concatenates the two lists and there is no element-to-element relation. This is the real power of a NumPy array: it can be treated just like a mathematical vector.

A vector is a collection of numbers that can represent, for example, the coordinates of a point in three-dimensional space or the color (RGB) of a pixel in a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array maintains such order relationships. That's why NumPy arrays are well suited to numerical computations.
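
As a small sketch of treating arrays as vectors, the following example works with two points in three-dimensional space and one RGB color. All values and names (p, q, color) are made up for illustration:

```python
import numpy as np

# Two points in three-dimensional space (made-up coordinates)
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Element-wise difference gives the displacement from p to q
displacement = q - p

# Euclidean distance: square each component, sum, take the square root
distance = np.sqrt((displacement ** 2).sum())

# An RGB color as a vector; halving each channel darkens it
color = np.array([200, 100, 50])
darker = color // 2

print(displacement, distance, darker)
```

In both cases the position of each number matters: the first entry is always the x coordinate (or the red channel), which is why the element-to-element correspondence is essential.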

With this knowledge, we're going to perform the addition operation on NumPy arrays in the next exercise.

Exercise 3.02: Adding Two NumPy Arrays

This simple exercise will demonstrate the addition of two NumPy arrays using the + notation, and thereby show the key difference between a regular Python list and a NumPy array. Let's perform the following steps:

  1. Import the NumPy library:

    import numpy as np

  2. Declare a Python list called list_1 and a NumPy array called array_1:

    list_1 = [1,2,3]

    array_1 = np.array(list_1)

  3. Use the + notation to concatenate two list_1 objects and save the results in list_2:

    list_2 = list_1 + list_1

    list_2

    The output is as follows:

    [1, 2, 3, 1, 2, 3]

  4. Use the same + notation to add two array_1 objects and save the result in array_2:

    array_2 = array_1 + array_1

    array_2

    The output is as follows:

    array([2, 4, 6])

  5. Load a .csv file into an array and add the array to itself:

    data = np.genfromtxt('../datasets/numbers.csv', \

                         delimiter=',', names=True)

    data = data.astype('float64')

    data + data

    Note

    The path passed to genfromtxt should be specified based on the location of the file on your system. The .csv file that will be used is numbers.csv; this can be found at: https://packt.live/30Om2wC.

    The output is as follows:

    array([202., 204., 206., 208., 210., 212., 214., 216., 218.,

           220., 222., 224., 226., 228., 230., 232., 234., 236.,

           238., 240., 242., 244., 246., 248., 250., 252., 254.,

           256., 258., 260., 262., 264., 266., 268., 270., 272.,

           274., 276., 278., 280., 282., 284., 286., 288., 290.,

           292., 294., 296.])

Did you notice the difference? The first output is a list with six elements, [1, 2, 3, 1, 2, 3], but the second output is another NumPy array (or vector) with the elements [2, 4, 6], each of which is the sum of the corresponding individual elements of array_1 with itself. As we discussed earlier, NumPy arrays are designed to perform element-wise operations since there is element-to-element correspondence.
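
To see what the + operator on arrays saves us, the sketch below writes out the same element-wise sum for a list by hand and also shows what happens when two arrays have different lengths. The names summed_list, summed_array, and mismatch_error are hypothetical:

```python
import numpy as np

list_1 = [1, 2, 3]
array_1 = np.array(list_1)

# Element-wise addition of lists has to be written out by hand
summed_list = [a + b for a, b in zip(list_1, list_1)]

# The array expression gives the same values in one step
summed_array = array_1 + array_1

# Arrays of lengths 3 and 2 have no element-to-element pairing,
# so NumPy raises a ValueError instead of guessing
try:
    array_1 + np.array([1, 2])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(summed_list, summed_array, mismatch_error)
```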

Note

To access the source code for this specific section, please refer to https://packt.live/3fyvSqF.

You can also run this example online at https://packt.live/3fvUDnf.

NumPy arrays even support element-wise exponentiation. For example, given two arrays of the same length, each element of the first array is raised to the power of the corresponding element in the second array.
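
A minimal sketch of this, using two small made-up arrays named base and exponent (hypothetical names), could look as follows:

```python
import numpy as np

base = np.array([2, 3, 4])
exponent = np.array([3, 2, 1])

# Each element of base is raised to the matching element of exponent:
# 2**3, 3**2, 4**1
powers = base ** exponent

# np.power is the function form of the same operation
same = np.power(base, exponent)

print(powers)
```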

In the following exercise, we will try out some mathematical operations on NumPy arrays.

Exercise 3.03: Mathematical Operations on NumPy Arrays

In this exercise, we'll generate a NumPy array with the values extracted from a .csv file. We'll be using the multiplication and division operators on the generated NumPy array. Let's go through the following steps:

Note

The .csv file that will be used is numbers.csv; this can be found at: https://packt.live/30Om2wC.

  1. Import the NumPy library and create a NumPy array from the .csv file:

    import numpy as np

    data = np.genfromtxt('../datasets/numbers.csv', \

                         delimiter=',', names=True)

    data = data.astype('float64')

    data

    Note

    Don't forget to change the path passed to genfromtxt based on the location of the file on your system.

    The output is as follows:

    array([101., 102., 103., 104., 105., 106., 107., 108., 109.,

           110., 111., 112., 113., 114., 115., 116., 117., 118.,

           119., 120., 121., 122., 123., 124., 125., 126., 127.,

           128., 129., 130., 131., 132., 133., 134., 135., 136.,

           137., 138., 139., 140., 141., 142., 143., 144., 145.,

           146., 147., 148.])

  2. Multiply every element in the array by 45:

    data * 45

    The output is as follows:

    array([4545., 4590., 4635., 4680., 4725., 4770., 4815., 4860.,

           4905., 4950., 4995., 5040., 5085., 5130., 5175., 5220.,

           5265., 5310., 5355., 5400., 5445., 5490., 5535., 5580.,

           5625., 5670., 5715., 5760., 5805., 5850., 5895., 5940.,

           5985., 6030., 6075., 6120., 6165., 6210., 6255., 6300.,

           6345., 6390., 6435., 6480., 6525., 6570., 6615., 6660.])

  3. Divide the array by 67.7:

    data / 67.7

    The output is as follows:

    array([1.49187592, 1.50664697, 1.52141802, 1.53618907,

           1.55096012, 1.56573117, 1.58050222, 1.59527326,

           1.61004431, 1.62481536, 1.63958641, 1.65435746,

           1.66912851, 1.68389956, 1.69867061, 1.71344165,

           1.7282127 , 1.74298375, 1.7577548 , 1.77252585,

           1.7872969 , 1.80206795, 1.816839 , 1.83161004,

           1.84638109, 1.86115214, 1.87592319, 1.89069424,

           1.90546529, 1.92023634, 1.93500739, 1.94977843,

           1.96454948, 1.97932053, 1.99409158, 2.00886263,

           2.02363368, 2.03840473, 2.05317578, 2.06794682,

           2.08271787, 2.09748892, 2.11225997, 2.12703102,

           2.14180207, 2.15657312, 2.17134417, 2.18611521])

  4. Raise one array to the second array's power using the following command:

    list_1 = [1,2,3]

    array_1 = np.array(list_1)

    print("array_1 raised to the power of array_1: ", \

          array_1**array_1)

    The output is as follows:

    array_1 raised to the power of array_1:  [ 1  4 27]

Thus, we can observe how NumPy arrays allow element-wise exponentiation.
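
If numbers.csv is not at hand, the scalar operations from this exercise can be tried on a generated array instead. The sketch below assumes that the file holds the values 101 through 148, as the output in step 1 suggests, and compares the vectorized multiplication with an explicit loop. The names scaled, looped, and divided are hypothetical:

```python
import numpy as np

# Assumption: stand-in for numbers.csv, holding the values 101 to 148
data = np.arange(101, 149).astype('float64')

# Vectorized: the scalar is applied to every element
scaled = data * 45

# The same computation written as an explicit Python loop
looped = np.array([x * 45 for x in data])

# Division by a scalar works the same way
divided = data / 67.7

print(np.array_equal(scaled, looped))
print(scaled[:3], divided[:3])
```

Both approaches give the same values; the vectorized form is shorter and leaves the looping to NumPy.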

Note

To access the source code for this specific section, please refer to https://packt.live/3hBZMw4.

You can also run this example online at https://packt.live/2N4dE3Y.

In the next section, we'll discuss how to apply advanced mathematical operations to NumPy arrays.