Scientific Computing and NumPy Basics
The term scientific computing has been used several times in this workshop so far; in the broadest sense of the term, it denotes the process of using computer programs (or anything with computing capabilities) to model and solve a specific problem in mathematics, engineering, or science. Examples may include mathematical models to look for and analyze patterns and trends in biological and social data, or machine learning models to make future predictions using economic data. As you may have already noticed, this definition has a significant overlap with the general fields of data science, and sometimes the terms are even used interchangeably.
The main workhorse of many (if not most) scientific computing projects in Python is the NumPy library. Since NumPy is an external library that does not come preinstalled with Python, we need to download and install it. As you may already know, installing external libraries and packages in Python can be done easily using package managers such as pip or Anaconda.
From your Terminal, run the following command to use pip to install NumPy in your Python environment:
$ pip install numpy
If you are currently in an Anaconda environment, you can run the following command instead:
$ conda install numpy
With these simple commands, all the necessary steps in the installation process are taken care of for us.
Some of NumPy's most powerful capabilities include vectorized, multi-dimensional array representations of objects; implementation of a wide range of linear algebraic functions and transformations; and random sampling. We will cover all of these topics in this section, starting with the general concept of arrays.
NumPy Arrays
We have actually already come across the concept of an array in the previous chapter, when we discussed Python lists. In general, an array is also a sequence of different elements that can be accessed inpidually or manipulated as a whole. As such, NumPy arrays are very similar to Python lists; in fact, the most common way to declare a NumPy array is to pass a Python list to the numpy.array() method, as illustrated here:
>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> a
array([1, 2, 3])
>>> a[1]
2
The biggest difference we need to keep in mind is that elements in a NumPy array need to be of the same type. For example, here, we are trying to create an array with two numbers and a string, which causes NumPy to forcibly convert all elements in the array into strings (the <U21 data type denotes the Unicode strings with fewer than 21 characters):
>>> b = np.array([1, 2, 'a'])
>>> b
array(['1', '2', 'a'], dtype='<U21')
Similar to the way we can create multi-dimensional Python lists, NumPy arrays support the same option:
>>> c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> c
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Note
While working with NumPy, we often refer to multi-dimensional arrays as matrices.
Apart from initialization from Python lists, we can create NumPy arrays that are in a specific form. In particular, a matrix full of zeros or ones can be initialized using np.zeros() and np.ones(), respectively, with a given dimension and data type. Let's have a look at an example:
>>> zero_array = np.zeros((2, 2)) # 2 by 2 zero matrix
>>> zero_array
array([[0., 0.],
[0., 0.]])
Here, the tuple (2, 2) specifies that the array (or matrix) being initialized should have a two-by-two dimension. As we can see by the dots after the zeros, the default data type of a NumPy array is a float and can be further specified using the dtype argument:
>>> one_array = np.ones((2, 2, 3), dtype=int) # 3D one integer matrix
>>> one_array
array([[[1, 1, 1],
[1, 1, 1]],
[[1, 1, 1],
[1, 1, 1]]])
All-zero or all-one matrices are common objects in mathematics and statistics, so these API calls will prove to be quite useful later on. Now, let's look at a common matrix object whose elements are all random numbers. Using np.random.rand(), we can create a matrix of a given shape, whose elements are uniformly sampled between 0 (inclusive) and 1 (exclusive):
>>> rand_array = np.random.rand(2, 3)
>>> rand_array
array([[0.90581261, 0.88732623, 0.291661 ],
[0.44705149, 0.25966191, 0.73547706]])
Notice here that we are not passing the desired shape of our matrix as a tuple anymore, but as inpidual parameters of the np.random.rand() function instead.
If you are not familiar with the concept of randomness and random sampling from various distributions, don't worry, as we will cover that topic later on in this chapter as well. For now, let's move forward with our discussion about NumPy arrays, particularly about indexing and slicing.
You will recall that in order to access inpidual elements in a Python list, we pass its index inside square brackets next to the list variable; the same goes for one-dimensional NumPy arrays:
>>> a = np.array([1, 2, 3])
>>> a[0]
1
>>> a[1]
2
However, when an array is multi-dimensional, instead of using multiple square brackets to access subarrays, we simply need to separate the inpidual indices using commas. For example, we access the element in the second row and the second column of a three-by-three matrix as follows:
>>> b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> b
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> b[1, 1]
5
Slicing NumPy arrays can be done in the same way: using commas. This syntax is very useful in terms of helping us access submatrices with more than one dimension in a matrix:
>>> a = np.random.rand(2, 3, 4) # random 2-by-3-by-4 matrix
>>> a
array([[[0.54376986, 0.00244875, 0.74179644, 0.14304955],
[0.77229612, 0.32254451, 0.0778769 , 0.2832851 ],
[0.26492963, 0.5217093 , 0.68267418, 0.29538502]],
[[0.94479229, 0.28608588, 0.52837161, 0.18493272],
[0.08970716, 0.00239815, 0.80097454, 0.74721516],
[0.70845696, 0.09788526, 0.98864408, 0.82521871]]])
>>> a[1, 0: 2, 1:]
array([[0.28608588, 0.52837161, 0.18493272],
[0.00239815, 0.80097454, 0.74721516]])
In the preceding example, a[1, 0: 2, 1:] helps us to access the numbers in the original matrix, a; that is, in the second element in the first axis (corresponding to index 1), the first two elements in the second axis (corresponding to 0: 2), and the last three elements in the third axis (corresponding to 1:). This option is one reason why NumPy arrays are more powerful and flexible than Python lists, which do not support multi-dimensional indexing and slicing, as we have demonstrated.
Finally, another important syntax to manipulate NumPy arrays is the np.reshape() function, which, as its name suggests, changes the shape of a given NumPy array. The need for this functionality can arise on multiple occasions: when we need to display an array in a certain way for better readability, or when we need to pass an array to a built-in function that only takes in arrays of a certain shape.
We can explore the effect of this function in the following code snippet:
>>> a
array([[[0.54376986, 0.00244875, 0.74179644, 0.14304955],
[0.77229612, 0.32254451, 0.0778769 , 0.2832851 ],
[0.26492963, 0.5217093 , 0.68267418, 0.29538502]],
[[0.94479229, 0.28608588, 0.52837161, 0.18493272],
[0.08970716, 0.00239815, 0.80097454, 0.74721516],
[0.70845696, 0.09788526, 0.98864408, 0.82521871]]])
>>> a.shape
(2, 3, 4)
>>> np.reshape(a, (3, 2, 4))
array([[[0.54376986, 0.00244875, 0.74179644, 0.14304955],
[0.77229612, 0.32254451, 0.0778769 , 0.2832851 ]],
[[0.26492963, 0.5217093 , 0.68267418, 0.29538502],
[0.94479229, 0.28608588, 0.52837161, 0.18493272]],
[[0.08970716, 0.00239815, 0.80097454, 0.74721516],
[0.70845696, 0.09788526, 0.98864408, 0.82521871]]])
Note that the np.reshape() function does not mutate the array that is passed in-place; instead, it returns a copy of the original array with the new shape without modifying the original. We can also assign this returned value to a variable.
Additionally, notice that while the original shape of the array is (2, 3, 4), we changed it to (3, 2, 4). This can only be done when the total numbers of elements resulting from the two shapes are the same (2 x 3 x 4 = 3 x 2 x 4 = 24). An error will be raised if the new shape does not correspond to the original shape of an array in this way, as shown here:
>>> np.reshape(a, (3, 3, 3))
-------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: cannot reshape array of size 24 into shape (3,3,3)
Speaking of reshaping a NumPy array, transposing a matrix is a special form of reshaping that flips the elements in the matrix along its diagonal. Computing the transpose of a matrix is a common task in mathematics and machine learning. The transpose of a NumPy array can be computed using the [array].T syntax. For example, when we run a.T in the Terminal, we get the transpose of matrix a, as follows:
>>> a.T
array([[[0.54376986, 0.94479229],
[0.77229612, 0.08970716],
[0.26492963, 0.70845696]],
[[0.00244875, 0.28608588],
[0.32254451, 0.00239815],
[0.5217093 , 0.09788526]],
[[0.74179644, 0.52837161],
[0.0778769 , 0.80097454],
[0.68267418, 0.98864408]],
[[0.14304955, 0.18493272],
[0.2832851 , 0.74721516],
[0.29538502, 0.82521871]]])
And with that, we can conclude our introduction to NumPy arrays. In the next section, we will learn about another concept that goes hand in hand with NumPy arrays: vectorization.
Vectorization
In the broadest sense, the term vectorization in computer science denotes the process of applying a mathematical operation to an array (in a general sense) element by element. For example, an add operation where every element in an array is added to the same term is a vectorized operation; the same goes for vectorized multiplication, where all elements in an array are multiplied by the same term. In general, vectorization is achieved when all array elements are put through the same function.
Vectorization is done by default when an applicable operation is performed on a NumPy array (or multiple arrays). This includes binary functions such as addition, subtraction, multiplication, pision, power, and mod, as well as several unary built-in functions in NumPy, such as absolute value, square root, trigonometric functions, logarithmic functions, and exponential functions.
Before we see vectorization in NumPy in action, it is worth discussing the importance of vectorization and its role in NumPy. As we mentioned previously, vectorization is generally the application of a common operation on the elements in an array. Due to the repeatability of the process, a vectorized operation can be optimized to be more efficient than its alternative implementation in, say, a for loop. However, the trade-off for this capability is that the elements in the array would need to be of the same data type—this is also a requirement for any NumPy array.
With that, let's move on to the following exercise, where we will see this effect in action.
Exercise 2.01: Timing Vectorized Operations in NumPy
In this exercise, we will calculate the speedup achieved by implementing various vectorized operations such as addition, multiplication, and square root calculation with NumPy arrays compared to a pure Python alternative without vectorization. To do this, perform the following steps:
- In the first cell of a new Jupyter notebook, import the NumPy package and the Timer class from the timeit library. The latter will be used to implement our timing functionality:
import numpy as np
from timeit import Timer
- In a new cell, initialize a Python list containing numbers ranging from 0 (inclusive) to 1,000,000 (exclusive) using the range() function, as well as its NumPy array counterpart using the np.array() function:
my_list = list(range(10 ** 6))
my_array = np.array(my_list)
- We will now apply mathematical operations to this list and array in the following steps. In a new cell, write a function named for_add() that returns a list whose elements are the elements in the my_list variable with 1 added to each (we will use list comprehension for this). Write another function named vec_add() that returns the NumPy array version of the same data, which is simply my_array + 1:
def for_add():
return [item + 1 for item in my_list]
def vec_add():
return my_array + 1
- In the next code cell, initialize two Timer objects while passing in each of the preceding two functions. These objects contain the interface that we will use to keep track of the speed of the functions.
Call the repeat() function on each of the objects with the arguments 10 and 10—in essence, we are repeating the timing experiment by 100 times. Finally, as the repeat() function returns a list of numbers representing how much time passed in each experiment for a given function we are recording, we print out the minimum of this list. In short, we want the time of the fastest run of each of the functions:
print('For-loop addition:')
print(min(Timer(for_add).repeat(10, 10)))
print('Vectorized addition:')
print(min(Timer(vec_add).repeat(10, 10)))
The following is the output that this program produced:
For-loop addition:
0.5640330809999909
Vectorized addition:
0.006047582000007878
While yours might be different, the relationship between the two numbers should be clear: the speed of the for loop addition function should be many times lower than that of the vectorized addition function.
- In the next code cell, implement the same comparison of speed where we multiply the numbers by 2. For the NumPy array, simply return my_array * 2:
def for_mul():
return [item * 2 for item in my_list]
def vec_mul():
return my_array * 2
print('For-loop multiplication:')
print(min(Timer(for_mul).repeat(10, 10)))
print('Vectorized multiplication:')
print(min(Timer(vec_mul).repeat(10, 10)))
Verify from the output that the vectorized multiplication function is also faster than the for loop version. The output after running this code is as follows:
For-loop multiplication: 0.5431750800000259
Vectorized multiplication: 0.005795304000002943
- In the next code cell, implement the same comparison where we compute the square root of the numbers. For the Python list, import and use the math.sqrt() function on each element in the list comprehension. For the NumPy array, return the expression np.sqrt(my_array):
import math
def for_sqrt():
return [math.sqrt(item) for item in my_list]
def vec_sqrt():
return np.sqrt(my_array)
print('For-loop square root:')
print(min(Timer(for_sqrt).repeat(10, 10)))
print('Vectorized square root:')
print(min(Timer(vec_sqrt).repeat(10, 10)))
Verify from the output that the vectorized square root function is once again faster than its for loop counterpart:
For-loop square root:
1.1018582749999268
Vectorized square root:
0.01677640299999439
Also, notice that the np.sqrt() function is implemented to be vectorized, which is why we were able to pass the whole array to the function.
This exercise introduced a few vectorized operations for NumPy arrays and demonstrated how much faster they are compared to their pure Python loop counterparts.
Note
To access the source code for this specific section, please refer to https://packt.live/38l3Nk7.
You can also run this example online at https://packt.live/2ZtBSdY.
That concludes the topic of vectorization in NumPy. In the next and final section on NumPy, we'll discuss another powerful feature that the package offers: random sampling.
Random Sampling
In the previous chapter, we saw an example of how to implement randomization in Python using the random library. However, the randomization in most of the methods implemented in that library is uniform, and in scientific computing and data science projects, sometimes, we need to draw samples from distributions other than the uniform one. This area is where NumPy once again offers a wide range of options.
Generally speaking, random sampling from a probability distribution is the process of selecting an instance from that probability distribution, where elements having a higher probability are more likely to be selected (or drawn). This concept is closely tied to the concept of a random variable in statistics. A random variable is typically used to model some unknown quantity in a statistical analysis, and it usually follows a given distribution, depending on what type of data it models. For example, the ages of members of a population are typically modeled using the normal distribution (also known as the bell curve or the Gaussian distribution), while the arrivals of customers to, say, a bank are often modeled using the Poisson distribution.
By randomly sampling a given distribution that is associated with a random variable, we can obtain an actual realization of the variable, from which we can perform various computations to obtain insights and inferences about the random variable in question.
We will revisit the concept and usage of probability distributions later in this book. For now, let's simply focus on the task at hand: how to draw samples from these distributions. This is done using the np.random package, which includes the interface that allows us to draw from various distributions.
For example, the following code snippet initializes a sample from the normal distribution (note that your output might be different from the following due to randomness):
>>> sample = np.random.normal()
>>> sample
-0.43658969989465696
You might be aware of the fact that the normal distribution is specified by two statistics: a mean and a standard deviation. These can be specified using the loc (whose default value is 0.0) and scale (whose default value is 1.0) arguments, respectively, in the np.random.normal() function, as follows:
>>> sample = np.random.normal(loc=100, scale=10)
>>> sample
80.31187658687652
It is also possible to draw multiple samples, as opposed to just a single sample, at once as a NumPy array. To do this, we specify the size argument of the np.random.normal() function with the desired shape of the output array. For example, here, we are creating a 2 x 3 matrix of samples drawn from the same normal distribution:
>>> samples = np.random.normal(loc=100, scale=10, size=(2, 3))
>>> samples
array([[ 82.7834678 , 109.16410976, 101.35105681],
[112.54825751, 107.79073472, 77.70239823]])
This option allows us to take the output array and potentially apply other NumPy-specific operations to it (such as vectorization). The alternative is to sequentially draw inpidual samples into a list and convert it into a NumPy array afterward.
It is important to note that each probability distribution has its own statistic(s) that define it. The normal distribution, as we have seen, has a mean and a standard deviation, while the aforementioned Poisson distribution is defined with a λ (lambda) parameter, which is interpreted as the expectation of interval. Let's see this in an example:
>>> samples = np.random.poisson(lam=10, size=(2, 2))
>>> samples
array([[11, 10],
[15, 11]])
Generally, before drawing a sample from a probability distribution in NumPy, you should always look up the corresponding documentation to see what arguments are available for that specific distribution and what their default values are.
Aside from probability distribution, NumPy also offers other randomness-related functionalities that can be found in the random module. For example, the np.random.randint() function returns a random integer between two given numbers; np.random.choice() randomly draws a sample from a given one-dimensional array; np.random.shuffle(), on the other hand, randomly shuffles a given sequence in-place.
These functionalities, which are demonstrated in the following code snippet, offer a significant degree of flexibility in terms of working with randomness in Python in general, and specifically in scientific computing:
>>> np.random.randint(low=0, high=10, size=(2, 5))
array([[6, 4, 1, 3, 6],
[0, 8, 8, 8, 8]])
>>> np.random.choice([1, 3, 4, -6], size=(2, 2))
array([[1, 1],
[1, 4]])
>>> a = [1, 2, 3, 4]
>>> for _ in range(3):
... np.random.shuffle(a)
... print(a)
[4, 1, 3, 2]
[4, 1, 2, 3]
[1, 2, 4, 3]
A final important topic that we need to discuss whenever there is randomness involved in programming is reproducibility. This term denotes the ability to obtain the same result from a program in a different run, especially when there are randomness-related elements in that program.
Reproducibility is essential when a bug exists in a program but only manifests itself in certain random cases. By forcing the program to generate the same random numbers every time it executes, we have another way to narrow down and identify this kind of bug aside from unit testing.
In data science and statistics, reproducibility is of the utmost importance. Without a program being reproducible, it is possible for one researcher to find a statistically significant result while another is unable to, even when the two have the same code and methods. This is why many practitioners have begun placing heavy emphasis on reproducibility in the fields of data science and machine learning.
The most common method to implement reproducibility (which is also the easiest to program) is to simply fix the seed of the program (specifically its libraries) that utilizes randomness. Fixing the seed of a randomness-related library ensures that the same random numbers will be generated across different runs of the same program. In other words, this allows for the same result to be produced, even if a program is run multiple times on different machines.
To do this, we can simply pass an integer to the appropriate seed function of the library/package that produces randomness for our programs. For example, to fix the seed for the random library, we can write the following code:
>>> import random
>>> random.seed(0) # can use any other number
For the random package in NumPy, we can write the following:
>>> np.random.seed(0)
Setting the seed for these libraries/packages is generally a good practice when you are contributing to a group or an open source project; again, it ensures that all members of the team are able to achieve the same result and eliminates miscommunication.
This topic also concludes our discussion of the NumPy library. Next, we will move on to another integral part of the data science and scientific computing ecosystem in Python: the pandas library.