Basics of RDD operation
Let's now go through some RDD operational basics. The best way to understand what something does is to look at the documentation so that we can get a rigorous understanding of what a function performs.
The reason why this is very important is that the documentation is the golden source of how a function is defined and what it is designed to be used as. By reading the documentation, we make sure that we are as close to the source as possible in our understanding. The link to the relevant documentation is https://spark.apache.org/docs/latest/rdd-programming-guide.html.
So, let's start with the map function. The map function returns an RDD by applying the f function to each element of this RDD. In other words, it works the same as the map function we see in Python. On the other hand, the filter function returns a new RDD containing only the elements that satisfy a predicate, and that predicate, which is a Boolean, is often returned by an f function fed into the filter function. Again, this works very similarly to the filter function in Python. Lastly, the collect() function returns a list that contains all the elements in this RDD. And this is where I think reading the documentation really shines, when we see notes like this. This would never come up in Stack Overflow or a blog post if you were simply googling what this is.
So, we're saying that collect() should only be used if the resulting array is expected to be small, as all the data is loaded in a driver's memory. What that means is, if we think back on Chapter 01, Installing PySpark and Setting Up Your Development Environment, Spark is superb because it can collect and parallelize data across many different unique machines, and have it transparently operatable from one Terminal. What collects notes is saying is that, if we call collect(), the resulting RDD would be completely loaded into the driver's memory, in which case we lose the benefits of distributing the data around a cluster of Spark instances.
Now that we know all of this, let's see how we actually apply these three functions to our data. So, go back to the PySpark Terminal; we have already loaded our raw data as a text file, as we have seen in previous chapters.
We will write a filter function to find all the lines to indicate RDD data, where each line contains the word normal, as seen in the following screenshot:
contains_normal = raw_data.filter(lambda line: "normal." in line)
Let's analyze what this means. Firstly, we are calling the filter function for the RDD raw data, and we're feeding it an anonymous lambda function that takes one line parameter and returns the predicates, as we have read in the documentation, on whether or not the word normal exists in the line. At this moment, as we have discussed in the previous chapters, we haven't actually computed this filter operation. What we need to do is call a function that actually consolidates the data and forces Spark to calculate something. In this case, we can count on contains_normal, as demonstrated in the following screenshot:
You can see that it has counted just over 970,000 lines in the raw data that contain the word normal. To use the filter function, we provide it with the lambda function and use a consolidating function, such as counts, that forces Spark to calculate and compute the data in the underlying DataFrame.
For the second example, we will use the map. Since we downloaded the KDD cup data, we know that it is a comma-separated value file, and so, one of the very easy things for us to do is to split each line by two commas, as follows:
split_file = raw_data.map(lambda line: line.split(","))
Let's analyze what is happening. We call the map function on raw_data. We feed it an anonymous lambda function called line, where we are splitting the line function by using ,. The result is a split file. Now, here the power of Spark really comes into play. Recall that, in the contains_normal. filter, when we called a function that forced Spark to calculate count, it took us a few minutes to come up with the correct results. If we perform the map function, it is going to have the same effect, because there are going to be millions of lines of data that we need to map through. And so, one of the ways to quickly preview whether our mapping function runs correctly is if we can materialize a few lines instead of the whole file.
To do this, we can use the take function that we have used before, as demonstrated in the following screenshot:
This might take a few seconds because we are only taking five lines, which is our splits and is actually quite manageable. If we look at this sample output, we can understand that our map function has been created successfully. The last thing we can do is to call collect() on raw data as follows:
raw_data.collect()
This is designed to move all of the raw data from Spark's RDD data structure into the memory.