
How it works...
This section explains the reasoning behind the installation process for Python, Anaconda, and Spark.
- Spark runs on the Java Virtual Machine (JVM), so the Java Software Development Kit (SDK) is a prerequisite installation for Spark to run on an Ubuntu virtual machine. Whether Spark runs on a local machine or in a cluster, a minimum version of Java 6 is required.
- Ubuntu recommends the sudo apt install method for Java, as it ensures that the downloaded packages are up to date.
- Please note that if Java is not currently installed, the output in the terminal will show the following message:
The program 'java' can be found in the following packages:
* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Try: sudo apt install <selected package>
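The check-and-install step above can be sketched as follows; openjdk-8-jre-headless is just one of the packages suggested in the message, and any of the listed runtimes satisfies Spark's Java requirement:

```shell
# Check whether Java is already on the PATH; if not, install one of
# the suggested packages (openjdk-8-jre-headless is used here).
if ! command -v java >/dev/null 2>&1; then
    sudo apt update
    sudo apt install -y openjdk-8-jre-headless
fi

# Confirm the installed version (Spark requires at least Java 6).
java -version
```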
- While Python 2 still works, it is considered legacy Python and reaches its end of life in 2020; therefore, it is recommended that all new Python development be performed with Python 3, as will be the case in this publication. Until recently, Spark was only available with Python 2; that is no longer the case, and Spark works with both Python 2 and 3. A convenient way to install Python 3, along with many dependencies and libraries, is through Anaconda. Anaconda is a free and open source distribution of Python and R that manages the installation and maintenance of many of the most common packages used in Python for data science-related tasks.
During the installation process for Anaconda, it is important to confirm the following conditions:
- Anaconda is installed in the /home/username/Anaconda3 location
- The Anaconda installer prepends the Anaconda3 install location to the PATH variable in /home/username/.bashrc
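The effect of that .bashrc line can be reproduced and checked from the terminal; the path below assumes the /home/username/Anaconda3 install location mentioned above:

```shell
# Prepend the Anaconda install location to PATH, as the installer's
# line in ~/.bashrc does (path assumes the default install location).
export PATH="$HOME/Anaconda3/bin:$PATH"

# Any python launched from this shell now resolves to Anaconda's python.
echo "$PATH"
```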
- After Anaconda has been installed, download Spark. Unlike Python, Spark does not come preinstalled on Ubuntu and therefore needs to be downloaded and installed.
For the purposes of development with deep learning, the following preferences will be selected for Spark:
- Spark release: 2.2.0 (Jul 11 2017)
- Package type: Prebuilt for Apache Hadoop 2.7 and later
- Download type: Direct download
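With those preferences selected, the archive can be fetched and unpacked from the terminal. The mirror URL below is an example; the actual link is shown on the Spark download page:

```shell
# Download the prebuilt Spark 2.2.0 / Hadoop 2.7 package (example mirror;
# use the link shown on the Spark download page).
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz

# Unpack the archive and launch the PySpark shell to confirm the install.
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7
./bin/pyspark
```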
- Once Spark has been successfully installed, the output from executing Spark at the command line should look similar to the following screenshot:

- Two important features to note when initializing Spark are that it is running under the Python 3.6.1 | Anaconda 4.4.0 (64-bit) framework and that the Spark logo shows version 2.2.0.
- Congratulations! Spark is successfully installed on the local Ubuntu virtual machine. However, not everything is complete. Spark development works best when Spark code can be executed within a Jupyter notebook, especially for deep learning. Thankfully, Jupyter was installed with the Anaconda distribution earlier in this section.