An introduction to Apache Spark (PySpark API)
I have been wanting to write this post since the first time I conducted an Apache Spark workshop with Maria Mestre (her blog can be found here) and later with Erik Pazos. Finally, I have found some time to compile everything together and present it in one concise post.
Apache Spark is a popular distributed computing engine that enables processing large datasets in parallel using a cluster of computers. The distributed computing paradigm implemented in Apache Spark is known as the MapReduce framework. As the primary objective of this post is not to delve into the details of MapReduce, please refer to this paper if you want to learn more about how it works.
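To build a quick intuition for the paradigm, the map, shuffle and reduce phases can be sketched in plain Python. This is only an illustration of the semantics with a toy word count; in a real MapReduce system each phase is distributed across a cluster of machines:

```python
from collections import defaultdict

# MAP: turn each input record into (key, value) pairs
lines = ["spark is fast", "spark is fun"]
mapped = [(word, 1) for line in lines for word in line.split()]

# SHUFFLE: group values by key (done over the network in a real cluster)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# REDUCE: combine the values for each key independently
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'fun': 1}
```

Because each key's reduce step depends only on that key's values, the work parallelizes naturally across machines.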
The popularity of the MapReduce framework grew when the Apache Hadoop project (with its Hadoop MapReduce module) emerged. Hadoop MapReduce grew rapidly, with many players in the market adopting the technology to process terabyte-scale data. Spark is a more recent entrant to the market. Having its inception as an academic project at UC Berkeley, Spark solved some of the performance drawbacks of Apache Hadoop: it showed that outstanding performance gains can be achieved through in-memory computing, in contrast to Hadoop, which wrote intermediate data to disk after every step. Spark eventually evolved into an Apache open source project and became Apache Spark. With constant improvement and very short release cycles, Apache Spark has been improving rapidly ever since, capturing a sizeable share of the distributed computing market.
The core Spark engine is mainly written in the Scala programming language, and the initial versions provided three standard API interfaces to Apache Spark: Scala, Java and Python. Recent releases have expanded the offering with an R API, making Spark more attractive for use cases that involve a lot of data science, statistics and big data.
What is PySpark?
PySpark is the Python interface to the Apache Spark distributed computing framework, which has been gaining a lot of traction lately due to the emergence of Big Data and the distributed computing frameworks that enable us to process and extract value from it.
Python has been a popular choice for data science and machine learning in recent years. This is mainly due to features that make it very efficient and attractive for doing data science in industry. Some of the reasons are as follows:
- Python is a production-ready programming language that has been developed and matured over a long period
- Python is a very expressive language that allows you to express programming logic clearly in fewer lines of code than many other languages in use today.
- Python has a rich ecosystem of feature-rich libraries for scientific computing (scipy, numpy), data science & statistics (statsmodels, matplotlib, pyMC, pandas) and machine learning (scikit-learn, scikit-image, nltk)
- Python also has a rich library ecosystem for building industry-scale software products, such as web frameworks (Flask, Django), database adapters and ORM engines (psycopg2, SQLAlchemy) and utility libraries (json, ssl, re), including standard APIs for popular cloud services such as Google Cloud and AWS.
- The availability of a shell interface makes the development process very interactive.
- Research to Production transition is very smooth as the python ecosystem provides tools and libraries for both ends of the spectrum.
As said earlier, PySpark is the Python programming interface to the Apache Spark cluster computing engine.
About the tutorial
This tutorial has been prepared to help someone take their first steps and familiarize themselves with the Apache Spark computing engine using the Python programming interface (pyspark).
This tutorial uses multiple resources to demonstrate different aspects of using Apache Spark.
- Presentation slides to explain the theoretical aspects and logical functionality of Apache Spark
- A Python notebook with executable code that will allow you to run code and experience how Apache Spark works. The notebook is also a playground: set your imagination free and experiment with your own additions to the existing code
- A software environment with the data and libraries needed to run the tutorial code
Structure of the Tutorial
The Tutorial is structured into three main parts.
- Spark RDD API : The Spark RDD API is the core API that allows the user to perform data transformation operations on the rows of a dataset in their raw form.
- The slides for this section explain how the Apache Spark computing engine works and how (key, value) records are manipulated using the RDD API.
- Spark DataFrame API and Spark SQL : The DataFrame API is an enhanced API developed as part of the Apache Spark project that uses the benefits of schema-aware data to process data efficiently.
- The slides for this section show how the DataFrame API works under the hood and why it is recommended to use the DataFrame/SQL interfaces to transform data where possible
- Final Remarks : This section outlines some of the other attractive features in Apache Spark.
- This section outlines some complementary projects that make Apache Spark more attractive for data processing and machine learning.
- Some features in newer Spark releases that complement efficient data manipulation
- References to some papers and URLs that provide additional information that will help you better understand how Apache Spark works.
- Some references to where to go from here (Beginner to Expert 😀 )…
Presentation slides outline the theoretical aspects of the technology and logical explanations about how things work and why.
The presentation slides are found below:
The Jupyter notebook is a very useful tool for observing how PySpark syntax works. Furthermore, it can be used to experiment with the raw data, Spark DataFrames and RDDs in the notebook to get a better understanding of how Spark works.
The notebook with all the required data files can be found in the following GitHub repository:
Click here to go to the GitHub repository.
Directions for setting up a local environment to run the GitHub repository are found in the section below: How to set up the environment
How to set up the environment
You will need to set up an environment to run the tutorial on your local machine. To keep your local Python environment undisturbed, I highly recommend you set up a Python virtual environment to run this code.
There are a few steps to setting up the environment. They are:
- Set up the virtual environment
- Download the GitHub repository
- Install the required libraries
- Download and set up Apache Spark
- FIRE AWAY !!!
Set up the virtual environment
- Go to the local directory where you want to create the virtual environment
- Eg: /home/in4maniac/path/to/the/desired/venv/directory
- Create a virtual environment with desired name
- Eg: spark_tutorial_env
virtualenv spark_tutorial_env
- Activate the virtual environment
source spark_tutorial_env/bin/activate
Once the virtual environment is activated, you should see the virtual environment name within brackets in front of your shell prompt
Download the GitHub repository
Now that you have set up the virtual environment, you should download the GitHub repository.
- Go to the local directory where you want to clone the GitHub repository
- Eg: /home/in4maniac/path/to/the/desired/github/clone/directory
- Clone the git repository
git clone git@github.com:mrm1001/spark_tutorial.git
Install the required libraries
We need to install the Python libraries required to run the tutorial. The following libraries have to be installed:
- numpy : a prerequisite for scikit-learn
- scipy : a prerequisite for scikit-learn
- scikit-learn : required to run the machine learning classifier
- ipython : required to power the notebook
- jupyter : required to run the notebook
The following code snippet will install the appropriate versions of the libraries:
cd /home/in4maniac/path/to/the/desired/github/clone/directory/spark_tutorial
pip install -r requirements.txt
Download and set up Apache Spark
The final building block is Apache Spark. For this tutorial, we use Apache Spark version 1.6.2.
There are multiple ways you can download and build Apache Spark (Refer here). For simplicity, we use pre-built Spark for this tutorial as it is the easiest to set up. We use Apache Spark version 1.6.2 pre-built for Hadoop 2.4.
- Go to the Apache Spark download page
- Select the right download file from the User Interface
- Choose a Spark release: 1.6.2 (Jun 25 2016)
- Choose a package type: Pre-built for Hadoop 2.4
- Choose a download type: Direct Download
- Download Spark: spark-1.6.2-bin-hadoop2.4.tgz
- You can download the file by clicking on the hyperlink to the .tgz file
- Extract the .tgz file to your desired directory
tar -xf spark-1.6.2-bin-hadoop2.4.tgz -C /home/in4maniac/path/to/the/desired/spark/directory/
- As this package is pre-built, Spark is ready to use as soon as you unzip the package 😀
FIRE AWAY !!!
Now that the environment is set, you can run the tutorial in the virtual environment.
- Go to the directory containing your local copy of the GitHub repository (the download location from the section: Download the GitHub repository)
- Run pyspark with the IPYTHON options to launch PySpark in a Jupyter notebook
IPYTHON_OPTS="notebook" /home/in4maniac/path/to/the/desired/spark/directory/spark-1.6.2-bin-hadoop2.4/bin/pyspark
- Jupyter Notebook will launch in your default browser upon triggering the above command.
- Select “Spark_Tutorial.ipynb” and launch the notebook
- Start playing with the notebook
- You can use Cell >> All Output >> Clear to clear all the cell outputs.
Videos of the tutorial, recorded in London a while ago, are hosted at the following YouTube URLs. Although some content may have changed over time, these links should be useful.
Maria Mestre on Spark RDDs
Sahan Bulathwela on DataFrames and Spark SQL
Hope you guys enjoy it !!