How’s it going there?
Jupyter Notebook is a popular application that enables you to edit, run, and share Python code in a web view. It allows you to modify and re-execute parts of your code in a very flexible way. That’s why Jupyter is a great tool for testing and prototyping programs.
Apache Spark is a fast and powerful framework that provides an API for performing massively distributed processing over resilient distributed datasets.
Get Started with Spark and Jupyter Together
Install Spark
Visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.
Unpack it and move it to your /opt folder:
$ tar -xzf spark-1.2.0-bin-hadoop2.4.tgz
$ mv spark-1.2.0-bin-hadoop2.4 /opt/spark-1.2.0
Create a symbolic link:
$ ln -s /opt/spark-1.2.0 /opt/spark
This way, you will be able to download and use multiple Spark versions, switching between them by updating the symlink.
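For example, if you later install a newer release next to the old one, you could simply re-point the link (the version number below is just a placeholder):
$ ln -sfn /opt/spark-x.y.z /opt/spark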
Finally, tell your bash (or zsh, etc.) where to find Spark. To do so, configure your $PATH variable by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
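Reload your shell configuration and check that Spark is on your $PATH; spark-submit should print the Spark version banner:
$ source ~/.bashrc
$ spark-submit --version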
Install Jupyter
$ pip install jupyter
You can run a regular Jupyter Notebook by typing:
$ jupyter notebook
There are two ways to get PySpark available in a Jupyter Notebook:
1 – Configure the PySpark driver to use Jupyter Notebook: running pyspark will automatically open a Jupyter Notebook.
2 – Load a regular Jupyter Notebook and load PySpark using the findspark package.
Option 1:
Update the PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Restart your terminal and launch PySpark again:
$ pyspark
Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Python [default]’.
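As a quick smoke test, you can run a small job in the first cell. The PySpark shell pre-defines a SparkContext named sc, so a minimal check could look like this:
# 'sc' is pre-defined by the PySpark driver;
# summing 0..99 should return 4950
sc.parallelize(range(100)).sum()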
Option 2:
Use the findspark package to make a SparkContext available in your code.
The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.
To install findspark:
$ pip install findspark
Launch a regular Jupyter Notebook:
$ jupyter notebook
In your Python code, you need to add:
import findspark
findspark.init("/path_to_spark")
Now you can try it out for yourself. I hope this guide will help you get started with Jupyter and Spark easily.
Here is a Python example to test, which estimates π with Monte Carlo sampling:
import findspark
findspark.init("/opt/spark-1.4.1-bin-hadoop2.6/")

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # draw a random point in the unit square and test
    # whether it falls inside the unit circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4.0 * count / num_samples  # 4.0 keeps the division floating-point on Python 2
print(pi)
sc.stop()
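If everything works, the printed value should be close to 3.1416.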
Apache Toree is a kernel for the Jupyter Notebook platform providing interactive access to Apache Spark.
Install Toree.
$ sudo pip install toree
Configure
Set SPARK_HOME to point to the directory where you downloaded and expanded the Spark binaries.
$ SPARK_HOME=$HOME/Downloads/spark-x.x.x-bin-hadoopx.x
$ jupyter toree install \
    --spark_home=$SPARK_HOME
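To confirm that the Toree kernel was registered, you can list the installed kernels (the exact kernel name may vary with the Toree version):
$ jupyter kernelspec list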
Start the notebook.
$ jupyter notebook
Test
Point your browser to http://localhost:8888.
Then open a new notebook using New > Toree.
Test the notebook with some simple Spark Scala code:
sc.parallelize(1 to 100).
  filter(x => x % 2 == 0).
  map(x => x * x).
  take(10)
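If everything is wired up correctly, this should return the squares of the first ten even numbers: Array(4, 16, 36, 64, 100, 144, 196, 256, 324, 400).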
Here you can use Tab for auto-completion.
To run Jupyter with R
Install IRkernel
$ conda install -c r ipython-notebook r-irkernel
You can now open R and install the packages needed by the R kernel for Jupyter:
install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = 'http://irkernel.github.io/', type = 'source')
After the packages are successfully downloaded and installed, register the kernel with Jupyter and quit R:
IRkernel::installspec()
quit()
Start the notebook and check New > R.
You can also install Jupyter on a Raspberry Pi:
$ sudo apt-get install python3-matplotlib
$ sudo apt-get install python3-scipy
$ sudo pip3 install --upgrade pip
$ sudo reboot
$ sudo pip3 install jupyter
To start
$ jupyter-notebook
Simple Python example:
import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
Simple R example:
library(SparkR)
sparkR.session(master = "local[*]")

# do something to prove it works
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
Simple Scala example:
Use the pre-configured SparkContext in the variable sc, as in the sketch below.
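A minimal sketch, assuming the kernel has already bound the SparkContext to sc:
// sample 5 random elements from a distributed collection
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)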
Links
Apache Toree
https://github.com/asimjalis/apache-toree-quickstart
R on Jupyter
https://discuss.analyticsvidhya.com/t/how-to-run-r-on-jupyter-ipython-notebooks/5512/2