Running Spark Job On Minikube Kubernetes Cluster

Kubernetes is another industry buzz words these days and I am trying few different things with Kubernetes. On Feb 28th, 2018 Apache spark released v2.3.0, I am already working on Apache Spark and the new released has added a new Kubernetes scheduler backend that supports native submission of spark jobs to a cluster managed by kubernetes. The feature is still currently experimental but I wanted to try it out.

I am writing this post because I have to search a lot of things to get it working and there is no single place to get the exact information on how to run spark job on a local kubernetes cluster. I thought writing a post on my findings will enable others not to go through the same pain as I did.

In this post we will take a look on how to setup Kubernetes cluster using minikube on local machine and how to run apache spark job on top of it. We are not going to write any code for this, few examples are already available within Apache Spark distribution and we are going to use the same example jar and run SparkPi program from it onto our kubernetes cluster.

Prerequisites

Before we start please make sure you have the following pre-requisites installed on your local machine.

I am running the tools on Mac High Sierra.

Apache Spark (v2.3.0)
Kubernetes (v.1.9.3)
Docker
Minikube
VirtualBox

1. Start Minikube

We can start minikube locally by simply running below command.

	> minikube start
	Starting local Kubernetes v1.9.0 cluster...
	Starting VM...
	Getting VM IP address...
	Moving files into cluster...
	Setting up certs...
	Connecting to cluster...
	Setting up kubeconfig...
	Starting cluster components...
	Kubectl is now configured to use the cluster.
	Loading cached images from config file.

view raw start_minikube.sh hosted with ❤ by GitHub

Minikube is a tool that makes it easy to run Kubernetes locally. Minikube runs a single-node Kubernetes cluster inside a VM on a local machine. It enables us to try out Kubernetes or develop with it day-to-day.

When minikube starts it starts with a single node configuration and takes 1Gb of Memory and 2 cores of CPU by default, however, for running spark this requirement will not suffice and the jobs will fail (and they do as I have tried multiple times) so we have to increase both Memory and CPU cores for our minikube cluster.

There are three different ways to do this. However, for me not all worked and that’s how I came to know about the different ways to change the minikube configuration. You can use any one of the below ways to change the configuration.

Pass Configuration To Minikube Start Command

You can directly pass the memory and CPU options to the minikube start command like::

	> minikube start --memory 8192 --cpus 4
	Starting local Kubernetes v1.9.0 cluster...
	Starting VM...
	Getting VM IP address...
	Moving files into cluster...
	Setting up certs...
	Connecting to cluster...
	Setting up kubeconfig...
	Starting cluster components...
	Kubectl is now configured to use the cluster.
	Loading cached images from config file.

view raw start_minikube.sh hosted with ❤ by GitHub

This will start a minikube cluster with 8Gb of memory and 4 cores of CPU. (somehow for me this didn't work)

Modify minikube config

We can change the minikube config with config command in minikube cli. The config command allows setting different config options for minikube like memory, CPUs, vm-driver, disk-size etc. To see all the available options use > minikube config command it will list all the available options that can be modified.

	> minikube config set memory 8192
	These changes will take effect upon a minikube delete and then a minikube start

view raw modify_minikube_config.sh hosted with ❤ by GitHub

	> minikube config set cpus 4
	These changes will take effect upon a minikube delete and then a minikube start

view raw modify_minikube_config.sh hosted with ❤ by GitHub

After setting configuration using config command we need to delete our previous running cluster and start a new one. To delete minikube cluster run > minikube delete and rerun the start minikube command.

Change Configuration Using VirtualBox

For me, the above two options didn't work and if you are like me you can use this last option which worked for me and hope it works for you. Open VirtualBox app on your machine and select your VM like the one shown in below image. RightClick -> Settings on your VM, this will open the configuration page for the Minikube VM.

note:: To change configuration using VirtualBox first you need to shutdown the VM if it is already running.

Inside option System -> Motherboard, you can change the memory of the VM using the slider, in my case I have given it 8192MB of memory.

To change the CPU config, go to Processor tab and change it to 4. You can try other options also for me 4 works just fine. Don’t change it to less than 2 otherwise things won’t work. ::p

After the configuration changes are done, start the VM using VirtualBox app or using the above given minikube start command.

Your cluster is running and now you need to have a docker image for spark. Let's see how to build the image next.

Creating Docker Image For Spark

Make sure you have Docker installed on your machine and the spark distribution is extracted.

Go inside your extracted spark folder and run the below command to create a spark docker image

some of the output logs are excluded.

	> ./bin/docker-image-tool.sh -m -t spark-docker build
	./bin/docker-image-tool.sh: line 125: ================================================================================: command not found
	Sending build context to Docker daemon 256.5MB
	Step 1/16 : FROM openjdk:8-alpine
	---> 224765a6bdbe
	Step 2/16 : ARG spark_jars=jars
	---> Using cache
	---> dd42fdc28d7a
	Step 3/16 : ARG img_path=kubernetes/dockerfiles
	---> Using cache
	---> 570fc343f883
	Step 4/16 : RUN set -ex && apk upgrade --no-cache && apk add --no-cache bash tini libc6-compat && mkdir -p /opt/spark && mkdir -p /opt/spark/work-dir touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && chgrp root /etc/passwd && chmod ug+rw /etc/passwd
	---> Using cache
	---> 705ae89cb075
	Step 5/16 : COPY ${spark_jars} /opt/spark/jars
	---> Using cache
	---> 506ab8c02abf
	Step 9/16 : COPY ${img_path}/spark/entrypoint.sh /opt/
	---> Using cache
	---> 03959bfa5250
	Step 10/16 : COPY examples /opt/spark/examples
	---> Using cache
	---> 5a2f91a7ce3e
	Step 11/16 : COPY data /opt/spark/data
	---> Using cache
	---> 58090cef2be4
	Step 13/16 : RUN echo $(ls /opt/spark/examples/jars/)
	---> Using cache
	---> 0a6c27628ac7
	Step 14/16 : ENV SPARK_HOME /opt/spark
	---> Using cache
	---> 7f32a22e0196
	Step 16/16 : ENTRYPOINT [ "/opt/entrypoint.sh" ]
	---> Using cache
	---> 01daf3302719
	Successfully built 01daf3302719
	Successfully tagged spark:spark-docker

	> docker image ls
	REPOSITORY TAG IMAGE ID CREATED SIZE
	spark-docker v0.1 01daf3302719 2 days ago 347MB

view raw build_spark_d_image.sh hosted with ❤ by GitHub

Now if you run > docker image ls you will see the docker build available in your local machine. Make a note of this image name we need to provide the image name to spark-submit command.

There is a push option available to the above command which enables you to push the docker image to your own repository this, in turn, will enable your production kubernetes to pull the docker image from the configured Docker repository. Run the same command without any options to see its usage.

It might happen that the command will not work and will give error like::

	./bin/docker-image-tool.sh: line 125: ================================================================================: command not found
	“docker build” requires exactly 1 argument.
	See ‘docker build — help’.
	Usage: docker build [OPTIONS] PATH \| URL \| — [flags]
	Build an image from a Dockerfile

view raw error.sh hosted with ❤ by GitHub

This is because of an issue with the docker-image-tool.sh file. I have raised a bug for this in Apache Spark JIRA you can see it here.

The issue is under fix but for you to continue with this post what you can do is open the docker-image-tool.sh file present inside the bin folder and after line no 59 add BUILD_ARGS=(), save the file and run the command once again and it will work.

For those who the above command worked without the workaround the issue might have been fixed at the time you are reading this post and you don't have to do anything hurrah!!!

Submit Spark Job

Now lets submit our SparkPi job to the cluster. Our cluster is ready and we have the docker image. Run the below command to submit the spark job on a kubernetes cluster. The spark-submit script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports

	> spark-submit \
	--master k8s://https://192.168.99.100:8443 \
	--deploy-mode cluster \
	--name spark-pi \
	--class org.apache.spark.examples.SparkPi \
	--conf spark.executor.instances=3 \
	--conf spark.kubernetes.container.image=spark-docker \
	local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

view raw submit_job.sh hosted with ❤ by GitHub

options used to run on kubernetes are::

--class:: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master:: The master URL for the Kubernetes cluster (e.g. k8s:😕/https://192.168.99.100:8443)
--deploy-mode:: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default:: client)
--conf spark.executor.instances=3:: configuration property to specify how many executor instances to use while running the spark job.
--conf spark.kubernetes.container.image=spark-docker:: Configuration property to specify which docker image to use, here provide the same docker name from docker image ls command.
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar:: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:😕/ path or a local:😕/ path that is present on all nodes

The package jar should be available cluster-wide either through HDFS, HTTP or should be available within every packaged docker image so that it will be available to all the executor nodes in our spark program. The local:// address means that the jar is available as a local file on every initialized pod by the spark driver and no IO has to be made to pull the jar from anywhere, this works well when application jar is large which is pushed to every worker node or shared using any shared filesystem.

Luckily, our spark docker image file packages the example jar in the docker container so we can use it. However, how to package our own application code and push it either on HDFS or as a separate docker image I will write it as a separate post.

To check if the pods are started and the spark job is running, open kubernetes dashboard available within minikube.

View Minikube Dashboard & Kubernetes Logs

To check the status of our submitted job we can use either kubernetes dashboard or view kubernetes logs. Minikube comes with the dashboard available as an addon and can be started using the following command.

	> minikube dashboard
	Opening kubernetes dashboard in default browser...

	OR

	> minikube service list
	\|-------------\|------------------------------------------------------\|--------------------------------\|
	\| NAMESPACE \| NAME \| URL \|
	\|-------------\|------------------------------------------------------\|--------------------------------\|
	\| default \| kubernetes \| No node port \|
	\| default \| neo4j \| No node port \|
	\| default \| neo4j-public \| http://192.168.99.100:30388 \|
	\| \| \| http://192.168.99.100:30598 \|
	\| default \| spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver-svc \| No node port \|
	\| kube-system \| kube-dns \| No node port \|
	\| kube-system \| kubernetes-dashboard \| http://192.168.99.100:30000 \|
	\|-------------\|------------------------------------------------------\|--------------------------------\|

view raw minikube_service_ls.sh hosted with ❤ by GitHub

Navigate to the URL given by above command to view the dashboard. The dashboard provides lots of information about cluster memory usage, CPU usage, pods, services, replica set etc. We can also view service logs directly through the dashboard. However, if you don't want to go to the dashboard you can view the Spark Driver log using > kubectl logs <pod name> command::

	> kubectl logs spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver
	.....
	2018-03-07 13:10:35 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
	2018-03-07 13:10:35 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.091617 s
	Pi is roughly 3.1389956949784747
	2018-03-07 13:10:35 INFO AbstractConnector:318 - Stopped Spark@53e211ee{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
	2018-03-07 13:10:35 INFO SparkUI:54 - Stopped Spark web UI at http://spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver-svc.default.svc:4040
	2018-03-07 13:10:35 INFO KubernetesClusterSchedulerBackend:54 - Shutting down all executors
	2018-03-07 13:10:35 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
	2018-03-07 13:10:35 INFO KubernetesClusterSchedulerBackend:54 - Closing kubernetes client
	2018-03-07 13:10:35 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
	2018-03-07 13:10:35 INFO MemoryStore:54 - MemoryStore cleared
	2018-03-07 13:10:35 INFO BlockManager:54 - BlockManager stopped
	2018-03-07 13:10:35 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
	2018-03-07 13:10:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
	2018-03-07 13:10:35 INFO SparkContext:54 - Successfully stopped SparkContext
	2018-03-07 13:10:35 INFO ShutdownHookManager:54 - Shutdown hook called
	......

view raw spark_logs.sh hosted with ❤ by GitHub

If you see in the logs the program has calculated the Pi value and the container are stopped in the end.

Shutdown Cluster

Shutting down cluster is very easy, use > minikube stop and it will stop the cluster.

Hope this helps you try running apache spark on local kubernetes cluster. Please do comment if you didn’t understand any step or getting any errors while following the steps and I will try to add more details to the post. Also, please do let me know if you liked it and help others by sharing.

Happy Coding...

	> minikube dashboard
	Opening kubernetes dashboard in default browser...

	OR

	> minikube service list
	\|-------------\|------------------------------------------------------\|--------------------------------\|
	\| NAMESPACE \| NAME \| URL \|
	\|-------------\|------------------------------------------------------\|--------------------------------\|
	\| default \| kubernetes \| No node port \|
	\| default \| neo4j \| No node port \|
	\| default \| neo4j-public \| http://192.168.99.100:30388 \|
	\| \| \| http://192.168.99.100:30598 \|
	\| default \| spark-pi-709e1c1b19813e7cbc1aeff45200c64e-driver-svc \| No node port \|
	\| kube-system \| kube-dns \| No node port \|
	\| kube-system \| kubernetes-dashboard \| http://192.168.99.100:30000 \|
	\|-------------\|------------------------------------------------------\|--------------------------------\|