Docker Compose for Apache Spark: develop a clustered application locally
Use Apache Spark to showcase building a Docker Compose stack
In this article, I present a way to build a clustered application locally using Apache Spark and Docker Compose.
Motivation
In a typical development setup for an Apache Spark application, one is generally limited to running a single-node Spark application on a local machine (such as a laptop).
Docker, on the other hand, is installed on most developers' machines. The better-case scenario is a Linux host operating system (instead of Windows or Mac OS, as the latter two tend to be a bit slower with Docker).
A multi-node Apache Spark instance becomes necessary when spark-submit-based features are being built into a service under development.
Setup
Docker (most commonly installed as Docker CE) needs to be installed along with docker-compose.
Once installed, make sure the Docker service is running.
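A quick way to verify, assuming a systemd-based Linux host (the exact commands vary by platform):
$ sudo systemctl enable --now docker   # start Docker now and at every boot
$ docker info                          # succeeds only if the daemon is reachable
$ docker-compose version               # confirms docker-compose is on the PATH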
On a side note, if you would prefer to clone the code instead, feel free to clone https://github.com/babloo80/docker-spark-cluster
Base Image
To get started, we'll need to set up a base image containing Apache Spark 2.4.4 (please substitute the version that fits your requirements).
$ cat docker/base/Dockerfile
It contains instructions to grab the Spark binary from an Apache mirror, on top of a container image providing Java 8.
FROM openjdk:8u222-stretch

ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.4
ENV HADOOP_VERSION=2.7
ENV SPARK_HOME=/spark

RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates jq tar

RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
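Before wiring things together, the base image can be built and smoke-tested on its own. This is just a sanity check, assuming the docker/base directory from the repo layout above; spark-submit --version prints the Spark build that was unpacked into /spark:
$ docker build -t spark-base:latest ./docker/base
$ docker run --rm spark-base:latest /spark/bin/spark-submit --version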
Using this as our base image, we can build the next two.
Master Image
$ cat docker/spark-master/Dockerfile
It contains a launcher script to start the master node.
FROM spark-base:latest

COPY start-master.sh /

ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs

CMD ["/bin/bash", "/start-master.sh"]
And start-master.sh:
#!/bin/bash
export SPARK_MASTER_HOST=`hostname`

. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"

mkdir -p $SPARK_MASTER_LOG
export SPARK_HOME=/spark

# Symlink the log file to /dev/stdout so the master's log streams to the container's stdout
ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out

cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out
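Once the image is built (see build-images.sh below), the master can be sanity-checked on its own; mapping the web UI port lets you browse it at http://localhost:9090, the same mapping the compose file uses later:
$ docker run --rm -p 9090:8080 spark-master:latest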
Worker Image
$ cat docker/spark-worker/Dockerfile
It contains a launcher script for a worker, similar to the master's; the difference is that we'll run one or more workers from this same image. A sketch of start-worker.sh follows the Dockerfile below.
FROM spark-base:latest

COPY start-worker.sh /

ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs

CMD ["/bin/bash", "/start-worker.sh"]
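The original post doesn't reproduce start-worker.sh, so here is a minimal sketch mirroring start-master.sh. It assumes the same /spark layout; org.apache.spark.deploy.worker.Worker is Spark's standalone worker entry point, and SPARK_MASTER is supplied by docker-compose.yml below.
#!/bin/bash

. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"

mkdir -p $SPARK_WORKER_LOG
export SPARK_HOME=/spark

# Stream the worker log to the container's stdout, as in start-master.sh
ln -sf /dev/stdout $SPARK_WORKER_LOG/spark-worker.out

# Register with the master given in $SPARK_MASTER (e.g. spark://spark-master:7077)
/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
  --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER \
  >> $SPARK_WORKER_LOG/spark-worker.out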
The next step is to build the images in preparation for docker-compose.yml. The script below builds all of the Dockerfiles we just created; execute it before moving on.
$ cat build-images.sh
#!/bin/bash
set -e

docker build -t spark-base:latest ./docker/base
docker build -t spark-master:latest ./docker/spark-master
docker build -t spark-worker:latest ./docker/spark-worker
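Make the script executable and run it from the repository root (the paths assume the layout of the cloned repo):
$ chmod +x build-images.sh
$ ./build-images.sh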
The last piece is docker-compose.yml. Here, we give the master node an easy-to-remember IP address, 10.5.0.2, so that one can hardcode the Spark master as spark://10.5.0.2:7077.
We also set up two worker instances with 4 cores and 2 GB of memory each. For more information on the environment variables available for the master and workers, please refer to http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
version: "3.7"
networks:
spark-network:
ipam:
config:
- subnet: 10.5.0.0/16 services:
spark-master:
image: spark-master:latest
ports:
- "9090:8080"
- "7077:7077"
volumes:
- ./apps:/opt/spark-apps
- ./data:/opt/spark-data
environment:
- "SPARK_LOCAL_IP=spark-master"
networks:
spark-network:
ipv4_address: 10.5.0.2 spark-worker-1:
image: spark-worker:latest
depends_on:
- spark-master
environment:
- SPARK_MASTER=spark://spark-master:7077
- SPARK_WORKER_CORES=4
- SPARK_WORKER_MEMORY=2G
volumes:
- ./apps:/opt/spark-apps
- ./data:/opt/spark-data
networks:
spark-network:
ipv4_address: 10.5.0.3 spark-worker-2:
image: spark-worker:latest
depends_on:
- spark-master
environment:
- SPARK_MASTER=spark://spark-master:7077
- SPARK_WORKER_CORES=4
- SPARK_WORKER_MEMORY=2G
volumes:
- ./apps:/opt/spark-apps
- ./data:/opt/spark-data
networks:
spark-network:
ipv4_address: 10.5.0.4
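With everything in place, the stack can be exercised end to end. A quick smoke test, assuming the service names from the compose file above and the examples jar that ships with the Spark 2.4.4 / Scala 2.11 distribution (adjust the jar name to your versions):
$ docker-compose up -d

# The master web UI is now at http://localhost:9090; both workers should show as ALIVE.
# Run the bundled SparkPi example against the cluster from inside the master container:
$ docker-compose exec spark-master /spark/bin/spark-submit \
    --master spark://10.5.0.2:7077 \
    --class org.apache.spark.examples.SparkPi \
    /spark/examples/jars/spark-examples_2.11-2.4.4.jar 100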