
Docker Compose for Apache Spark: develop clustered applications locally…

Use Apache Spark to showcase building a Docker Compose stack. In this article, I shall present a way to build a clustered application using Apache Spark.

Motivation

In a typical development setup for an Apache Spark application, one is generally limited to running a single-node Spark application on a local machine (such as a laptop). Docker, on the other hand, is already installed on most developers' machines. The better-case scenario is a Linux host operating system (rather than Windows or macOS, as the latter two tend to be a bit slower with Docker). The need for a multi-node Apache Spark instance arises when spark-submit based features are being built into a service under development.

Setup

Docker (most commonly installed as Docker CE) needs to be installed along with docker-compose. Once installed, make sure the docker service is running. On a side note, if you would prefer to clone the code, feel free to clone https://github.com/babloo80/docker-spark-cluster

Base Image

To get started, we need a base image containing Apache Spark 2.4.4 (please substitute the version that fits your requirements). $ cat docker/base/Dockerfile shows the instructions to grab the Spark binary from Apache on top of a container providing Java 8.

FROM openjdk:8u222-stretch

ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.4
ENV HADOOP_VERSION=2.7
ENV SPARK_HOME=/spark

RUN apt-get update \
 && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates jq tar

RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

Using this as our base image, we can build the next one.

Master Image

$ cat docker/spark-master/Dockerfile shows a launcher script being wired in to start the master node.

FROM spark-base:latest

COPY start-master.sh /

ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs

CMD ["/bin/bash", "/start-master.sh"]

And start-master.sh:

export SPARK_MASTER_HOST=`hostname`

. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"

mkdir -p $SPARK_MASTER_LOG
export SPARK_HOME=/spark
ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out

cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master \
  --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
  >> $SPARK_MASTER_LOG/spark-master.out

Worker Image

$ cat docker/spark-worker/Dockerfile shows the launcher script for a worker (similar to the master), except that we will run one or more workers from it. A sketch of start-worker.sh follows the build script below.

FROM spark-base:latest

COPY start-worker.sh /

ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs

CMD ["/bin/bash", "/start-worker.sh"]

The next step is to build the images in preparation for docker-compose.yml. In this step, we build all the Dockerfiles we just created. Execute this script to prepare for the next step.

$ cat build-images.sh

#!/bin/bash
set -e

docker build -t spark-base:latest ./docker/base
docker build -t spark-master:latest ./docker/spark-master
docker build -t spark-worker:latest ./docker/spark-worker
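The worker's start-worker.sh is not listed above. Here is a minimal sketch, assuming it mirrors start-master.sh and that SPARK_MASTER, SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are supplied by docker-compose.yml (see below); treat it as an illustration rather than the exact script from the repository.

#!/bin/bash
# Minimal sketch of start-worker.sh, assuming it mirrors start-master.sh.
# SPARK_MASTER, SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are expected to
# come from the environment defined in docker-compose.yml.

. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"

mkdir -p $SPARK_WORKER_LOG
export SPARK_HOME=/spark
ln -sf /dev/stdout $SPARK_WORKER_LOG/spark-worker.out

# Register this container with the master and expose the worker web UI.
/spark/sbin/../bin/spark-class org.apache.spark.deploy.worker.Worker \
  --webui-port $SPARK_WORKER_WEBUI_PORT \
  --cores $SPARK_WORKER_CORES --memory $SPARK_WORKER_MEMORY \
  $SPARK_MASTER >> $SPARK_WORKER_LOG/spark-worker.out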
The last one is docker-compose.yml. Here, we assign an easy-to-remember IP address, 10.5.0.2, to the master node so that the Spark master can be hardcoded as spark://10.5.0.2:7077. We also set up two worker instances, each with 4 cores and 2 GB of memory. For more information on the environment variables available for the master and workers, please refer to http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

version: "3.7"

networks:
  spark-network:
    ipam:
      config:
        - subnet: 10.5.0.0/16

services:
  spark-master:
    image: spark-master:latest
    ports:
      - "9090:8080"
      - "7077:7077"
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    environment:
      - "SPARK_LOCAL_IP=spark-master"
    networks:
      spark-network:
        ipv4_address: 10.5.0.2

  spark-worker-1:
    image: spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=2G
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.3

  spark-worker-2:
    image: spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=2G
    volumes:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.4
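To try the stack end to end, something along these lines should work. This is a sketch: the container name docker-spark-cluster_spark-master_1 is an assumption based on default compose project naming (check docker ps for the actual name), and the SparkPi example jar path assumes the stock Spark 2.4.4 / Hadoop 2.7 distribution laid out under /spark in the image.

# Build the images and bring up the cluster (one master, two workers).
./build-images.sh
docker-compose up -d

# The master web UI is published on the host at http://localhost:9090.

# Smoke test from inside the master container; the container name below is an
# assumption from the default compose project name -- verify it with `docker ps`.
docker exec -it docker-spark-cluster_spark-master_1 \
  /spark/bin/spark-submit \
  --master spark://10.5.0.2:7077 \
  --class org.apache.spark.examples.SparkPi \
  /spark/examples/jars/spark-examples_2.11-2.4.4.jar 100

Once the job finishes, the two workers and the completed application should be visible in the master web UI.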