Apache Spark on Docker

One of the hurdles to quick development in Apache Spark is having to set up a working cluster to test on. And even if you do have a working cluster, how do you test your code when you’re on a train with an intermittent internet connection?

You could install it locally, along with Java, Python and all the other bits, hoping the versions are right and won’t clash with the versions you already have.  Hang on, this is getting messy.

If only there were a quick way to spin up a local cluster that didn’t interfere with the stuff already on your machine.  Let’s do this in Docker.

Docker lets you run lightweight containers on your desktop.  The container “images” already have the software you need installed on them, and all you need to do is tell Docker which network ports to open and how to connect multiple containers together.
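As a quick taste of what that looks like, here’s a single throwaway container with one published port (nginx is just a stand-in image for illustration):

docker run --rm -p 8080:80 nginx    # host port 8080 forwards to port 80 inside the container

The compose file below does the same kind of port publishing for Spark, and wires the master and worker containers together on one network.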

The following docker-compose.yml file sets up all the networking and mounts a folder from your local drive, giving the containers access to your data and scripts while letting you edit them locally.

docker-compose.yml

spark-master:
    image: timvw74/spark
    # Run the standalone master, advertising itself under the hostname below
    command: bin/spark-class org.apache.spark.deploy.master.Master -h spark-master
    hostname: spark-master
    environment:
      MASTER: spark://spark-master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: 127.0.0.1
    expose:
      # internal ports for driver/executor traffic, plus the master
      # port (7077) and the REST submission port (6066)
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7006
      - 7077
      - 6066
    ports:
      - 4040:4040   # application (driver) web UI
      - 6066:6066   # REST job submission
      - 7077:7077   # master port for spark:// connections
      - 8080:8080   # master web UI
    volumes:
      - ./conf/spark-master:/conf   # Spark config for the master
      - ./data:/tmp/data            # local data, visible inside the container

spark-worker-1:
    image: timvw74/spark
    # Run a standalone worker and register it with the master
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    hostname: spark-worker-1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: 127.0.0.1
      SPARK_WORKER_CORES: 2      # cores this worker offers to applications
      SPARK_WORKER_MEMORY: 2g    # memory this worker offers to applications
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
    links:
      - spark-master
    expose:
      # internal ports for executor traffic, plus the worker port (8881)
      - 7012
      - 7013
      - 7014
      - 7015
      - 7016
      - 8881
    ports:
      - 8081:8081   # worker web UI
    volumes:
      - ./conf/spark-worker-1:/conf   # Spark config for this worker
      - ./data:/tmp/data              # same local data folder as the master
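
One thing to note before starting it: the volumes entries above assume three folders exist alongside docker-compose.yml. Create them first; the conf folders can be left empty, in which case Spark simply falls back to its defaults:

mkdir -p conf/spark-master conf/spark-worker-1 data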

Just copy the docker-compose.yml file into a folder and launch the containers by issuing ‘docker-compose up’.  This will download the images if they haven’t already been downloaded, set up the network and volume mounts, then start the instances.
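
Once it’s up, the master’s web UI should be reachable at http://localhost:8080 and the worker’s at http://localhost:8081. To actually run something against the cluster, you can open a Spark shell inside the master container. A minimal sketch, assuming the image’s working directory is the Spark home (as the command: lines above imply) and substituting the container name that docker ps reports:

docker-compose up -d                  # start master and worker in the background
docker ps                             # note the master container's generated name
docker exec -it <master-container-name> bin/spark-shell --master spark://spark-master:7077

Anything you drop into ./data on your machine shows up at /tmp/data inside both containers, so files there can be read straight into the shell.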
