This project has retired. For details please refer to its Attic page.
Version: 0.6.0

YARN Runtime Quick Start Guide


See also: Running Submarine on YARN.

Build your own Docker image

If you want to build your own Docker image for TensorFlow, PyTorch, or MXNet while following the instructions below, see Build your Docker image for more details.

Launch TensorFlow Application:

Without Docker

You need:

  • A Python virtual environment with TensorFlow 1.13.1 installed
  • A cluster with Hadoop 2.9 or above.
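The Hadoop requirement above can be checked before submitting anything. The following is a minimal sketch, not part of Submarine: the function name hadoop_version_ok is hypothetical, and it compares only the major and minor version components.

```shell
# Return success when a Hadoop version string is 2.9 or newer.
# Only the major and minor components are compared.
hadoop_version_ok() {
  version="$1"
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # text before the next dot
  if [ "$major" -gt 2 ]; then
    return 0
  fi
  if [ "$major" -eq 2 ] && [ "$minor" -ge 9 ]; then
    return 0
  fi
  return 1
}

# On a real cluster the version string could be taken from the first
# line of `hadoop version`, which looks like "Hadoop 3.1.0":
#   installed="$(hadoop version | head -n 1 | awk '{print $2}')"
hadoop_version_ok "3.1.0" && echo "3.1.0 is supported"
hadoop_version_ok "2.7.3" || echo "2.7.3 is too old"
```

Version strings with a non-numeric minor component are not handled by this sketch.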

Building a Python virtual environment with TensorFlow

TonY requires a Python virtual environment zip with TensorFlow and any needed Python libraries already installed.

tar xf virtualenv-16.0.0.tar.gz

# Make sure to install using Python 3, as TensorFlow only provides Python 3 artifacts
python virtualenv-16.0.0/ venv
. venv/bin/activate
pip install tensorflow==1.13.1
zip -r venv

The commands above produce the virtual environment zip that is used in the example below. There is no need to copy it to other nodes, and it is not needed when running the job with Docker.
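Because the virtual environment has to be created with Python 3, it may help to confirm which interpreter the python command resolves to before running the steps above. This is a small sketch; the helper name is_python3 is hypothetical and it only inspects the text printed by python --version.

```shell
# Return success when a `python --version` style string reports Python 3.
is_python3() {
  case "$1" in
    "Python 3."*) return 0 ;;
    *) return 1 ;;
  esac
}

# On a real machine the string would come from:
#   python --version 2>&1
# (Python 2 prints its version to stderr, hence the redirect.)
is_python3 "Python 3.6.8" && echo "ok: Python 3"
is_python3 "Python 2.7.18" || echo "not Python 3"
```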

Note: If you require a version of TensorFlow and TensorBoard prior to 1.13.1, take a look at this issue.

Get the training examples

Get from

CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \
java org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
--framework tensorflow \
--verbose \
--input_path "" \
--num_workers 2 \
--worker_resources memory=1G,vcores=1 \
--num_ps 1 \
--ps_resources memory=1G,vcores=1 \
--worker_launch_cmd " --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
--ps_launch_cmd " --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
--insecure \
--conf tony.containers.resources=path-to/,path-to/,path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar
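The tony.containers.resources value in the command above is a comma-separated list of local paths. When the list is assembled in a script, a helper like the one below keeps the separators right. It is only a sketch: join_resources is a hypothetical name and the paths are placeholders, not files shipped with Submarine.

```shell
# Join all arguments with commas, the separator used in the
# tony.containers.resources value.
join_resources() {
  joined=""
  for item in "$@"; do
    if [ -z "$joined" ]; then
      joined="$item"
    else
      joined="$joined,$item"
    fi
  done
  printf '%s\n' "$joined"
}

# Placeholder paths, used only for illustration.
join_resources "/tmp/demo/env.zip" "/tmp/demo/train.py" "/tmp/demo/submarine-all.jar"
```

The result can then be passed as --conf tony.containers.resources="$(join_resources ...)".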

You should then be able to see links and status of the jobs from command line:

2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED
2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED
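The status lines above are regular enough to be reduced to a short summary. The sketch below assumes the client output has been captured to a file or pipe; task_statuses is a hypothetical helper, and the pattern is written against the log lines shown in this guide only.

```shell
# Print "<task name> <index> <status>" for every TaskInfo line on stdin.
# The optional commas cover both log layouts that appear in this guide.
task_statuses() {
  sed -nE 's/.*\[TaskInfo\] name: ([a-z]+),? index: ([0-9]+),? url: [^ ]+ status: ([A-Z]+).*/\1 \2 \3/p'
}

# Two lines copied from the output above.
task_statuses <<'EOF'
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
EOF
```

Lines that do not contain a TaskInfo record, such as the "Logs for" lines, are skipped.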

With Docker

CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \
java org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
--framework tensorflow \
--docker_image hadoopsubmarine/tf-1.8.0-cpu:0.0.1 \
--input_path hdfs://pi-aw:9000/dataset/cifar-10-data \
--worker_resources memory=3G,vcores=2 \
--worker_launch_cmd "export CLASSPATH=\$(/hadoop-3.1.0/bin/hadoop classpath --glob) && cd /test/models/tutorials/image/cifar10_estimator && python --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --variable-strategy=CPU --num-gpus=0 --sync" \
--env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env HADOOP_HOME=/hadoop-3.1.0 \
--env HADOOP_YARN_HOME=/hadoop-3.1.0 \
--env HADOOP_COMMON_HOME=/hadoop-3.1.0 \
--env HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env HADOOP_CONF_DIR=/hadoop-3.1.0/etc/hadoop \
--conf tony.containers.resources=path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar


1) DOCKER_JAVA_HOME points to JAVA_HOME inside Docker image.

2) DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside Docker image.
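In the command above, every Hadoop variable points at the same install root inside the image, and both Java variables share one path. A script can generate the --env flags from those two values instead of repeating them. This is a sketch under that assumption; env_flag and docker_env_flags are hypothetical helper names.

```shell
# Print one "--env NAME=value" flag.
env_flag() {
  printf '%s %s=%s\n' "--env" "$1" "$2"
}

# Print the --env flags for a Hadoop install root ($1) and a Java home ($2),
# both given as paths inside the Docker image.
docker_env_flags() {
  hadoop_root="$1"
  java_home="$2"
  env_flag JAVA_HOME "$java_home"
  env_flag DOCKER_JAVA_HOME "$java_home"
  for name in DOCKER_HADOOP_HDFS_HOME HADOOP_HOME HADOOP_YARN_HOME \
              HADOOP_COMMON_HOME HADOOP_HDFS_HOME; do
    env_flag "$name" "$hadoop_root"
  done
  env_flag HADOOP_CONF_DIR "$hadoop_root/etc/hadoop"
}

# Values taken from the example command above.
docker_env_flags /hadoop-3.1.0 /usr/lib/jvm/java-8-openjdk-amd64
```

If the image keeps its Hadoop components in different directories, the variables need to be set individually instead.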

The TonY submodule was removed in SUBMARINE-371; Submarine now uses the TonY dependency directly.

After Submarine v0.2.0, an uber jar, submarine-all-${SUBMARINE_VERSION}-hadoop-${HADOOP_VERSION}.jar, is released together with submarine-core-${SUBMARINE_VERSION}.jar, submarine-yarnservice-runtime-${SUBMARINE_VERSION}.jar, and submarine-tony-runtime-${SUBMARINE_VERSION}.jar.
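The uber jar name follows a fixed pattern, so it can be derived from the two version variables that the commands in this guide reference. The sketch below only builds the file name; submarine_all_jar is a hypothetical helper and the Hadoop version shown is an assumed value, so check it against the jar you actually downloaded or built.

```shell
# Build the uber jar file name from a Submarine version and a Hadoop version.
submarine_all_jar() {
  printf 'submarine-all-%s-hadoop-%s.jar\n' "$1" "$2"
}

SUBMARINE_VERSION="0.6.0"        # the version this guide is written for
SUBMARINE_HADOOP_VERSION="3.1"   # assumed value, for illustration only

submarine_all_jar "$SUBMARINE_VERSION" "$SUBMARINE_HADOOP_VERSION"
```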

Launch PyTorch Application:

Without Docker

You need:

  • A Python virtual environment with PyTorch 0.4.0+ installed
  • A cluster with Hadoop 2.9 or above.

Building a Python virtual environment with PyTorch

TonY requires a Python virtual environment zip with PyTorch and any needed Python libraries already installed.

tar xf virtualenv-16.0.0.tar.gz

python virtualenv-16.0.0/ venv
. venv/bin/activate
# The PyTorch package is published on PyPI under the name "torch"
pip install torch==0.4.0
zip -r venv

Get the training examples

Get from

CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \
java org.apache.submarine.client.cli.Cli job run --name PyTorch-job-001 \
--framework pytorch \
--num_workers 2 \
--worker_resources memory=3G,vcores=2 \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
--worker_launch_cmd "" \
--ps_launch_cmd "" \
--insecure \
--conf tony.containers.resources=path-to/,path-to/, \

You should then be able to see links and status of the jobs from command line:

2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED
2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED

With Docker

CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \
java org.apache.submarine.client.cli.Cli job run --name PyTorch-job-001 \
--framework pytorch \
--docker_image pytorch-latest-gpu:0.0.1 \
--input_path "" \
--num_workers 1 \
--worker_resources memory=3G,vcores=2 \
--worker_launch_cmd "cd /test/ && python" \
--env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env HADOOP_HOME=/hadoop-3.1.2 \
--env HADOOP_YARN_HOME=/hadoop-3.1.2 \
--env HADOOP_COMMON_HOME=/hadoop-3.1.2 \
--env HADOOP_HDFS_HOME=/hadoop-3.1.2 \
--env HADOOP_CONF_DIR=/hadoop-3.1.2/etc/hadoop \
--conf tony.containers.resources=path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar

Launch MXNet Application:

Without Docker

You need:

  • A Python virtual environment with MXNet installed
  • A cluster with Hadoop 2.9 or above.

Building a Python virtual environment with MXNet

TonY requires a Python virtual environment zip with MXNet and any needed Python libraries already installed.

tar xf virtualenv-16.0.0.tar.gz

python virtualenv-16.0.0/ venv
. venv/bin/activate
pip install mxnet==1.5.1
zip -r venv

Get the training examples

Get from this link

CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \
java org.apache.submarine.client.cli.Cli job run --name MXNet-job-001 \
--framework mxnet \
--input_path "" \
--num_workers 2 \
--worker_resources memory=3G,vcores=2 \
--worker_launch_cmd " --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync" \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
--ps_launch_cmd " --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync" \
--num_schedulers 1 \
--scheduler_resources memory=1G,vcores=1 \
--scheduler_launch_cmd " --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync" \
--insecure \
--conf tony.containers.resources=path-to/,path-to/, \
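The command above requests containers for three roles, so it is worth adding up what the job asks of the cluster before submitting it. The sketch below is plain arithmetic over the flag values; total_request is a hypothetical helper, and the sum does not include the application master container that YARN also allocates for the job.

```shell
# Sum resource requests. Each input line is "count memory_gb vcores"
# for one role (workers, parameter servers, schedulers).
total_request() {
  awk '{ mem += $1 * $2; cores += $1 * $3 }
       END { printf "memory=%dG,vcores=%d\n", mem, cores }'
}

# Values from the command above: 2 workers and 2 parameter servers at
# memory=3G,vcores=2 each, plus 1 scheduler at memory=1G,vcores=1.
printf '%s\n' "2 3 2" "2 3 2" "1 1 1" | total_request
# prints: memory=13G,vcores=9
```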

You should then be able to see links and status of the jobs from command line:

2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 1, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000004/pi status: RUNNING
2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000003/pi status: RUNNING
2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000006/pi status: RUNNING
2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000005/pi status: RUNNING
2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: scheduler, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi status: RUNNING
2020-04-16 20:23:43,839 INFO tony.TonyClient: Logs for scheduler 0 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi
2020-04-16 20:23:43,839 INFO tony.TonyClient: Logs for server 0 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000003/pi
2020-04-16 20:23:43,840 INFO tony.TonyClient: Logs for server 1 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000004/pi
2020-04-16 20:23:43,840 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000005/pi
2020-04-16 20:23:43,840 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000006/pi
2020-04-16 21:02:09,723 INFO tony.TonyClient: Task status updated: [TaskInfo] name: scheduler, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi status: SUCCEEDED
2020-04-16 21:02:09,736 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000005/pi status: SUCCEEDED
2020-04-16 21:02:09,737 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 1, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000004/pi status: SUCCEEDED
2020-04-16 21:02:09,737 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000006/pi status: SUCCEEDED
2020-04-16 21:02:09,737 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000003/pi status: SUCCEEDED
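When the client output is captured to a file, the final state of the job can be checked by keeping the last status reported for each task. This is a sketch written against the log lines shown above; all_tasks_done is a hypothetical helper, and treating SUCCEEDED and FINISHED as the only successful end states is an assumption based on the sample logs in this guide.

```shell
# Succeed when at least one task was seen and the last status reported
# for every task is SUCCEEDED or FINISHED. Reads client log lines on stdin.
all_tasks_done() {
  awk '
    /\[TaskInfo\]/ {
      key = ""; status = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "name:")   { key = $(i + 1) }        # task role
        if ($i == "index:")  { key = key $(i + 1) }    # role plus index
        if ($i == "status:") { status = $(i + 1) }
      }
      if (!(key in last)) { seen++ }
      last[key] = status                               # later lines win
    }
    END {
      if (seen == 0) { exit 1 }
      for (k in last) {
        if (last[k] != "SUCCEEDED" && last[k] != "FINISHED") { exit 1 }
      }
    }'
}

# Two lines copied from the output above for the same task.
if all_tasks_done <<'EOF'
2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: scheduler, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi status: RUNNING
2020-04-16 21:02:09,723 INFO tony.TonyClient: Task status updated: [TaskInfo] name: scheduler, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi status: SUCCEEDED
EOF
then
  echo "all tasks finished"
fi
```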

With Docker

You can refer to this sample Dockerfile when building your own Docker image.

CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \
java org.apache.submarine.client.cli.Cli job run --name MXNet-job-001 \
--framework mxnet \
--docker_image <your_docker_image> \
--input_path "" \
--num_schedulers 1 \
--scheduler_resources memory=1G,vcores=1 \
--scheduler_launch_cmd "/usr/bin/python --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync" \
--num_workers 2 \
--worker_resources memory=2G,vcores=1 \
--worker_launch_cmd "/usr/bin/python --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync" \
--num_ps 2 \
--ps_resources memory=2G,vcores=1 \
--ps_launch_cmd "/usr/bin/python --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync" \
--verbose \
--insecure \
--conf tony.containers.resources=path-to/,path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar

Use YARN Service to run Submarine: Deprecated

Historically, Submarine supported using YARN Service to submit deep learning jobs. This is no longer supported because YARN Service is not actively developed by the community, and extra dependencies such as RegistryDNS and ATS-v2 cause many setup issues.

For now, you can still use YARN Service to run Submarine, but the code will be removed in a future release. Only TonY will be supported for running Submarine on YARN.