Version: 0.6.0


This document gives a quick overview of the basic usage of the Submarine platform. You can complete each step of the ML model lifecycle on the platform without having to deal with environment setup problems.


Prepare a Kubernetes cluster

  1. Prerequisite
  2. Start minikube cluster
$ minikube start --vm-driver=docker --cpus 8 --memory 4096 --kubernetes-version v1.15.11

Launch Submarine in the cluster

  1. Clone the project
$ git clone
  2. Install the resources with the Helm chart
$ cd submarine
$ helm install submarine ./helm-charts/submarine

Ensure Submarine is ready

  1. Use kubectl to query the status of the pods
$ kubectl get pods
  2. Make sure each pod is Running
NAME READY STATUS RESTARTS AGE
notebook-controller-deployment-5d4f5f874c-vwds8 1/1 Running 0 3h33m
pytorch-operator-844c866d54-q5ztd 1/1 Running 0 3h33m
submarine-database-674987ff7d-r8zqs 1/1 Running 0 3h33m
submarine-minio-5fdd957785-xd987 1/1 Running 0 3h33m
submarine-mlflow-76bbf5c7b-g2ntd 1/1 Running 0 3h33m
submarine-server-66f7b8658b-sfmv8 1/1 Running 0 3h33m
submarine-tensorboard-6c44944dfb-tvbr9 1/1 Running 0 3h33m
submarine-traefik-7cbcfd4bd9-4bczn 1/1 Running 0 3h33m
tf-job-operator-6bb69fd44-mc8ww 1/1 Running 0 3h33m
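
The readiness check can also be scripted. The following is a minimal sketch, not part of Submarine: the helper name not_running is hypothetical, and it only inspects the STATUS column of text shaped like the listing above (it does not look at the READY column).

```python
def not_running(kubectl_output):
    """Return (pod, status) pairs for pods whose STATUS column is not Running."""
    problems = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row that kubectl prints.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if status != "Running":
            problems.append((name, status))
    return problems


# Two rows copied from the listing above.
SAMPLE = """
submarine-database-674987ff7d-r8zqs 1/1 Running 0 3h33m
submarine-server-66f7b8658b-sfmv8 1/1 Running 0 3h33m
"""

if __name__ == "__main__":
    # An empty list means every listed pod reports Running.
    print(not_running(SAMPLE))
```

To use it against a live cluster, pass it the text printed by kubectl get pods, for example captured with subprocess.run.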

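Connect to workbench

  1. Port-forwarding
# using port-forwarding; --address takes the local address to bind (0.0.0.0 listens on all interfaces)
$ kubectl port-forward --address 0.0.0.0 service/submarine-traefik 32080:80
  2. Open the workbench in a browser at local port 32080, the port forwarded above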
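
Before opening the browser, it can help to confirm that the forwarded port accepts connections. The sketch below is only a generic TCP check using the Python standard library; the function name port_is_open is hypothetical, and 32080 is the local port from the port-forward command.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 32080 is the local port used in the port-forward command.
    print(port_is_open("127.0.0.1", 32080))
```

A False result usually means the port-forward process is not running or is bound to a different port.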

Example: Submit a distributed MNIST example

The code for this example is in the project repository. It consists of a training script and a script that builds a Docker image.

1. Write a Python script for distributed training

Take a simple MNIST TensorFlow script as an example. We choose MultiWorkerMirroredStrategy as our distribution strategy.

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.keras import layers, models
from submarine import ModelsClient

# Assumption: these two values are chosen for illustration.
BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 4


def make_datasets_unbatched():
    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)
    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)


def build_and_compile_cnn_model():
    model = models.Sequential()
    model.add(
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    # Assumption: flatten the feature maps so the dense layers get one vector per sample.
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    # Assumption: compile settings for integer labels; accuracy is tracked for the callback.
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def main():
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

    with strategy.scope():
        ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
        options = tf.data.Options()
        # Assumption: shard by element (DATA), since the dataset is not split into per-worker files.
        options.experimental_distribute.auto_shard_policy = \
            tf.data.experimental.AutoShardPolicy.DATA
        ds_train = ds_train.with_options(options)
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model()

    class MyCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # monitor the loss and accuracy
            modelClient.log_metrics({"loss": logs["loss"], "accuracy": logs["accuracy"]}, epoch)

    with modelClient.start() as run:
        multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])


if __name__ == '__main__':
    modelClient = ModelsClient()
    main()
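
MultiWorkerMirroredStrategy finds its peers through the TF_CONFIG environment variable, which lists the worker addresses and identifies the current task. When a job runs under a Kubernetes TensorFlow operator, that variable is typically injected into each pod, which is why the training script never sets it. As a rough sketch of what a three-worker configuration looks like (the helper make_tf_config, the hostnames, and the port are all hypothetical):

```python
import json


def make_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG payload for one worker in a multi-worker job."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must refer to one of worker_hosts")
    return {
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    }


# Hypothetical addresses for a three-worker job.
WORKER_HOSTS = ["mnist-worker-0:2222", "mnist-worker-1:2222", "mnist-worker-2:2222"]

if __name__ == "__main__":
    # Every worker sees the same cluster section and a different task index.
    for index in range(len(WORKER_HOSTS)):
        print(json.dumps(make_tf_config(WORKER_HOSTS, index)))
```

For a local experiment outside the cluster, each process would export its own JSON string as TF_CONFIG before creating the strategy.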

2. Prepare an environment compatible with the training

Build a Docker image that contains the dependencies the training script requires.

$ ./dev-support/examples/quickstart/

3. Submit the experiment

  1. Open the Submarine workbench and click + New Experiment

  2. Fill in the form accordingly. Here we set 3 workers.

    1. Step 1
    2. Step 2
    3. Step 3
    4. The experiment is successfully submitted
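
The worker count chosen in the form feeds directly into the training script, which multiplies the per-replica batch size by the number of replicas in sync. A small arithmetic sketch, assuming one replica per worker (for example CPU-only pods) and a hypothetical per-replica batch size of 4; the value 70 for steps_per_epoch comes from the training script:

```python
def global_batch_size(per_replica, workers, replicas_per_worker=1):
    """Global batch = per-replica batch times the total number of replicas."""
    return per_replica * workers * replicas_per_worker


def samples_per_epoch(global_batch, steps_per_epoch):
    """Number of samples consumed across all workers in one epoch."""
    return global_batch * steps_per_epoch


if __name__ == "__main__":
    batch = global_batch_size(per_replica=4, workers=3)
    print(batch)                         # 12
    print(samples_per_epoch(batch, 70))  # 840
```

Changing the number of workers therefore changes the effective batch size unless the per-replica value is adjusted to compensate.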

4. Monitor the process (modelClient)

  1. In our code, we use modelClient from submarine-sdk to record the metrics. To see the results, click MLflow UI in the workbench.

  2. To compare the metrics of each worker, select all workers and then click compare.
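
The logging pattern itself can be tried without a cluster. The sketch below uses a hypothetical in-memory stand-in, InMemoryClient, that mimics only the two calls used in the training script: start() as a context manager and log_metrics(metrics, step). It is not the submarine-sdk implementation, and the metric values are made up for illustration.

```python
from contextlib import contextmanager


class InMemoryClient:
    """Hypothetical stand-in for ModelsClient; keeps metrics in a list."""

    def __init__(self):
        self.records = []

    @contextmanager
    def start(self):
        # The real client opens a run here; this stand-in just yields itself.
        yield self

    def log_metrics(self, metrics, step):
        for name, value in metrics.items():
            self.records.append((step, name, value))


def best_step(records, metric, lower_is_better=True):
    """Return the step at which `metric` reached its best value."""
    rows = [(value, step) for step, name, value in records if name == metric]
    pick = min if lower_is_better else max
    return pick(rows)[1]


if __name__ == "__main__":
    client = InMemoryClient()
    with client.start():
        # Made-up (loss, accuracy) pairs, one per epoch.
        for epoch, (loss, acc) in enumerate([(0.9, 0.70), (0.5, 0.85), (0.6, 0.83)]):
            client.log_metrics({"loss": loss, "accuracy": acc}, epoch)
    print(best_step(client.records, "loss"))
```

Keying every record by step (here the epoch) is what allows a UI to plot runs from different workers on a shared axis.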

5. Serve the model (In development)