
Quickstart

This document gives you a quick overview of the basic usage of the Submarine platform. You can complete each step of the ML model lifecycle on the platform without wrestling with troublesome environment problems.

Installation

Prepare a Kubernetes cluster

  1. Prerequisites: install Docker, kubectl, Helm, and minikube.
  2. Start a minikube cluster
minikube start --vm-driver=docker --cpus 8 --memory 4096 --kubernetes-version v1.21.2
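
To confirm the cluster is up before moving on, you can check the node status (a generic kubectl check, not part of the original steps):

    # Verify that the minikube node reports Ready
    kubectl get nodes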

Launch submarine in the cluster

  1. Clone the project
git clone https://github.com/apache/submarine.git
  2. Install the submarine operator and its dependencies with the Helm chart
cd submarine
helm install submarine ./helm-charts/submarine
  3. Create a Submarine custom resource, and the operator will create the submarine server, database, etc. for us.
kubectl apply -f submarine-cloud-v2/artifacts/examples/example-submarine.yaml
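
As a quick sanity check (our addition, not in the original guide), you can confirm that the Helm release deployed and that the custom resource was accepted; the `submarines` resource name assumes the Submarine CRD from the Helm chart has been registered:

    # Confirm the Helm release is deployed
    helm list
    # List Submarine custom resources (requires the CRD installed by the chart)
    kubectl get submarines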

Ensure submarine is ready

  1. Use kubectl to query the status of the pods
kubectl get pods
  2. Make sure every pod is Running
NAME                                              READY   STATUS    RESTARTS   AGE
notebook-controller-deployment-5d4f5f874c-mnbc8   1/1     Running   0          61m
pytorch-operator-844c866d54-xm8nl                 1/1     Running   2          61m
submarine-database-85bd68dbc5-qggtm               1/1     Running   0          11m
submarine-minio-76465444f6-hdgdp                  1/1     Running   0          11m
submarine-mlflow-75f86d8f4d-rj2z7                 1/1     Running   0          11m
submarine-operator-5dd79cdf86-gpm2p               1/1     Running   0          61m
submarine-server-68985b767-vjdvx                  1/1     Running   0          11m
submarine-tensorboard-5df8499fd4-vnklf            1/1     Running   0          11m
submarine-traefik-7cbcfd4bd9-wbf8b                1/1     Running   0          61m
tf-job-operator-6bb69fd44-zmlmr                   1/1     Running   1          61m
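
If some pods are still starting, you can block until everything is ready instead of polling by hand (a standard kubectl idiom, added here for convenience):

    # Wait up to 10 minutes for all pods in the namespace to become Ready
    kubectl wait --for=condition=Ready pods --all --timeout=600s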

Connect to workbench

  1. Expose the service

    # Method 1 -- use minikube ip
    minikube ip # you'll get the IP address of minikube, ex: 192.168.49.2

    # Method 2 -- use port-forwarding
    kubectl port-forward --address 0.0.0.0 service/submarine-traefik 32080:80
  2. View the workbench. If you use method 1, go to http://{minikube ip}:32080 (for example, http://192.168.49.2:32080). If you use method 2, go to http://0.0.0.0:32080.
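
Before opening a browser, you can verify that the workbench answers over HTTP; the IP below is the example minikube address from method 1, so substitute your own:

    # Expect an HTTP response (200 or a redirect) from the workbench
    curl -I http://192.168.49.2:32080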

Example: Submit a distributed MNIST example

We put the code of this example under dev-support/examples/quickstart in the repository. train.py is our training script, and build.sh is the script that builds the docker image.

1. Write a Python script for distributed training

Take a simple MNIST TensorFlow script as an example. We choose MultiWorkerMirroredStrategy as our distribution strategy.

"""
./dev-support/examples/quickstart/train.py
Reference: https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_strategy-with-keras.py
"""

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.keras import layers, models
import submarine

def make_datasets_unbatched():
BUFFER_SIZE = 10000

# Scaling MNIST data from (0, 255] to (0., 1.]
def scale(image, label):
image = tf.cast(image, tf.float32)
image /= 255
return image, label

datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)

return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)


def build_and_compile_cnn_model():
model = models.Sequential()
model.add(
layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

model.summary()

model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])

return model

def main():
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
communication=tf.distribute.experimental.CollectiveCommunication.AUTO)

BATCH_SIZE_PER_REPLICA = 4
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

with strategy.scope():
ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = \
tf.data.experimental.AutoShardPolicy.DATA
ds_train = ds_train.with_options(options)
# Model building/compiling need to be within `strategy.scope()`.
multi_worker_model = build_and_compile_cnn_model()

class MyCallback(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs=None):
# monitor the loss and accuracy
print(logs)
submarine.log_metrics({"loss": logs["loss"], "accuracy": logs["accuracy"]}, epoch)

multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])


if __name__ == '__main__':
main()

2. Prepare an environment compatible with the training

Build a docker image that provides the environment required by the training script.

# Point the docker CLI at minikube's docker daemon so the built image is visible to the cluster
eval $(minikube docker-env)
# Build the example image
./dev-support/examples/quickstart/build.sh
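
You can check that the image landed in minikube's image store. The exact image name is defined in build.sh, so the `quickstart` pattern below is only an assumption; adjust it to whatever tag the script uses:

    # List images in minikube's docker daemon; adjust the pattern to match build.sh's tag
    docker images | grep -i quickstart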

3. Submit the experiment

  1. Open the Submarine workbench and click + New Experiment

  2. Choose Define your experiment

  3. Fill in the form accordingly. Here we set 3 workers.

    (Screenshots: Step 1, Step 2, Step 3, and the successfully submitted experiment.) Once submitted, you can also verify the experiment from the command line, as shown below.
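
Behind the scenes the server creates a TFJob for the experiment, so you can watch it from the command line as well (this check is our addition; it assumes the TFJob CRD installed by the tf-job-operator shown earlier):

    # List TFJobs created for submitted experiments
    kubectl get tfjobs
    # Watch the worker pods spin up
    kubectl get pods --watch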

4. Monitor the process

  1. In our code, we use submarine from the submarine-sdk to record the metrics. To see the result, click the experiment named mnist-example in the workbench.
  2. To see the metrics of each worker, select a worker from the list at the top left.
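
The same per-worker output is also visible in the raw container logs; the pod name below is a placeholder, so copy a real worker pod name from kubectl get pods:

    # Stream logs from one worker pod (replace the placeholder with a real pod name)
    kubectl logs -f <mnist-example-worker-pod>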

5. Serve the model (In development)