Version: 0.8.0

Generic Experiment Spec

Motivation

As a machine learning platform, Submarine should support multiple machine learning frameworks, such as TensorFlow and PyTorch. However, each framework uses different distributed components for a training experiment. We therefore designed a generic experiment spec that abstracts the training experiment across frameworks. This lets submarine-server hide the differences in the underlying infrastructure and provide a cleaner interface for managing experiments.

Proposal

Considering the TensorFlow and PyTorch frameworks, we propose a single spec that consists of a library spec, a submitter spec, and task specs. For example:

name: "mnist"
librarySpec:
  name: "TensorFlow"
  version: "2.1.0"
  image: "apache/submarine:tf-mnist-with-summaries-1.0"
  cmd: "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150"
  envVars:
    ENV_1: "ENV1"
submitterSpec:
  type: "k8s"
  namespace: "submarine"
taskSpecs:
  Ps:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
  Worker:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"

Library Spec

The library spec describes the machine learning framework. Its fields are listed below:

| field | type | optional | description |
|-------|------|----------|-------------|
| name | string | NO | The machine learning framework name. Only "tensorflow" and "pytorch" are supported. The value is case-insensitive. |
| version | string | NO | The version of the ML framework. For example: 2.1.0 |
| image | string | NO | The default image, used for each task that does not specify its own. For example: apache/submarine |
| cmd | string | YES | The default entry command, used for each task that does not specify its own. |
| envVars | key/value | YES | The default environment variables, used for each task that does not specify its own. |
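
The fallback described in the table (a task uses the library-level image, cmd, and envVars when it does not set its own) can be sketched in Python as follows. This is a minimal illustration, not submarine-server's actual implementation: the function name `resolve_task` is hypothetical, and the sketch assumes that a task-level value replaces the library-level value wholesale rather than being merged key by key.

```python
# Hypothetical helper: fill in task fields from library-level defaults.
# Assumption: a task-level value fully replaces the library-level value.
FALLBACK_FIELDS = ("image", "cmd", "envVars")


def resolve_task(library_spec: dict, task_spec: dict) -> dict:
    """Return a copy of task_spec with missing fields taken from library_spec."""
    resolved = dict(task_spec)  # copy, so the caller's task spec is not mutated
    for field in FALLBACK_FIELDS:
        if field not in resolved and field in library_spec:
            resolved[field] = library_spec[field]
    return resolved


library = {
    "name": "TensorFlow",
    "version": "2.1.0",
    "image": "apache/submarine:tf-mnist-with-summaries-1.0",
    "envVars": {"ENV_1": "ENV1"},
}
worker = {"replicas": 2, "resources": "cpu=4,memory=2048M,nvidia.com/gpu=1"}

resolved_worker = resolve_task(library, worker)
print(resolved_worker["image"])
```

Here the worker task defines no image, so it inherits the one from the library spec; a task that sets its own `image` would keep it.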

Submitter Spec

The submitter spec describes the submitter that the user specifies, such as k8s. Its fields are listed below:

| field | type | optional | description |
|-------|------|----------|-------------|
| type | string | NO | The submitter type. Only k8s is supported for now. |
| configPath | string | YES | The config path of the specified resource manager. You can set it in submarine-site.xml if you run submarine-server locally. |
| namespace | string | NO | Corresponds to the namespace in Kubernetes. |
| kind | string | YES | Used by the k8s submitter. Supports TFJob and PyTorchJob. |
| apiVersion | string | YES | Must match the kind. For example, the API version for TFJob is kubeflow.org/v1. |
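
The constraints in the table can be checked with a small validator. The sketch below is illustrative only: `validate_submitter_spec` is a hypothetical name, and it assumes that `type` and `kind` are matched exactly (case-sensitively), which the spec does not state.

```python
# Hypothetical validator for the submitter spec fields listed above.
# Assumption: "type" and "kind" values are compared case-sensitively.
REQUIRED_FIELDS = ("type", "namespace")
SUPPORTED_TYPES = {"k8s"}
SUPPORTED_KINDS = {"TFJob", "PyTorchJob"}


def validate_submitter_spec(spec: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not spec.get(field):
            problems.append(f"missing required field: {field}")
    if spec.get("type") and spec["type"] not in SUPPORTED_TYPES:
        problems.append(f"unsupported submitter type: {spec['type']}")
    # kind is optional, but when given it must be one of the supported job kinds.
    if "kind" in spec and spec["kind"] not in SUPPORTED_KINDS:
        problems.append(f"unsupported kind: {spec['kind']}")
    return problems


print(validate_submitter_spec({"type": "k8s", "namespace": "submarine"}))
print(validate_submitter_spec({"type": "yarn"}))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a spec at once.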

Task Spec

The task spec describes a task. Together, the tasks make up the experiment, so they must be specified when submitting the experiment. All tasks are placed in a key/value collection. For example:

taskSpecs:
  Ps:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
  Worker:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"

The fields of each task are listed below:

| field | type | optional | description |
|-------|------|----------|-------------|
| name | string | YES | The experiment name. If not specified, the library name is used. |
| image | string | YES | The experiment Docker image. |
| cmd | string | YES | The entry command for running the task. |
| envVars | key/value | YES | The environment variables for the task. |
| resources | string | NO | The resource limits for the task. Format: cpu=%s,memory=%s,nvidia.com/gpu=%s |
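
To make the `resources` format concrete, here is a minimal Python sketch that splits such a string into its individual limits. The function name `parse_resources` is hypothetical and the parsing rules (comma-separated `name=value` pairs, values kept as strings) are inferred from the format above, not taken from submarine-server's code.

```python
def parse_resources(resources: str) -> dict:
    """Split 'cpu=4,memory=2048M,nvidia.com/gpu=1' into a name -> value dict."""
    parsed = {}
    for item in resources.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or extra whitespace
        name, sep, value = item.partition("=")
        if not sep or not name.strip() or not value.strip():
            raise ValueError(f"malformed resource entry: {item!r}")
        # Values stay strings: units such as "2048M" are not interpreted here.
        parsed[name.strip()] = value.strip()
    return parsed


limits = parse_resources("cpu=4,memory=2048M,nvidia.com/gpu=1")
print(limits)
```

Values are deliberately left as strings, since interpreting units such as `2048M` depends on the target resource manager.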

Implementation

For more information, see SUBMARINE-321.