Generic Experiment Spec
Motivation
As a machine learning platform, Submarine should support multiple machine learning frameworks, such as TensorFlow and PyTorch. However, each framework uses different distributed components for a training experiment. We therefore designed a generic experiment spec that abstracts the training experiment across frameworks. In this way, the submarine-server can hide the complexity of the underlying infrastructure differences and provide a cleaner interface for managing experiments.
Proposal
Considering the TensorFlow and PyTorch frameworks, we propose a single spec that consists of a library spec, a submitter spec, and task specs. For example:
name: "mnist"
librarySpec:
  name: "TensorFlow"
  version: "2.1.0"
  image: "apache/submarine:tf-mnist-with-summaries-1.0"
  cmd: "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150"
  envVars:
    ENV_1: "ENV1"
submitterSpec:
  type: "k8s"
  namespace: "submarine"
taskSpecs:
  Ps:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
  Worker:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
Library Spec
The library spec describes the machine learning framework. The fields are as follows:
field | type | optional | description |
---|---|---|---|
name | string | NO | The machine learning framework name. Only "tensorflow" and "pytorch" are supported. The value is case-insensitive. |
version | string | NO | The version of the ML framework. For example: 2.1.0 |
image | string | NO | The shared image, used by any task that does not specify its own. For example: apache/submarine |
cmd | string | YES | The shared entry command, used by any task that does not specify its own. |
envVars | key/value | YES | The shared environment variables, used by any task that does not specify its own. |
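
As an illustration of the table above, a minimal library spec needs only the three required fields. The sketch below reuses the values from the earlier example and writes the framework name in lowercase, which the table describes as equivalent. With the optional cmd and envVars left out here, each task would presumably have to supply whatever it needs itself; that reading is an assumption, not something the spec states.

```yaml
# Minimal library spec: only the required fields are set.
librarySpec:
  name: "tensorflow"   # case-insensitive, so this matches "TensorFlow"
  version: "2.1.0"
  image: "apache/submarine:tf-mnist-with-summaries-1.0"
  # cmd and envVars are optional and intentionally omitted
```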
Submitter Spec
The submitter spec describes the submitter that the user specifies, such as k8s. The fields are as follows:
field | type | optional | description |
---|---|---|---|
type | string | NO | The submitter type. Only k8s is currently supported. |
configPath | string | YES | The config path of the specified resource manager. You can set it in submarine-site.xml if you run submarine-server locally. |
namespace | string | NO | Corresponds to the namespace in Kubernetes. |
kind | string | YES | Used by the k8s submitter. Supports TFJob and PyTorchJob. |
apiVersion | string | YES | Must match the kind. For example, the API version for TFJob is kubeflow.org/v1. |
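
For reference, a submitter spec that also sets the optional kind and apiVersion fields might look like the sketch below. The values come from the table above and the earlier example. configPath is omitted because no concrete path is given in this document.

```yaml
# Submitter spec for the k8s submitter with the optional
# kind/apiVersion pair set explicitly.
submitterSpec:
  type: "k8s"
  namespace: "submarine"
  kind: "TFJob"                  # or PyTorchJob, per the table
  apiVersion: "kubeflow.org/v1"  # must pair with the kind
```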
Task Spec
The task spec describes the tasks that make up the experiment, so it must be specified when submitting an experiment. All tasks should be put into a key/value collection. For example:
taskSpecs:
  Ps:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
  Worker:
    name: tensorflow
    replicas: 2
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
The fields are as follows:
field | type | optional | description |
---|---|---|---|
name | string | YES | The experiment name. If not specified, the library name is used. |
image | string | YES | The experiment Docker image |
cmd | string | YES | The entry command for running the task |
envVars | key/value | YES | The environment variables for the task |
resources | string | NO | The resource limits for the task. Format: cpu=%s,memory=%s,nvidia.com/gpu=%s |
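
The optional task fields make per-task overrides possible. The sketch below shows one way this could look, assuming the fallback behavior described in the library spec table: Ps sets only resources and replicas and would therefore inherit image, cmd, and envVars from the library spec, while Worker overrides them. The batch size and the ENV_2 variable are hypothetical values chosen for illustration.

```yaml
taskSpecs:
  Ps:
    # No image/cmd/envVars: falls back to the librarySpec values.
    replicas: 1
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
  Worker:
    replicas: 2
    # Task-level overrides (hypothetical values for illustration).
    image: "apache/submarine:tf-mnist-with-summaries-1.0"
    cmd: "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=100"
    envVars:
      ENV_2: "ENV2"
    # Follows the format cpu=%s,memory=%s,nvidia.com/gpu=%s
    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
```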
Implementation
For more information, see SUBMARINE-321.