Experiment Implementation
Overviewβ
This document talks about implementation of experiment, flows and design considerations.
Experiment consists of following components, also interact with other Submarine or 3rd-party components, showing below:
              +---------------------------------------+
 +----------+ |      Experiment Tasks                 |
 |Run       | |                                       |
 |Configs   | | +----------------------------------+  |
 +----------+ | |   Experiment Runnable Code       |  | +-----------------+
 +----------+ | |                                  |  | |Output Artifacts |
 |Input Data| | |     (Like train-job.py)          |  | |(Models, etc.)   |
 |          | | +----------------------------------+  | +-----------------+
 |          | | +----------------------------------+  |
 +----------+ | |   Experiment Deps (Like Python)  |  | +-------------+
              | +----------------------------------+  | |Logs/Metrics |
              | +----------------------------------+  | |             |
              | |  OS, Base Libaries (Like CUDA)   |  | +-------------+
              | +----------------------------------+  |
              +---------------------------------------+
                                 ^
                                 | (Launch Task with resources)
                                 +
                 +---------------------------------+
                 |Resource Manager (K8s/Cloud)|
                 +---------------------------------+
As showing in the above diagram, Submarine experiment consists of the following items:
- On the left side, there're input data and run configs.
- In the middle box, they're experiment tasks, it could be multiple tasks when we run distributed training, pipeline, etc.- There're main runnable code, such as train.pyfor the training main entry point.
- The two boxes below: experiment dependencies and OS/Base libraries we called Submarine Environment ProfileorEnvironmentfor short. Which defined what is the basic libraries to run the main experiment code.
- Experiment tasks are launched by Resource Manager, such as K8s/Cloud or just launched locally. There're resources constraints for each experiment tasks. (e.g. how much memory, cores, GPU, disk etc. can be used by tasks).
 
- There're main runnable code, such as 
- On the right side, they're artifacts generated by experiments:- Output artifacts: Which are main output of the experiment, it could be model(s), or output data when we do batch prediction.
- Logs/Metrics for further troubleshooting or understanding of experiment's quality.
 
For the rest of the design doc, we will talk about how we handle environment, code, and manage output/logs, etc.
API of Experimentβ
This is not a full definition of experiment, for more details, please reference to experiment API.
Here's just an example of experiment object which help developer to understand what included in an experiment.
experiment:
       name: "abc",
       type: "script",
       environment: "team-default-ml-env"
       code:
           sync_mode: s3
           url: "s3://bucket/training-job.tar.gz"
       parameter: > python training.py --iteration 10
                    --input=s3://bucket/input output=s3://bucket/output
       resource_constraint:
           res="mem=20gb, vcore=3, gpu=2"
       timeout: "30 mins"
This defined a "script" experiment, which has a name "abc", the name can be used to track the experiment. There's environment "team-default-ml-env" defined to make sure dependencies of the job can be downloaded properly before executing the job.
code defined where the experiment code will be downloaded, we will support a couple of sync_mode like s3 (or abfs/hdfs), git, etc.
Different types of experiments will have different specs, for example distributed Tensorflow spec may look like:
experiment:
       name: "abc-distributed-tf",
       type: "distributed-tf",
       ps:
            environment: "team-default-ml-cpu"
            resource_constraint:
                 res="mem=20gb, vcore=3, gpu=0"
       worker:
            environment: "team-default-ml-gpu"
            resource_constraint:
                 res="mem=20gb, vcore=3, gpu=2"
       code:
           sync_mode: git
           url: "https://foo.com/training-job.git"
       parameter: > python /code/training-job/training.py --iteration 10
                    --input=s3://bucket/input output=s3://bucket/output
       tensorboard: enabled
       timeout: "30 mins"
Since we have different Docker image, one is using GPU and one is not using GPU, we can specify different environment and resource constraint.
Manage environments for experimentβ
Please refer to environment-implementation.md for more details
Manage storages for experimentβ
There're different types of storage, such as logs, metrics, dependencies (environments). For more details. Please refer to storage-implementations for more details. This also includes how to manage code for experiment code.
Manage Pre-defined experiment librariesβ
Flow: Submit an experimentβ
Submit via SDK Flows.β
To better understand experiment implementation, It will be good to understand what is the steps of experiment submission.
Please note that below code is just pseudo code, not official APIs.
Specify what environment to useβ
Before submit the environment, you have to choose what environment to choose. Environment defines dependencies, etc. of an experiment or a notebook. might looks like below:
conda_environment =
"""
  name: conda-env
  channels:
    - defaults
  dependencies:
    - asn1crypto=1.3.0=py37_0
    - blas=1.0=mkl
    - ca-certificates=2020.1.1=0
    - certifi=2020.4.5.1=py37_0
    - cffi=1.14.0=py37hb5b8e2f_0
    - chardet=3.0.4=py37_1003
  prefix: /opt/anaconda3/envs/conda-env
"""
# This environment can be different from notebook's own environment
environment = create_environment {
    DockerImage = "ubuntu:16",
    CondaEnvironment = conda_environment
}
To better understand how environment works, please refer to environment-implementation.
Create experiment, specify where's training code located, and parameters.β
For  ad-hoc experiment (code located at S3), assume training code is part of the training-job.tar.gz and main class is train.py. When the job is launched, whatever specified in the localize_artifacts will be downloaded.
experiment = create_experiment {
    Environment = environment,
    ExperimentConfig = {
       type = "adhoc",
       localize_artifacts = [
            "s3://bucket/training-job.tar.gz"
       ],
       name = "abc",
       parameter = "python training.py --iteration 10 --input="s3://bucket/input output="s3://bucket/output",
    }
}
experiment.run()
experiment.wait_for_finish(print_output=True)
Run notebook file in offline modeβ
It is possible we want to run a notebook file in offline mode, to do that, here's code to use to run a notebook code
experiment = create_experiment {
    Environment = environment,
    ExperimentConfig = {
       type = "adhoc",
       localize_artifacts = [
            "s3://bucket/folder/notebook-123.ipynb"
       ],
       name = "abc",
       parameter = "runipy training.ipynb --iteration 10 --input="s3://bucket/input output="s3://bucket/output",
    }
}
experiment.run()
experiment.wait_for_finish(print_output=True)
Run pre-defined experiment libraryβ
experiment = create_experiment {
    # Here you can use default environment of library
    Environment = environment,
    ExperimentConfig = {
       type = "template",
       name = "abc",
       # A unique name of template
       template = "deepfm_ctr",
       # yaml file defined what is the parameters need to be specified.
       parameter = {
           Input: "S3://.../input",
           Output: "S3://.../output"
           Training: {
              "batch_size": 512,
              "l2_reg": 0.01,
              ...
           }
       }
    }
}
experiment.run()
experiment.wait_for_finish(print_output=True)
Summarize: Experiment v.s. Notebook sessionβ
There's a common misunderstanding about what is the differences between running experiment v.s. running task from a notebook session. We will talk about differences and commonalities:
Differences
| Experiment | Notebook Session | |
|---|---|---|
| Run mode | Offline | Interactive | 
| Output Artifacts (a.k.a model) | Persisted in a shared storage (like S3/NFS) | Local in the notebook session container, could be ephemeral | 
| Run history (meta, logs, metrics) | Meta/logs/metrics can be traced from experiment UI (or corresponding API) | No run history can be traced from Submarine UI/API. Can view the current running paragraph's log/metrics, etc. | 
| What to run? | Code from Docker image or shared storage (like Tarball on S3, Github, etc.) | Local in the notebook's paragraph | 
Commonalities
| Experiment & Notebook Session | |
|---|---|
| Environment | They can share the same Environment configuration | 
Experiment-related modules inside Submarine-serverβ
(Please refer to architecture of submarine server for more details)
Experiment Managerβ
The experiment manager receives the experiment requests, persisting the experiment metas in a database(e.g. MySQL), will invoke subsequence modules to submit and monitor the experiment's execution.
Compute Cluster Managerβ
After experiment accepted by experiment manager, based on which cluster the experiment intended to run (like mentioned in the previous sections, Submarine supports to manage multiple compute clusters), compute cluster manager will returns credentials to access the compute cluster. It will also be responsible to create a new compute cluster if needed.
For most of the on-prem use cases, there's only one cluster involved, for such cases, ComputeClusterManager returns credentials to access local cluster if needed.
Experiment Submitterβ
Experiment Submitter handles different kinds of experiments to run (e.g. ad-hoc script, distributed TF, MPI, pre-defined templates, Pipeline, AutoML, etc.). And such experiments can be managed by different resource management systems (e.g. K8s, container cloud, etc.)
To meet the requirements to support variant kinds of experiments and resource managers, we choose to use plug-in modules to support different submitters (which requires jars to submarine-serverβs classpath).
To avoid jars and dependencies of plugins break the submarine-server, the plug-ins manager, or both. To solve this issue, we can instantiate submitter plug-ins using a classloader that is different from the system classloader.
Submitter Plug-insβ
Each plug-in uses a separate module under the server-submitter module. As the default implements, we provide for K8s.
The submitter-k8s plug-in is used to submit the job to Kubernetes cluster and use the operator as the runtime. The submitter-k8s plug-in implements the operation of CRD object and provides the java interface. In the beginning, we use the tf-operator for the TensorFlow.
If Submarine want to support the other resource management system in the future, such as submarine-docker-cluster (submarine uses the Raft algorithm to create a docker cluster on the docker runtime environment on multiple servers, providing the most lightweight resource scheduling system for small-scale users). We should create a new plug-in module named submitter-docker under the server-submitter module.
Experiment Monitorβ
The monitor tracks the experiment life cycle and records the main events and key info in runtime. As the experiment run progresses, the metrics are needed for evaluation of the ongoing success or failure of the execution progress. Due to adapt the different cluster resource management system, so we need a generic metric info structure and each submitter plug-in should inherit and complete it by itself.
Invoke flows of experiment-related componentsβ
 +-----------------+  +----------------+ +----------------+ +-----------------+
 |Experiments      |  |Compute Cluster | |Experiment      | | Experiment      |
 |Mgr              |  |Mgr             | |Submitter       | | Monitor         |
 +-----------------+  +----------------+ +----------------+ +-----------------+
          +                    +                  +                  +
 User     |                    |                  |                  |
 Submit   |+------------------------------------->+                  +
 Xperiment|          Use submitter.validate(spec) |                  |
          |          to validate spec and create  |                  |
          |          experiment object (state-    |                  |
          |          machine).                    |                  |
          |                                       |                  |
          |          The experiment manager will  |                  |
          |          persist meta-data to Database|                  |
          |                    |                  |                  |
          |                    |                  +                  +
          |+-----------------> +                  |                  |
          |  Submit Experiments|                  |                  |
          |   To ComputeCluster|                  |                  |
          |   Mgr, get existing|+---------------->|                  |
          |   cluster, or      |  Use Submitter   |                  |
          |   create a new one.|  to submit       |+---------------> |
          |                    |  Different kinds |  Once job is     |
          |                    |  of experiments  |  submitted, use  |+----+
          |                    |  to k8s, etc|  monitor to get  |     |
          |                    |                  |  status updates  |     |
          |                    |                  |                  |     | Monitor
          |                    |                  |                  |     | Xperiment
          |                    |                  |                  |     | status
          |                    |                  |                  |     |
          |<--------------------------------------------------------+|     |
          |                    |                  |                  |     |
          |                  Update Status back to Experiment        |     |
          |                    |      Manager     |                  |<----+
          |                    |                  |                  |
          |                    |                  |                  |
          |                    |                  |                  |
          v                    v                  v                  v
TODO: add more details about template, environment, etc.
Common modules of experiment/notebook-session/model-servingβ
Experiment/notebook-session/model-serving share a lot of commonalities, all of them are:
- Some workloads running on K8s.
- Need persist meta data to DB.
- Need monitor task/service running status from resource management system.
We need to make their implementation are loose-coupled, but at the same time, share some building blocks as much as possible (e.g. submit PodSpecs to K8s, monitor status, get logs, etc.) to reduce duplications.
Support Predefined-experiment-templatesβ
Predefined Experiment Template is just a way to save data-scientists time to repeatedly entering parameters which is not error-proof and user experience is also bad.
Predefined-experiment-template API to run experimentβ
Predefined experiment template consists a list of parameters, each of the parameter has 4 properties:
| Key | Required | Default Value | Description | 
|---|---|---|---|
| Name of the key | true/false | When required = false, a default value can be provided by the template | Description of the parameter | 
For the example of deepfm CTR training experiment mentioned in the architecture-and-requirements.md
{
  "input": {
    "train_data": ["hdfs:///user/submarine/data/tr.libsvm"],
    "valid_data": ["hdfs:///user/submarine/data/va.libsvm"],
    "test_data": ["hdfs:///user/submarine/data/te.libsvm"],
    "type": "libsvm"
  },
  "output": {
    "save_model_dir": "hdfs:///user/submarine/deepfm",
    "metric": "auc"
  },
  "training": {
    "batch_size" : 512,
    "field_size": 39,
    "num_epochs": 3,
    "feature_size": 117581,
    ...
  }
}
The template will be (in yaml format):
# deepfm.ctr template
name: deepfm.ctr
author:
description: >
  This is a template to run CTR training using deepfm algorithm, by default it runs
  single node TF job, you can also overwrite training parameters to use distributed
  training.
parameters:
  - name: input.train_data
    required: true
    description: >
      train data is expected in SVM format, and can be stored in HDFS/S3
    ...
  - name: training.batch_size
    required: false
    default: 32
    description: This is batch size of training
The batch format can be used in UI/API.
Handle Predefined-experiment-template from server sideβ
Please note that, the conversion of predefined-experiment-template will be always handled by server. The invoke flow looks like:
                         +------------Submarine Server -----------------------+
   +--------------+      |  +-----------------+                               |
   |Client        |+------->|Experimment Mgr  |                               |
   |              |      |  |                 |                               |
   +--------------+      |  +-----------------+                               |
                         |          +                                         |
          Submit         |  +-------v---------+       Get Experiment Template |
          Template       |  |Experiment       |<-----+From pre-registered     |
          Parameters     |  |Template Registry|       Templates               |
          to Submarine   |  +-------+---------+                               |
          Server         |          |                                         |
                         |  +-------v---------+       +-----------------+     |
                         |  |Deepfm CTR Templ-|       |Experiment-      |     |
                         |  |ate Handler      +------>|Tensorflow       |     |
                         |  +-----------------+       +--------+--------+     |
                         |                                     |              |
                         |                                     |              |
                         |                            +--------v--------+     |
                         |                            |Experiment       |     |
                         |                            |Submitter        |     |
                         |                            +--------+--------+     |
                         |                                     |              |
                         |                                     |              |
                         |                            +--------v--------+     |
                         |                            |                 |     |
                         |                            | ......          |     |
                         |                            +-----------------+     |
                         |                                                    |
                         +----------------------------------------------------+
Basically, from Client, it submitted template parameters to Submarine Server, inside submarine server, it finds the corresponding template handler based on the name. And the template handler converts input parameters to an actual experiment, such as a distributed TF experiment. After that, it goes the similar route to validate experiment spec, compute cluster manager, etc. to get the experiment submitted and monitored.
Predefined-experiment-template is able to create any kind of experiment, it could be a pipeline:
   +-----------------+                  +------------------+
   |Template XYZ     |                  | XYZ Template     |
   |                 |+---------------> | Handler          |
   +-----------------+                  +------------------+
                                                   +
                                                   |
                                                   |
                                                   |
                                                   |
                                                   v
             +--------------------+      +------------------+
             | +-----------------+|      | Predefined       |
             | |  Split Train/   ||<----+| Pipeline         |
             | |  Test data      ||      +------------------+
             | +-------+---------+|
             |         |          |
             | +-------v---------+|
             | |  Spark Job ETL  ||
             | |                 ||
             | +-------+---------+|
             |         |          |
             | +-------v---------+|
             | | Train using     ||
             | | XGBoost         ||
             | +-------+---------+|
             |         |          |
             | +-------v---------+|
             | | Validate Train  ||
             | | Results         ||
             | +-----------------+|
             |                    |
             +--------------------+
Template can be also chained to reuse other template handlers
   +-----------------+                  +------------------+
   |Template XYZ     |                  | XYZ Template     |
   |                 |+---------------> | Handler          |
   +-----------------+                  +------------------+
                                                   +
                                                   |
                                                   v
               +------------------+      +------------------+
               |Distributed       |      | ABC Template     |
               |TF Experiment     |<----+| Handler          |
               +------------------+      +------------------+
Template Handler is a callable class inside Submarine Server with a standard interface defined like.
interface ExperimentTemplateHandler {
   ExperimentSpec createExperiment(TemplatedExperimentParameters param)
}
We should avoid users to do coding when they want to add new template, we should have several standard template handler to deal with most of the template handling.
Experiment templates can be registered/updated/deleted via Submarine Server's REST API, which need to be discussed separately in the doc. (TODO)