
Storage Implementation

First, let's look at what users will interact with most of the time:

  • Notebook
  • Experiment
  • Model Servings


+-------------------+     +------------+     +------------------+
| Logs              |<----+ Notebook   +---->| Model Artifacts  |
+-------------------+     +------------+     +------------------+
| ML-related Metric |<----+ Experiment +---->| Environments     |
|  - Trackings      |     +------------+     |  - Dependencies  |
|  - tf.events      |<----+ Servings   +---->|  - Docker Images |
+-------------------+     +-----+------+     +------------------+
| Code              |<----------+
+-------------------+           |
                                v
                   +----------------------+
                   | Submarine Metastore  |
                   +----------------------+
                   | Experiment Meta      |
                   | Model Store Meta     |
                   | Model Serving Meta   |
                   | Notebook Meta        |
                   | Experiment Templates |
                   | Environments Meta    |
                   +----------------------+

First of all, notebook sessions / experiments / model-serving instances all interact, more or less, with the following storage objects:

  • Logs for these tasks, for troubleshooting.
  • ML-related metrics such as loss, epoch, etc. (in contrast to system metrics such as CPU/memory usage, etc.)
    • There are different types of ML-related metrics; for TensorFlow/PyTorch, jobs can write tf.events files and get visualizations in TensorBoard.
    • Alternatively, they can use tracking APIs (such as Submarine tracking, MLflow tracking, etc.) to output customized tracking results for non-TF/PyTorch workloads.
  • Training jobs of an experiment typically generate model artifacts (files) which need to be persisted; notebooks and model servings both need to load model artifacts from persistent storage.
  • There is various meta information, such as experiment meta, model registry, model serving, notebook, experiment, environment, etc. We need to be able to read this meta information back.
  • We also have code for experiments (like training/batch-prediction), notebooks (ipynb files), and model servings.
  • Notebooks/experiments/model servings also depend on environments (dependencies such as pip packages, and Docker images).
| Object Type                               | Characteristics                                                                                   | Where to store                                                           |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Metrics: tf.events                        | Time-series data with k/v, appendable to file                                                      | Local/EBS, HDFS, Cloud Blob Storage                                      |
| Metrics: other tracking metrics           | Time-series data with k/v, appendable to file                                                      | Local, HDFS, Cloud Blob Storage, Database                                |
| Logs                                      | Large volumes; #files are potentially huge                                                         | Local (temporary), HDFS (needs aggregation), Cloud Blob Storage          |
| Submarine Metastore                       | CRUD operations for small metadata                                                                 | Database                                                                 |
| Model Artifacts                           | Size varies per model (from KBs to GBs); #files are potentially huge                               | HDFS, Cloud Blob Storage                                                 |
| Code                                      | Needs version control (please find detailed discussions below for code storage and localization)   | Tarball on HDFS/Cloud Blob Storage, or Git                               |
| Environment (Dependencies, Docker Image)  |                                                                                                     | Public/private environment repo (like a Conda channel), Docker registry  |
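
To make the Environment row above more concrete, here is a purely illustrative sketch of what an environment definition could contain; the field names and values are assumptions for illustration, not a confirmed Submarine spec:

# Illustrative only: field names and values are not a confirmed Submarine spec.
environment:
  name: "team-default-ml-env"        # referenced by the experiment example later in this document
  docker_image: "team/ml-base:1.0"   # libraries/dependencies only; no code or data (see discussion below)
  dependencies:
    conda_channels:
      - defaults
    conda:
      - python=3.7
      - tensorflow=2.4
    pip:
      - mlflow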

Detailed discussions

Store code for experiment/notebook/model-serving

There are the following ways to get experiment code:

1) Code is part of Git repo: (Recommended)

This is our recommended approach: once code is part of a Git repo, it is under version control, every change is tracked, and it is much easier for users to trace back which change introduced a bug, etc.

2) Code is part of Docker image:

This is an anti-pattern and we do NOT recommend it. A Docker image can be used to include ANYTHING: dependencies, the code you will execute, or even data. But that doesn't mean you should do it. We recommend using Docker images ONLY for libraries/dependencies.

Making code part of the Docker image makes the code hard to edit (if you want to update a value in your Python file, you have to rebuild the Docker image, push it, and rerun it).

3) Code is part of S3/HDFS/ABFS:

Users may want to store their training code as a tarball on shared storage. Submarine needs to download the code from the remote storage into the launched container before running it.

Localization of experiment/notebook/model-serving code

To keep the user experience the same across different environments, we localize code to the same folder after the container is launched, preferably /code.

For example, suppose a Git repo needs to be synced for an experiment/notebook/model-serving (example below):

experiment:   # Or notebook, model-serving
  name: "abc"
  environment: "team-default-ml-env"
  ... (other fields)
  code:
    sync_mode: git
    url: "https://foo.com/training-job.git"

After localization, training-job/ will be placed under /code.
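
Code stored as a tarball on shared storage (option 3 above) could hypothetically be localized the same way. The sync_mode value and tarball URL below are assumptions by analogy with the Git example, not a confirmed Submarine API:

experiment:   # Or notebook, model-serving
  name: "abc"
  environment: "team-default-ml-env"
  code:
    sync_mode: s3                               # hypothetical value, by analogy with "git"
    url: "s3://my-bucket/training-job.tar.gz"   # illustrative bucket and path

After localization, the extracted contents of the tarball would likewise end up under /code.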

When running in a K8s environment, we can use K8s initContainers and emptyDir to do this for us. The K8s Pod spec below is generated by the Submarine server instead of by the user (users should NEVER edit K8s specs; that is too unfriendly to data scientists):

apiVersion: v1
kind: Pod
metadata:
  name: experiment-abc
spec:
  containers:
  - name: experiment-task
    image: training-job
    volumeMounts:
    - name: code-dir
      mountPath: /code
  initContainers:
  - name: git-localize
    image: git-sync
    # ".." stands for the repo URL taken from the experiment spec above
    command: ["sh", "-c", "git clone .. /code/"]
    volumeMounts:
    - name: code-dir
      mountPath: /code
  volumes:
  - name: code-dir
    emptyDir: {}

The above K8s spec creates a code-dir volume and mounts it at /code in the launched containers. The initContainer git-localize uses https://github.com/kubernetes/git-sync to do the sync. (If other storage such as S3 is used, we can use a similar initContainer approach to download the contents.)
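
As a sketch of that S3 variant, the initContainer could download and unpack the tarball instead of cloning a repo. The image name, bucket, and paths below are assumptions for illustration (the init image only needs the aws CLI and tar available); this is not a spec Submarine generates today:

apiVersion: v1
kind: Pod
metadata:
  name: experiment-abc
spec:
  containers:
  - name: experiment-task
    image: training-job
    volumeMounts:
    - name: code-dir
      mountPath: /code
  initContainers:
  - name: s3-localize
    image: amazon/aws-cli            # assumed image; any image with the aws CLI and tar works
    command: ["sh", "-c", "aws s3 cp s3://my-bucket/training-job.tar.gz /tmp/ && tar -xzf /tmp/training-job.tar.gz -C /code"]
    volumeMounts:
    - name: code-dir
      mountPath: /code
  volumes:
  - name: code-dir
    emptyDir: {}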

Other than ML-related objects, we have system-related objects, including:

  • Daemon logs (like logs of Submarine server).
  • Logs for other dependency components (like Kubernetes logs when running on K8s).
  • System metrics (Physical resource usages by daemons, launched training containers, etc.).

All this information should be handled by 3rd-party systems such as Grafana, Prometheus, etc., and system admins are responsible for setting up these infrastructures and dashboards. Users of Submarine should NOT interact with system-related metrics/logs; that is the system admin's responsibility.

Attachable Volumes

Users may need an attachable volume for their experiments / notebooks. This is especially useful for notebook storage, since the contents of a notebook can be saved automatically and the volume can be used as the user's home folder.

The downside of an attachable volume is that it is not versioned. Even though notebooks are mainly used for ad-hoc exploration tasks, an unversioned notebook file can lead to maintenance issues in the future.

Since this is a common requirement, we can consider supporting attachable volumes in Submarine in the long run, but with relatively lower priority.
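
If we do support this on K8s, a minimal sketch could be a PersistentVolumeClaim mounted as the notebook's home folder; the names, size, and mount path below are illustrative only:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-abc-home            # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi                  # illustrative size
---
apiVersion: v1
kind: Pod
metadata:
  name: notebook-abc
spec:
  containers:
  - name: notebook
    image: notebook-image            # illustrative image
    volumeMounts:
    - name: home-dir
      mountPath: /home/notebook      # auto-saved notebook contents survive container restarts
  volumes:
  - name: home-dir
    persistentVolumeClaim:
      claimName: notebook-abc-home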

In-scope / Out-of-scope

This section describes what the Submarine project should own and what it should NOT own.