PyTorch

Use the popular PyTorch framework to train models on Gradient.

PyTorch is an open source ML framework developed by Facebook's AI Research lab (FAIR) for both training and deploying ML models. PyTorch is widely popular in the research community but is also used in large production environments.

Gradient supports any version of PyTorch for Notebooks, Experiments, or Jobs. In Gradient, the ML framework used to execute workloads runs within a Docker container. Containers are lightweight and portable environments that can easily be customized to include specific framework versions and other libraries. Any Docker container is supported on the Gradient platform. This flexibility makes it easy to switch between frameworks, to update them from one version to another, and to incorporate other libraries alongside the framework itself.

A set of pre-built containers is provided out of the box, though any container hosted on a public or private container registry can be used.

Running workloads with PyTorch

When launching a workload via the web interface, the CLI, or automatically via a pipeline step, you can simply pass in a Docker image path (e.g. <inline-code>paperspace/dl-containers:pytorch-py36-cu100-jupyter</inline-code>). There are also several pre-configured templates available. These templates are updated regularly and tuned to run well on Gradient.

A set of pre-built containers can be used as a starting point within Gradient

When using the CLI, the command would look something like this:

gradient notebooks create \
--name "my notebook" \
--container paperspace/dl-containers:pytorch-py36-cu100-jupyter \
--machineType P5000 \
--command "/paperspace/" \
--projectId your-project-id
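Once the container is running, training proceeds with ordinary PyTorch code; nothing Gradient-specific is required inside the script. As a minimal, hypothetical sketch (the model, data, and hyperparameters here are illustrative, not part of the Gradient API):

```python
# Minimal PyTorch training loop: fit a linear model to a toy regression task.
# Everything here (model shape, data, learning rate) is illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Synthetic data: the target is simply the sum of the input features.
X = torch.randn(64, 10)
y = X.sum(dim=1, keepdim=True)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    if epoch == 0:
        first_loss = loss.item()
    loss.backward()
    optimizer.step()

final_loss = loss_fn(model(X), y).item()
```

The same script runs unchanged in a Gradient Notebook or as a Job; on a GPU machine type you would move the model and tensors to CUDA with `.cuda()` or `.to("cuda")`.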

Distributed training with MPI

Gradient offers push-button distributed / multi-node training as a first-class citizen. Scaling out your workloads with an MPI-based architecture like Horovod doesn't require any background in DevOps and can be accomplished with a few additional lines of code. By specifying the <inline-code>multinode</inline-code> mode and a few additional parameters, you can take any PyTorch model and execute training across as many nodes as desired. Learn more in the docs.
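The "few additional lines" for Horovod look roughly like the following sketch. It assumes Horovod is installed in the container and the script is launched by an MPI runner across the nodes; the model and hyperparameters are illustrative, not prescribed by Gradient:

```python
# Sketch of adapting a PyTorch training script for Horovod multi-node training.
# Requires horovod and an MPI launcher (e.g. mpirun); model/lr are illustrative.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                  # initialize the MPI context
torch.cuda.set_device(hvd.local_rank())     # pin each worker to one local GPU

model = nn.Linear(10, 1).cuda()

# Common convention: scale the learning rate by the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers with allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ...the training loop itself is unchanged from the single-node version...
```

Each worker typically also shards the dataset (e.g. with `torch.utils.data.distributed.DistributedSampler` using `hvd.size()` and `hvd.rank()`) so that every node trains on a distinct slice of the data.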