Turbocharge Multi-Agent Reinforcement Learning with WarpDrive and PyTorch Lightning


TL;DR: WarpDrive is a flexible, lightweight, easy-to-use end-to-end reinforcement learning (RL) framework that enables orders-of-magnitude faster training on a single GPU. PyTorch Lightning lets you modularize experimental code and build production-ready workloads quickly. Together, they can significantly accelerate multi-agent RL research and development.


Reinforcement Learning: Agents Learn by Maximizing Rewards

Reinforcement Learning (RL) is a subfield of Machine Learning (ML) that deals with how intelligent agents should act in an environment in order to maximize a reward. This reward can be defined in various ways depending on the domain.
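Loosely speaking, the agent learns a policy that maximizes the expected cumulative (often discounted) reward. A standard way to write this objective is:

    max_π  E_π [ Σ_{t=0..T} γ^t · r_t ],    0 < γ ≤ 1,

where π is the agent's policy, r_t is the reward received at step t, and γ is a discount factor that trades off immediate against future rewards.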

The basic concept of RL is learning through interaction with a simulation or the real world, while making minimal assumptions about how that simulation works. This makes RL very flexible: for instance, it can optimize a wide range of objectives that do not need to conform to simplifying mathematical forms.

RL has emerged as a promising approach to training AI agents in fields such as strategy games and robotics. In particular, RL with multiple agents (or entities) is a frontier for RL research and applications and is key to solving many challenges in areas such as economics, conversational agents, and robotics.

For example, the AI Economist uses RL to learn economic policies designed to maximize a combination of societal goals using a more realistic model of the world (as opposed to more abstract models that make oversimplifying assumptions). The system can learn tax policies that maximize multiple factors, such as keeping productivity high while improving the social good. In other words, the optimal “reward” in this application of RL might be a high score on both productivity and equality for the tax policy it designs.

Multi-Agent RL: Performance Challenges

While RL’s potential for solving important problems is clear, building engineering systems that perform efficient multi-agent RL training is a massive challenge. Traditional multi-agent RL frameworks (e.g., MAVA, MENGER, SEED RL) typically use CPUs for running simulation roll-outs and GPUs for training, which can be slow and inefficient, with experiments taking days or even weeks.

The main performance bottlenecks stem from repeated data transfers between the CPU and the GPU, and from the fact that CPUs do not parallelize computations well across agents and environments. This communication is necessary because simulations are typically implemented in CPU code, while building GPU-based simulations can be tedious. Moreover, there are few integrated solutions that allow users to easily combine GPU simulations with GPU-based model training.

Accelerating Multi-Agent RL with WarpDrive

In an effort to help significantly accelerate multi-agent RL research and development, we recently released WarpDrive - a modular, lightweight, and easy-to-use RL framework that implements end-to-end deep multi-agent RL on a single GPU.

By running simulations across multiple agents and environments in parallel on separate GPU threads, it enables orders-of-magnitude faster RL compared to traditional systems. WarpDrive is also very efficient as it eliminates the back-and-forth data copying between the CPU and the GPU. In particular, all the relevant data is only copied once from the CPU to the GPU’s memory, and all data transformations occur in place.

In essence, WarpDrive provides easy-to-use APIs and utilities to build and train custom multi-agent RL pipelines with just a few lines of code.
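For example, wrapping a CUDA-based multi-agent environment for GPU roll-outs takes only a few lines. The sketch below follows WarpDrive's public tutorials, but the import paths, class names, and arguments should be treated as assumptions; see the repository for the exact, up-to-date API.

```python
# A minimal sketch of setting up a GPU-resident multi-agent environment with
# WarpDrive. Import paths and arguments follow the public tutorials but are
# assumptions here -- check the WarpDrive repo for the exact API.
from example_envs.tag_continuous.tag_continuous import TagContinuous
from warp_drive.env_wrapper import EnvWrapper

# Hypothetical environment settings: a Tag game with a few taggers chasing
# many runners.
env_config = dict(num_taggers=5, num_runners=100)

# Wrap the CUDA-backed environment; roll-outs for all agents and all
# environment replicas then run in parallel on the GPU.
env_wrapper = EnvWrapper(
    TagContinuous(**env_config),
    num_envs=100,   # number of environment replicas simulated in parallel
    use_cuda=True,  # keep the simulation (and later the training) on the GPU
)
```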

WarpDrive + PyTorch Lightning Integration

PyTorch Lightning is a machine learning framework that significantly reduces trainer boilerplate code and improves training modularity and flexibility. It abstracts away most of the engineering code so users can focus on research and building models, and iterate on experiments quickly. PyTorch Lightning also makes it easy to run models on your own hardware, and supports distributed training, model checkpointing, performance profiling, logging, and visualization.
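To illustrate the division of labor, here is a minimal, generic Lightning sketch (the model and loss below are illustrative, not WarpDrive's actual policy network): the user writes only the model, the loss, and the optimizer, while the Trainer owns the loop.

```python
import torch
import pytorch_lightning as pl


class PolicyModule(pl.LightningModule):
    # A hypothetical policy model: only the model and the loss live here;
    # Lightning supplies the surrounding engineering (backprop, optimizer
    # stepping, gradient clipping, logging, checkpointing).
    def __init__(self, obs_dim: int = 8, num_actions: int = 4):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, num_actions),
        )

    def training_step(self, batch, batch_idx):
        obs, actions, advantages = batch
        logits = self.net(obs)
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        loss = -(log_probs * advantages).mean()  # simple policy-gradient loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Gradient clipping becomes a Trainer argument rather than hand-written code.
trainer = pl.Trainer(max_epochs=10, gradient_clip_val=0.5)
# trainer.fit(PolicyModule(), train_dataloaders=...)  # supply a DataLoader of roll-outs
```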

For those who enjoy using PyTorch Lightning, this post covers the integration of Lightning with WarpDrive, which makes multi-agent RL training even easier and faster to implement. In particular, the integration provides the following benefits:

Simple to set up: Perform end-to-end training of multi-agent RL environments in just a few lines of code. We have provided a starting example in this tutorial.

Significantly reduces training boilerplate: The key components of the training loop such as loss backpropagation, the optimization step, and gradient clipping can be removed from the training code, as they are handled automatically by the PyTorch Lightning Trainer.

Modularizes the code even further: The code is cleaner and more organized as it better separates the data generation piece from the training piece.

Adds support for training callbacks: PyTorch Lightning also allows users to add callbacks that can be invoked at various times during training. This feature enhances code readability and flexibility.

Overall, PyTorch Lightning abstracts most of the engineering pieces of code and provides a modular simulation and training workflow so that users can focus on WarpDrive CUDA environment development, MARL research, and building models.
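To make these benefits concrete, here is a rough end-to-end sketch of training with the integration. The WarpDriveModule name and its arguments follow the tutorial notebook accompanying this post, but treat them as assumptions; env_wrapper is assumed to have been built as in the earlier snippet, and run_config is a hypothetical training configuration.

```python
# A hedged sketch of end-to-end training with WarpDrive + PyTorch Lightning.
# WarpDriveModule (a LightningModule that wraps GPU roll-out generation and
# the policy models) and its arguments follow the tutorial notebook, but are
# assumptions here; refer to the starter notebook for the exact API.
from pytorch_lightning import Trainer
from warp_drive.training.pytorch_lightning import WarpDriveModule

# env_wrapper is the GPU environment built earlier; run_config is a
# hypothetical dict of training settings (policy architecture, learning
# rates, episode length, etc.) as used in the tutorials.
wd_module = WarpDriveModule(
    env_wrapper=env_wrapper,
    config=run_config,
    verbose=True,
)

# The Lightning Trainer owns the loop: backprop, optimizer steps, gradient
# clipping, logging, and any user-supplied callbacks.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=2,
    log_every_n_steps=1,
)
trainer.fit(wd_module)
```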

WarpDrive Deep Dive: Software Architecture

A typical RL workflow in WarpDrive involves only a one-time data copy from the CPU to the GPU at the beginning. All the data arrays reside on the GPU memory and are modified in place during training. Within the GPU, we parallelize generating the simulation roll-outs across multiple agents and environments on dedicated GPU threads and blocks. Once the roll-outs are gathered into a training batch (again, on the GPU memory), we use a PyTorch Lightning-based training loop.

Below is a visual representation of the software modules and their relationships in the PyTorch Lightning and WarpDrive integration:

At a high level, there are five layers:

  • At its core (the CUDA C service layer), WarpDrive relies on CUDA kernels to effectively parallelize the simulation roll-outs across the agents on separate GPU threads. This requires the env reset and step functions to be written in CUDA C.
  • The API layer exposes two Pythonic classes - the data manager and the function manager. The data manager APIs facilitate all the CPU-to-GPU communication, such as pushing the environment configuration parameters and the observation and reward array placeholders. The function manager provides API methods to initialize and invoke, from the CPU, the CUDA C kernel functions required for stepping the environment, generating observations, and computing rewards (see the data manager sketch after this list).
  • WarpDrive provides two Pythonic classes at the Python service layer to manage the corresponding CUDA kernels - the EnvironmentReset for automatically resetting any completed environments, and the Sampler for sampling the actions used to step through the environment.
  • The application layer provides an env wrapper class to orchestrate the environment reset and step functionalities from the CPU. Users can also build custom PyTorch policy models that will be used when sampling the actions for the environment.
  • The PyTorch Lightning layer leverages the capabilities of PyTorch Lightning to organize the overall training workflow. At a high level, the training pipeline is modularized into a data generation piece (handled by a PyTorch DataLoader) and a training piece (powered by the PyTorch Lightning trainer).
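As a sketch of how the data manager supports the one-time CPU-to-GPU copy, consider the snippet below. Class, method, and argument names follow the WarpDrive tutorials but should be treated as assumptions, and the "agent_positions" array is purely hypothetical.

```python
# A hedged sketch of the one-time CPU-to-GPU data copy via the data manager.
# Names follow the WarpDrive tutorials but are assumptions here; see the
# repository for the exact API.
import numpy as np
from warp_drive.managers.data_manager import CUDADataManager
from warp_drive.utils.data_feed import DataFeed

data_manager = CUDADataManager(num_agents=5, num_envs=2, episode_length=100)

# Register host-side arrays (here, a hypothetical "agent_positions" array of
# shape [num_envs, num_agents, 2]) to be pushed to GPU memory once.
data_feed = DataFeed()
data_feed.add_data(
    name="agent_positions",
    data=np.zeros((2, 5, 2), dtype=np.float32),
    save_copy_and_apply_at_reset=True,  # restore this array whenever an env resets
)
data_manager.push_data_to_device(data_feed)  # the one-time CPU-to-GPU copy

# The CUDA step/reset kernels then modify these arrays in place on the GPU;
# they can be pulled back to the host for inspection if needed.
positions = data_manager.pull_data_from_device("agent_positions")
```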

For more details, please see our white paper.

Performance Benchmarks

We benchmark WarpDrive on the continuous Tag environment and compare performance with and without the PyTorch Lightning integration. Our benchmarking experiments ran on an A100 GPU.

Below, we plot the training throughput (in environment steps per second) versus the number of environments (Left) and the number of agents (Right).

WarpDrive achieves nearly perfect parallelism. With the number of agents held constant, training throughput in continuous Tag scales linearly as the number of environments grows into the thousands, demonstrating almost perfect parallelism over environments. Likewise, with a fixed number of environments, throughput stays almost constant even as we scale up the number of agents, demonstrating parallelism over agents.

Also, integrating WarpDrive with PyTorch Lightning had minimal impact on training throughput (less than 10% overhead) across the board.

In summary, PyTorch Lightning provides numerous added benefits (as mentioned above) without sacrificing performance.


The Bottom Line

  • Building engineering systems for efficient multi-agent RL training has proved to be a challenge. Traditional frameworks for this task use CPUs for running simulation roll-outs and GPUs for training, which can be slow and inefficient; experiments can take days or even weeks.
  • WarpDrive is a flexible, lightweight, and easy-to-use end-to-end RL framework that enables orders-of-magnitude faster training on a single GPU.
  • PyTorch Lightning helps modularize your experimental code and quickly build production-ready workloads. This pairs well with WarpDrive, which lets you build fully GPU-based RL workflows.
  • Used together, WarpDrive and PyTorch Lightning can help significantly accelerate multi-agent RL research and development.


Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.

To learn more details about our work with WarpDrive, please read our white paper.

To dive in and get started with your own projects, please check out our starter notebook, where we perform end-to-end training of a multi-agent (Tag) environment via just a few lines of code.

Check out the WarpDrive tutorials by cloning our GitHub repo.

If you’re interested in contributing, send a pull request on our GitHub repository or join us on Slack.

We can’t wait to see what you build using WarpDrive and welcome your contributions!

About the Authors

Sunil Srinivasa is a Research Engineer at Salesforce Research, leading the engineering efforts on the AI Economist team. He is broadly interested in machine learning, with a focus on deep reinforcement learning. He is currently working on building and scaling multi-agent reinforcement learning systems. Previously, he spent over a decade in industry in data science and applied machine learning roles. Sunil holds a Ph.D. in Electrical Engineering (2011) from the University of Notre Dame.

Tian Lan is a Senior Research Scientist at Salesforce Research, working on both the AI Operations and AI Economist teams. For AI Operations, he focuses on multivariate time series forecasting models and bringing them to production. For the AI Economist, his main focus is on building and scaling multi-agent reinforcement learning systems. He has extensive experience building large-scale, massively parallel computational simulation platforms for academia, automated trading, and the high-tech industry. Tian holds a Ph.D. in Applied Physics with a minor in Electrical Engineering (2014) from Caltech.

Huan Wang is a director at Salesforce Research. He holds a Ph.D. in Computer Science from Yale. He received the best paper award at the Conference on Learning Theory (COLT) 2012. At Salesforce, he works on deep learning theory, reinforcement learning, time series analytics, and operational and data intelligence. Previously, he was a senior applied scientist at Microsoft AI Research, a research scientist at Yahoo, an adjunct professor at the NYU engineering school teaching machine learning, and an adjunct professor at Baruch College teaching algorithm design.

Stephan Zheng (www.stephanzheng.com) is a Lead Research Scientist and heads the AI Economist team at Salesforce Research. He works on using deep reinforcement learning and economic simulations to design economic policy – media coverage includes the Financial Times, Axios, Forbes, Zeit, Volkskrant, MIT Tech Review, and others. He holds a Ph.D. in Physics from Caltech (2018) and interned with Google Research and Google Brain. Before machine learning, he studied mathematics and theoretical physics at the University of Cambridge, Harvard University, and Utrecht University. He received the Lorenz Graduation Prize from the Royal Netherlands Academy of Arts and Sciences for his thesis on exotic dualities in topological string theory and was twice awarded the Dutch Huygens Scholarship.

Donald Rose is a Technical Writer at Salesforce AI Research. Specializing in content creation and editing, Dr. Rose works on multiple projects, including blog posts, video scripts, news articles, media/PR material, social media, writing workshops, and more. He also helps researchers transform their work into publications geared towards a wider audience.