WarpDrive: Extremely Fast Reinforcement Learning on an NVIDIA GPU


tldr: WarpDrive is an open-source framework for end-to-end multi-agent RL on a single GPU. It achieves orders-of-magnitude faster multi-agent RL training with 2000 environments and 1000 agents in a simple Tag environment. WarpDrive provides lightweight tools and workflow objects to build your own fast RL workflows. Check out the code and the white paper for more details!

The name WarpDrive is inspired by the warp drive, a fictional superluminal spacecraft propulsion system from science fiction. Moreover, at the time of writing, a “warp” is a group of 32 threads that execute at the same time on (certain) GPUs.

The Challenge of Multi-Agent RL

Multi-agent systems, particularly those with multiple interacting AI agents, are a frontier for AI research and applications. They are key to solving engineering and scientific challenges in economics, self-driving cars, robotics, and many other fields. Deep reinforcement learning (RL) is a powerful learning framework to train AI agents. Deep RL has been used to master StarCraft [1], train robotic arms [2], and effectively recommend economic policies [3,4].

However, multi-agent deep RL (MADRL) experiments can take days or even weeks, especially when a large number of agents are trained. MADRL requires repeatedly running multi-agent simulations and training agent models. This takes a lot of time because MADRL implementations often combine CPU-based simulations with GPU-based deep learning models; for example, the Foundation economic simulation framework [5] follows this pattern.

This introduces many performance bottlenecks. For instance, CPUs do not parallelize computations well across agents and across environments, and data transfers between CPU and GPU are inefficient.

To accelerate MADRL research and engineering, we built WarpDrive, an open-source framework for extremely fast MADRL. WarpDrive runs MADRL entirely on a GPU and thus achieves orders-of-magnitude faster training. In the animation below, you can see an example of agents trained with WarpDrive.

Keep reading to find out how it works!

Tag: 5 taggers chase 100 runners. Agents receive partial observations of their environment. Runners can reach twice the maximal speed of the taggers. All agents were trained using WarpDrive.

How Does WarpDrive Work?

WarpDrive provides a lightweight Python API to define and access low-level GPU data structures, such as PyTorch tensors and primitive GPU arrays. It builds on PyCUDA to let you easily communicate between Python, PyTorch, and CUDA C.

This lets you run both the simulations and the agents on the GPU. The RL agents and their models are trained on the GPU while directly accessing and interacting with the simulations, which run concurrently on the same GPU, through shared PyTorch tensors.

WarpDrive currently runs each environment in a CUDA block. Each environment uses the same simulation logic. Within each block, each CUDA thread simulates one agent. Agents can use the same or different neural network models. Interactions between agents can be modeled through the block's shared memory, which all threads in the block can access.
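
To make this concrete, here is a minimal, self-contained sketch of the pattern (not WarpDrive's actual code). It assumes a CUDA-capable GPU, PyTorch, and a recent PyCUDA (for pycuda.autoprimaryctx); one CUDA block simulates one toy environment, one thread simulates one agent, and the kernel reads and writes PyTorch tensors in place.

```python
import torch
import pycuda.autoprimaryctx  # reuse the primary CUDA context that PyTorch uses
import pycuda.driver as drv
from pycuda.compiler import SourceModule

NUM_ENVS, NUM_AGENTS = 4, 8  # illustrative sizes

# Toy "step" kernel: blockIdx.x indexes the environment, threadIdx.x the agent.
# Agents within one environment interact through the block's shared memory.
toy_step = SourceModule(r"""
__global__ void toy_step(float *positions, float *rewards)
{
    const int env_id = blockIdx.x;
    const int agent_id = threadIdx.x;
    const int num_agents = blockDim.x;
    const int idx = env_id * num_agents + agent_id;

    extern __shared__ float shared_pos[];   // per-environment scratch space
    shared_pos[agent_id] = positions[idx];
    __syncthreads();

    // Toy interaction: reward is the negative distance to agent 0.
    rewards[idx] = -fabsf(shared_pos[agent_id] - shared_pos[0]);
    positions[idx] += 0.01f;                // toy dynamics, updated in place
}
""").get_function("toy_step")


class TorchPtr(drv.PointerHolderBase):
    """Expose a PyTorch GPU tensor's device pointer to PyCUDA (zero copy)."""

    def __init__(self, tensor):
        super().__init__()
        self.tensor = tensor
        self.gpudata = tensor.data_ptr()

    def get_pointer(self):
        return self.tensor.data_ptr()


# The same GPU tensors are used by the CUDA step and by PyTorch training code.
positions = torch.rand(NUM_ENVS, NUM_AGENTS, device="cuda")
rewards = torch.zeros(NUM_ENVS, NUM_AGENTS, device="cuda")

toy_step(
    TorchPtr(positions), TorchPtr(rewards),
    grid=(NUM_ENVS, 1), block=(NUM_AGENTS, 1, 1),
    shared=NUM_AGENTS * 4,  # bytes of shared memory per block
)
torch.cuda.synchronize()
print(rewards)
```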

This approach is extremely fast:

  1. It can simulate thousands of agents per environment and thousands of environments in parallel, exploiting the massive parallelism of GPUs.
  2. It eliminates back-and-forth communication between the CPU and the GPU.
  3. It eliminates copying of data within the GPU: PyTorch and the CUDA C simulations read and write the same data structures.
  4. There is only a tiny, constant data transfer cost between the CPU and GPU that does not scale with the number of agents or the number of iterations.
  5. It is fully compatible with PyTorch, a highly flexible and very fast deep learning framework.
  6. It implements parallel action sampling in CUDA C, which is ~3x faster than using PyTorch's sampling methods (see the sketch after this list).
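
For reference, the fully batched PyTorch-based sampling that such a CUDA sampler replaces might look like the following sketch (the sizes are illustrative): every agent in every environment gets an action from its policy's categorical distribution in a single call.

```python
import torch

num_envs, num_agents, num_actions = 2000, 5, 4  # illustrative sizes

# Policy logits for every agent in every environment, already on the GPU.
logits = torch.randn(num_envs, num_agents, num_actions, device="cuda")

# One batched call samples an action for every agent in every environment.
actions = torch.distributions.Categorical(logits=logits).sample()
print(actions.shape)  # torch.Size([2000, 5])
```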

WarpDrive also enables you to quickly develop new MADRL projects. For your own simulation, you need to implement the simulation step function in CUDA C.

WarpDrive makes the rest of the MADRL workflow easy and fast. It automatically builds agent-parallel and environment-parallel rollout methods (step, reset, etc.), fast action samplers, and data loggers. We also provide tools to check consistency between the CPU and GPU implementations of your environment's step function, as illustrated below.
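
As an illustration of that consistency check, the sketch below runs the same transition through a CPU reference step and a GPU step and compares the results. The function and the cpu_step / cuda_step hooks are hypothetical placeholders, not WarpDrive's API.

```python
import numpy as np
import torch


def check_step_consistency(cpu_step, cuda_step, state, actions, atol=1e-5):
    """Compare a NumPy reference step against a GPU-backed step.

    cpu_step and cuda_step are placeholders for your own environment's CPU
    and CUDA implementations; both are assumed to return (next_state, rewards).
    """
    cpu_next, cpu_rew = cpu_step(state.copy(), actions)
    gpu_next, gpu_rew = cuda_step(
        torch.as_tensor(state, device="cuda"),
        torch.as_tensor(actions, device="cuda"),
    )
    assert np.allclose(cpu_next, gpu_next.cpu().numpy(), atol=atol)
    assert np.allclose(cpu_rew, gpu_rew.cpu().numpy(), atol=atol)
```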

Example: Learning to Play Tag with 1000s of Agents in 1000s of Environments

In Tag, N taggers work together to catch M runners. Runners are tagged once a tagger gets close enough. Each agent learns how to optimally accelerate (or brake) and turn left or right. Each simulation episode ends after 500 time steps, or when all runners have been tagged. At the end of an episode, the percentage of runners that were tagged defines how successful the taggers have been.

We can use RL to optimize the policy of the taggers and runners. Taggers are rewarded +1 for each successful tag, so they're incentivized to tag the runners. Once a runner is tagged, it receives a penalty of -1 and leaves the game. Therefore, runners learn to avoid being tagged.

Tag quickly becomes a complicated decision-making problem as more and more taggers and runners participate. RL agents may learn cooperative strategies; for instance, taggers might learn to encircle runners.

Below you can see Tag on a discrete grid, with one runner trying to escape from several taggers. At each time step, each agent either stays in place or moves up, down, left, or right by one cell. The reward structure is similar to that of the continuous version of Tag.
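
To make the dynamics concrete, here is a simplified single-step sketch of the discrete-grid variant in NumPy. It is not WarpDrive's kernel, and the grid size and action encoding are illustrative; it applies the moves, checks for tags, and assigns the +1/-1 rewards described above.

```python
import numpy as np

GRID_SIZE = 20  # illustrative grid dimensions
# Action encoding (also illustrative): stay, up, down, left, right.
MOVES = np.array([[0, 0], [-1, 0], [1, 0], [0, -1], [0, 1]])


def discrete_tag_step(tagger_pos, runner_pos, tagger_act, runner_act, active):
    """One simplified step of grid Tag.

    tagger_pos: (T, 2) ints, runner_pos: (R, 2) ints, *_act: action indices
    into MOVES, active: (R,) bool mask of runners still in the game.
    """
    tagger_pos = np.clip(tagger_pos + MOVES[tagger_act], 0, GRID_SIZE - 1)
    runner_pos = np.clip(runner_pos + MOVES[runner_act], 0, GRID_SIZE - 1)

    tagger_rew = np.zeros(len(tagger_pos))
    runner_rew = np.zeros(len(runner_pos))

    # A runner is tagged when a tagger lands on its cell.
    for t, tp in enumerate(tagger_pos):
        hit = active & np.all(runner_pos == tp, axis=1)
        tagger_rew[t] += hit.sum()   # +1 per successful tag
        runner_rew[hit] -= 1.0       # tagged runners receive -1 ...
        active = active & ~hit       # ... and leave the game

    return tagger_pos, runner_pos, tagger_rew, runner_rew, active
```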

Tag on a discrete grid with 1 runner and many taggers. All agents have been trained using RL and WarpDrive.

Benchmarks

We benchmarked WarpDrive on the Tag environment, comparing a setup that combines CPU simulations with GPU agent models against running everything on a single NVIDIA Tesla V100 GPU.

Training Iteration Speed

In discrete Tag, with 2000 environments of 5 agents each, WarpDrive runs at 9.8 million environment steps per second, and at 2.9 million environment steps per second with 2000 environments and 1000 agents. At 2000 environments, the batched training data saturates the GPU's memory. WarpDrive samples 18 million actions per second, independent of the number of agents (versus 5 million actions per second with PyTorch). With 2000 environments and 5 agents, WarpDrive handles up to 1.3 million end-to-end RL training iterations per second on a single GPU.

In continuous Tag, with 2000 environments of 5 agents each, WarpDrive runs at 8.3 million environment steps per second, and at 0.18 million environment steps per second with 2000 environments and 105 agents. Again, the batched training data saturates the GPU's memory at 2000 environments. WarpDrive samples 16 million actions per action category per second, independent of the number of agents. With 2000 environments and 5 agents, WarpDrive handles up to 0.58 million end-to-end RL training iterations per second on a single GPU, using two neural network policy models (one for taggers and one for runners).

Nearly Perfect Parallelism

WarpDrive achieves nearly perfect parallelism.

The performance of WarpDrive in discrete Tag scales linearly as the number of environments grows into the thousands, keeping the number of agents constant: almost perfect parallelism over environments.

WarpDrive performance in the discrete version of Tag for increasing numbers of parallel environments.

In the figure below, you can compare performance as the number of agents grows. WarpDrive achieves a more than 50x speedup over a NumPy implementation on a single CPU, for up to 1000 agents.

WarpDrive performance in the discrete version of Tag for increasing numbers of agents.

We evaluate agents using both partial and full observations. With partial observations, agents can only see nearby agents of the other type. With full observations, agents can see all other agents. Note that with partial observations, the step function performs much better than O((number of agents)^2).
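
As a rough illustration of the observation structure only (not of WarpDrive's step function), a partial observation can be modeled as each tagger seeing just its k nearest runners. This naive NumPy reference still builds the full distance matrix; the takeaway is simply that each agent's observation keeps a fixed size, no matter how many agents are in the environment.

```python
import numpy as np


def partial_obs_for_taggers(tagger_pos, runner_pos, k=5):
    """Each tagger observes only its k nearest runners (illustrative).

    Naive reference: builds the full (T, R) distance matrix; an efficient
    step function would avoid this, but the observation per agent is
    always of size k, independent of the total number of agents.
    """
    diffs = tagger_pos[:, None, :] - runner_pos[None, :, :]   # (T, R, 2)
    dists = np.linalg.norm(diffs, axis=-1)                    # (T, R)
    nearest = np.argsort(dists, axis=1)[:, :k]                # k closest runners
    return runner_pos[nearest]                                # (T, k, 2)
```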

Similar speedups hold for the continuous version of Tag.

WarpDrive performance in the continuous version of Tag.

Key Takeaways and Next Steps

WarpDrive enables you to run and train many RL environments and agents in parallel on a GPU, improving MADRL training speed by orders of magnitude. The tools provided by WarpDrive can significantly accelerate MADRL research and development.

Following this release, we intend to continue developing WarpDrive:

  • We want to make simulation development in CUDA C easier and faster.
  • We will explore ways to increase training speed even further.
  • We want to develop safe memory management primitives.
  • We want to explore integrating with other deep learning, RL, and GPU development tools.

If you're interested in contributing, send a pull request on our GitHub repository or join us on Slack. We can't wait to see what you build with WarpDrive and welcome your contributions!


About the Authors

Tian Lan is a Senior Research Scientist at Salesforce Research, working on both the AI Operations team and the AI Economist team. For AI Operations, he focuses on multivariate time-series forecasting models and their production. For the AI Economist, his main focus is building and scaling multi-agent reinforcement learning systems. He has extensive experience building large-scale, massively parallel computational simulation platforms for academia, automated trading, and the high-tech industry. Tian holds a Ph.D. major in Applied Physics and a Ph.D. minor in Electrical Engineering (2014) from Caltech.

Sunil Srinivasa is a Research Engineer at Salesforce Research, leading the engineering efforts on the AI Economist team. He is broadly interested in machine learning, with a focus on deep reinforcement learning, and is currently working on building and scaling multi-agent reinforcement learning systems. Previously, he spent over a decade in industry in data science and applied machine learning roles. Sunil holds a Ph.D. in Electrical Engineering (2011) from the University of Notre Dame.

Stephan Zheng – www.stephanzheng.com – is a Lead Research Scientist and heads the AI Economist team at Salesforce Research. He works on using deep reinforcement learning and economic simulations to design economic policy – media coverage includes the Financial Times, Axios, Forbes, Zeit, Volkskrant, MIT Tech Review, and others. He holds a Ph.D. in Physics from Caltech (2018) and interned with Google Research and Google Brain. Before machine learning, he studied mathematics and theoretical physics at the University of Cambridge, Harvard University, and Utrecht University. He received the Lorenz graduation prize from the Royal Netherlands Academy of Arts and Sciences for his thesis on exotic dualities in topological string theory and received the Dutch Huygens scholarship twice.

References

  1. Grandmaster level in StarCraft II using multi-agent reinforcement learning.
  2. Deep reinforcement learning will transform manufacturing as we know it.
  3. The AI Economist: Optimal Economic Policy Design via Two-level Deep Reinforcement Learning. Stephan Zheng, Alexander Trott, Sunil Srinivasa, David C. Parkes, Richard Socher.
  4. Building a Foundation for Data-Driven, Interpretable, and Robust Policy Design using the AI Economist. Alexander Trott, Sunil Srinivasa, Douwe van der Wal, Sebastien Haneuse, and Stephan Zheng.
  5. Foundation: An Economic Simulation Framework.

Acknowledgments

We thank Denise Perez and Huan Wang for their support.