Benchmarking Triton (TensorRT) Inference Server for Transformer Models

20 Apr 2020 • #engineering
Summary: We investigate NVIDIA's Triton (TensorRT) Inference Server (https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html) as a way of hosting Transformer language models. The post is roughly divided into two parts: (i) instructions for setting up your own inference server, and (ii) benchmarking experiments.
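As a taste of part (ii), below is a minimal sketch of the kind of measurement involved: timing round-trip latency against a running Triton server with the tritonclient Python package (`pip install tritonclient[http]`). This is not the benchmark harness used in the experiments. The model name "transformer" and the tensor names "input_ids"/"logits" are placeholders; substitute whatever your model's config.pbtxt declares. Note also that releases contemporary with this post shipped the client under a different package name.

```python
import time

import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_server_live()

# A fake batch of token IDs; shape and dtype must match the model's
# config.pbtxt. "input_ids" and "logits" are placeholder tensor names.
batch = np.random.randint(0, 30000, size=(1, 128)).astype(np.int32)

inputs = [httpclient.InferInput("input_ids", list(batch.shape), "INT32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("logits")]

# Time 100 sequential requests and report the median latency.
latencies = []
for _ in range(100):
    start = time.perf_counter()
    client.infer("transformer", inputs=inputs, outputs=outputs)
    latencies.append(time.perf_counter() - start)

print(f"median latency: {1000 * np.median(latencies):.1f} ms")
```

Sequential single-stream requests like this capture per-request latency; throughput benchmarks would instead issue concurrent requests, which is the dimension along which Triton's dynamic batching pays off.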