Benchmarking Triton (TensorRT) Inference Server for Transformer Models

Summary: We investigate NVIDIA's Triton (TensorRT) Inference Server as a way of hosting Transformer language models. The post is roughly divided into two parts: (i) instructions for setting up your own inference server, and (ii) benchmarking experiments. The instructions are intended to be detailed and standalone, but readers interested solely in…

20 Apr 2020 • #engineering
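
To make the setup concrete, here is a minimal client-side sketch, not the post's exact configuration: it assumes a Triton server is already running locally on the default HTTP port 8000 and serves a hypothetical model named "transformer" with an input tensor "input_ids" and an output tensor "logits" (all of these names are illustrative). It uses the tritonclient package, the current name for the client library, which postdates the server's rename from TensorRT Inference Server to Triton.

```python
# Minimal Triton HTTP client sketch; model and tensor names are hypothetical.
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_server_live()

# Build a toy batch of token ids; shape and dtype must match the model config.
token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)

infer_input = httpclient.InferInput("input_ids", token_ids.shape, "INT64")
infer_input.set_data_from_numpy(token_ids)

# Send the request and read back the requested output tensor.
result = client.infer(
    model_name="transformer",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("logits")],
)
print(result.as_numpy("logits").shape)
```

For the benchmarking side, the same request can be wrapped in a timing loop, or driven at varying concurrency with Triton's bundled perf_analyzer tool (perf_client in older releases).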

Weighted Transformer Network for Machine Translation

Most neural architectures for machine translation use an encoder-decoder model consisting of either convolutional or recurrent layers. The encoder layers map the input to a latent space, and the decoder, in turn, maps this latent representation to the targets.

08 Nov 2017 • #research
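
As an aside, the encoder-decoder pattern described above can be sketched in a few lines of PyTorch. This is a generic skeleton with recurrent stand-ins for the encoder and decoder layers, not the Weighted Transformer itself; all dimensions and names are illustrative.

```python
# Generic encoder-decoder skeleton for illustration; not the Weighted Transformer.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Stand-ins for the convolutional, recurrent, or attention layers.
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: map the source tokens to a latent representation.
        memory, state = self.encoder(self.src_embed(src))
        # Decoder: condition on that latent state to predict the targets.
        out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.generator(out)

model = EncoderDecoder(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000]): one vocab distribution per target position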