Benchmarking Triton (TensorRT) Inference Server for Transformer Models

Summary: We investigate NVIDIA's Triton (TensorRT) Inference Server (https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html) as a way of hosting Transformer language models. The post is roughly divided into two parts: (i) instructions for setting up your own inference server, and (ii) benchmarking experiments. … (A minimal client sketch follows below.)

20 Apr 2020 • #engineering
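
For flavor, here is a minimal sketch of querying a running Triton server with its Python HTTP client. This is an illustrative sketch, not code from the post: the server address, model name (bert_base), tensor names (input_ids, logits), and shapes are all assumptions.

```python
# Minimal Triton HTTP client sketch. Assumes a server on localhost:8000
# serving a model named "bert_base" with an INT64 input "input_ids" of
# shape [1, 128] and an output "logits" (all illustrative assumptions).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy token IDs standing in for a tokenized sentence.
input_ids = np.zeros((1, 128), dtype=np.int64)

infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
infer_input.set_data_from_numpy(input_ids)

result = client.infer(
    model_name="bert_base",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("logits")],
)
print(result.as_numpy("logits").shape)
```

Timing a loop of such calls is one simple way to reproduce the kind of latency benchmarking the post describes.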

Weighted Transformer Network for Machine Translation

Most neural architectures for machine translation use an encoder-decoder model built from either convolutional or recurrent layers. The encoder layers map the input to a latent space, and the decoder, in turn, maps this latent representation to the target sequence. (See the sketch below.)

08 Nov 2017 • #research
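
To make the encoder-decoder pattern concrete, here is a minimal recurrent sequence-to-sequence sketch in PyTorch. It illustrates the general pattern only, not the Weighted Transformer itself; all layer choices and sizes are assumptions.

```python
# A toy encoder-decoder: the encoder compresses the source sentence into
# a latent state, and the decoder maps that state to target-side logits.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder maps the input tokens to a latent representation.
        _, latent = self.encoder(self.src_emb(src))
        # Decoder consumes the latent state and emits target logits.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), latent)
        return self.out(dec_out)

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.zeros(2, 10, dtype=torch.long),
               torch.zeros(2, 12, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 12, 8000])
```

The Transformer family keeps this same encoder-decoder split but replaces the recurrent layers with attention.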