TL;DR: LAVIS (short for LAnguage-VISion) is an open-source deep learning library for language-vision research and applications, offering comprehensive support for a wide range of tasks, datasets, and state-of-the-art models. Featuring a unified interface and modular design, it’s easy to use off-the-shelf and to extend with new capabilities. With its powerful features and integrated framework, LAVIS helps make AI language-vision capabilities accessible to a wide audience of researchers and practitioners.
Multimodal content — in particular, language-vision data including text, images, and videos — is ubiquitous in real-world applications such as content recommendation, e-commerce, and entertainment. Each piece of language-vision data contains both textual and visual information — two modes, hence multimodal — which enables some specific language-vision applications to be developed, including generating textual descriptions for images, searching images using language queries, and classifying multimodal content such as product items described by text and images.
Recently, tremendous progress has been made in developing powerful language-vision models, especially language-vision foundation models. These are deep learning models pre-trained on large-scale image-text and video-text pairs (mostly collected from the internet), which can be flexibly transferred to a wide range of downstream tasks and applications, achieving good performance with a minimum amount of finetuning.
While language-vision foundation models have achieved impressive results, they do have some limitations. For example, due to the diversity and complexity of language-vision tasks, training and evaluating these models are not straightforward. The experiment pipeline is a burdensome procedure that involves manual downloading of pre-trained models and task-specific datasets, careful coding to perform model training and evaluation, and miscellaneous tasks such as checkpointing and logging. It is demanding for incoming researchers and practitioners to carry out each step flawlessly. The main reason for such obstacles is the inconsistent interfaces across models, datasets, and task evaluations, as well as the non-trivial effort to prepare the required experiment setup.
Another limitation: most existing language-vision libraries support limited tasks, datasets, and/or obsolete models. For example, MMF (a MultiModal Framework for multimodal AI models) supports mostly task-specific finetuning models with inferior performance; X-modaler (a codebase for cross-modal analytics — image captioning, video captioning, and vision-language pre-training) supports much fewer tasks and datasets, with limited support for foundation models. Others’ ongoing efforts, including TorchMultimodal and UniLM, are largely under development with limited capabilities available.
In addition, the design of these libraries does not facilitate easy access to datasets and models for off-the-shelf use. This creates extra barriers for users who want to take advantage of modeling capabilities and resources.
Finally, most of these libraries do not provide fine-tuned model checkpoints or extensive benchmarking results. This results in additional effort to replicate model performance.
To make emerging language-vision intelligence and capabilities accessible to a wider audience, promote their practical adoption, and reduce repetitive efforts in future development, we built LAVIS (short for LAnguage-VISion), an open-source library which provides a unified interface for:
The goals of LAVIS include:
As Table 1 shows, LAVIS is the most comprehensive language-vision library available today, and we are working to continually improve it. Coming soon: stronger language-vision models, and new capabilities such as text-to-image generation.
Table 1: A head-to-head comparison of LAVIS and existing language-vision libraries/codebases. None of the others match LAVIS’s breadth of features and application areas. Note: language-vision models in UniLM and TorchMultimodal (alpha release) are under development, so the table only includes their supported features by the publication time of this blog post.
Now let’s explore the main features of LAVIS in greater detail.
LAVIS supports a growing list of more than 10 common language-vision tasks, across over 20 public datasets. These tasks and datasets provide a comprehensive and unified benchmark for evaluating language-vision models. In particular, we prioritize tasks that are standard and widely adopted for evaluation, with publicly available datasets. These include:
The LAVIS library enables access to over 30 pre-trained and task-specific finetuned model checkpoints of four popular foundation models: ALBEF, BLIP, CLIP, and ALPRO. These models achieve competitive performance across multiple tasks evaluated using common metrics. We also provide training, evaluation scripts, and configurations to facilitate reproducible language-vision research and adoption.
Table 2: Supported tasks, models, and datasets in LAVIS.
|Image-text Pre-training||ALBEF, BLIP||COCO, Visual Genome, SBU Caption, Conceptual Captions (3M, 12M), LAION|
|Image-text Retrieval||ALBEF, BLIP. CLIP||COCO, Flickr30k|
|Visual Quesiton Answering||ALBEF, BLIP||VQAv2, OKVQA, A-OKVQA|
|Image Captioning||BLIP||COCO Caption, NoCaps|
|Natural Language Visual Reasoning (NLVR2)||ALBEF, BLIP||NLVR2|
|Video-to-text Retrieval||ALPRO, BLIP||MSRVTT, DiDeMo|
|Video Question Answering||ALPRO, BLIP||MSRVTT-QA, MSVD-QA|
The figure below shows the overall architecture of LAVIS. Our key design principle is to provide a simple and unified library to easily (i) train and evaluate the model; (ii) access supported models and datasets; (iii) extend with new models, tasks and datasets.
Key components in the library are organized using a modular design. This allows off-the-shelf access to individual components, swift development, and easy integration of new or external components. The modular design also eases model inferences, such as multimodal feature extraction.
In addition to the core library functionalities, we also provide useful resources to further reduce learning barriers for language-vision research. These include automatic dataset downloading tools to help prepare the supported datasets, a GUI dataset browser to help preview downloaded datasets, and dataset cards documenting sources, supported tasks, common metrics, and leaderboards.
Pre-trained and finetuned model checkpoints. We include pre-trained and finetuned model checkpoints in the library. This promotes easy replication of our experimental results and repurposing of pre-trained models for other applications. Model checkpoints are downloaded automatically upon loading models.
Web Demo: As shown in the Figure below, we developed a GUI-based web demo with a user-friendly interface to explore various multimodal capabilities. Currently the demo supports the following functionalities:
Automatic dataset downloading and browsing: preparing language-vision datasets for pre-training and finetuning incurs much duplicating effort. To address this, LAVIS provides tools to automatically download and organize the public datasets, so that users can get easier and quicker access to common datasets. In addition, we also developed a GUI dataset browser, as shown in the Figure below, that helps users to rapidly gain intuitions about the data they use.
We feel confident that the overwhelming impact of LAVIS will be positive.
However, it is possible that misuse of LAVIS may result in unwanted impacts.
Dongxu Li is a Research Scientist at Salesforce Research. His research focuses on multimodal understanding and its applications.
Junnan Li is a Senior Research Manager at Salesforce Research. His current research focuses on vision and language AI. His ultimate research goal is to build generic AI models that can self-learn, without human supervision.
Steven C.H. Hoi is Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.
Donald Rose is a Technical Writer at Salesforce Research. Specializing in content creation and editing, he works on multiple projects including blog posts, video scripts, news articles, media/PR material, writing workshops, and more. His passions include helping researchers transform their work into publications geared towards a wider audience, and writing think pieces about AI.
Foundation models: Deep learning models pre-trained on very large amounts of data, which can perform well when transferred to a wide range of different tasks with minimum amounts of finetuning.
ViT: Vision Transformer. A visual model (based on the transformer architecture, which has proven popular for text-based tasks). A ViT model represents an input image as a sequence of fixed-size, non-overlapping image patches, analogous to the word embeddings used when applying transformers to text.