BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation


TL;DR: BLIP is a new pre-training framework for unified vision-language understanding and generation, which achieves state-of-the-art results on a wide range of vision-language tasks.


Background

For a review of some terms and definitions used in this blog, see our Appendix.

Vision and language, two of the most fundamental methods for humans to perceive the world, are also two key cornerstones of AI. A longstanding goal of AI has been to build intelligent agents that can understand the world through vision and language inputs, and communicate with humans through natural language.

In order to achieve this goal, vision-language pre-training has emerged as an effective approach, where deep neural network models are pre-trained on large-scale image-text datasets to improve performance on downstream vision-language tasks, such as image-text retrieval, image captioning, and visual question answering.

In short, vision-language pre-training aims to utilize image-text data to teach a model to jointly comprehend visual and textual information. The pre-trained model is then fine-tuned, that is, trained further on data from the downstream task. Without pre-training, the model must be trained from scratch on each downstream task, which leads to degraded performance.

Limitations: Most Models Lack Flexibility, Web Data is Noisy

Despite the tremendous success of vision-language pre-training, existing methods have two major limitations:

  • From the model perspective, most existing pre-trained models are not flexible enough to adapt to a wide range of vision-language tasks. Encoder-based models do not transfer straightforwardly to text generation tasks, whereas encoder-decoder models have not been successfully adopted for image-text retrieval tasks.

  • From the data perspective, most models pre-train on image and alt-text pairs that are automatically collected from the web. However, the web texts often do not accurately describe the visual content of the images, making them a noisy source of supervision.

Our Solution: Flip the Script with BLIP

To address these limitations, we propose BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation. BLIP introduces:

  • a new model architecture that enables a wider range of downstream tasks than existing methods, and
  • a new dataset bootstrapping method for learning from noisy web data.

BLIP achieves state-of-the-art performance on seven vision-language tasks, including:

  • image-text retrieval
  • image captioning
  • visual question answering
  • visual reasoning
  • visual dialog
  • zero-shot text-video retrieval
  • zero-shot video question answering.

Sample Results

Check out the example image below, where Salesforce CEO Marc Benioff visits Singapore. BLIP recognizes the iconic view of Singapore in the generated caption, and answers a diverse set of questions! (Try our easy-to-use demo for yourself, with your own images, at https://huggingface.co/spaces/Salesforce/BLIP.)


Deep Dive: How BLIP Works

A unified model for vision-language understanding and generation

In order to pre-train a unified vision-language model with both understanding and generation capabilities, BLIP introduces a multimodal mixture of encoder-decoder (MED): a multi-task model which can operate in one of three functionalities:

  1. Unimodal encoders, which separately encode image and text. The image encoder is a vision transformer. The text encoder is the same as BERT. A [CLS] token is prepended to the text input to summarize the sentence.
  2. Image-grounded text encoder, which injects visual information by inserting a cross-attention layer between the self-attention layer and the feed forward network for each transformer block of the text encoder. A task-specific [Encode] token is appended to the text, and the output embedding of [Encode] is used as the multimodal representation of the image-text pair.
  3. Image-grounded text decoder, which replaces the bi-directional self-attention layers in the text encoder with causal self-attention layers. A special [Decode] token is used to signal the beginning of a sequence.
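The key difference between the encoder and decoder functionalities is the self-attention mask applied inside the text transformer: the encoders use bi-directional attention, while the decoder uses a causal mask so each token can only attend to earlier tokens. The following is a minimal NumPy sketch of that distinction (the helper names and dimensions are illustrative, not BLIP's actual code):

```python
import numpy as np

def bidirectional_mask(seq_len):
    # Encoder mode: every token may attend to every other token.
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    # Decoder mode: token i may only attend to tokens 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention with a boolean mask.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)   # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 tokens, hidden dim 8
enc_out = masked_attention(x, x, x, bidirectional_mask(4))
dec_out = masked_attention(x, x, x, causal_mask(4))
# Under the causal mask the first token attends only to itself,
# so its output equals its own value vector.
print(np.allclose(dec_out[0], x[0]))        # True
```

Because the three functionalities share all layers except the attention masks and the cross-attention modules, the text transformer's parameters can be reused across modes during multi-task pre-training.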

BLIP jointly optimizes three objectives during pre-training, with two understanding-based objectives (ITC, ITM) and one generation-based objective (LM):

  • Image-Text Contrastive Loss (ITC) activates the unimodal encoder. It aims to align the feature space of the visual transformer and the text transformer by encouraging positive image-text pairs to have similar representations in contrast to the negative pairs.
  • Image-Text Matching Loss (ITM) activates the image-grounded text encoder. ITM is a binary classification task, where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched) given their multimodal feature.
  • Language Modeling Loss (LM) activates the image-grounded text decoder, which aims to generate textual descriptions conditioned on the images.
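To make the ITC objective concrete, here is a hedged NumPy sketch of a symmetric image-text contrastive (InfoNCE) loss: matched image-text pairs sit on the diagonal of the similarity matrix, and the loss pushes their similarity up relative to all other (negative) pairs in the batch. The function names and temperature value are illustrative, not BLIP's exact implementation:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss.

    Row i of each feature matrix is assumed to be a matched
    image-text pair, so the diagonal holds the positives.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    sim = img @ txt.T / temperature          # (batch, batch) similarities
    labels = np.arange(sim.shape[0])

    def cross_entropy(logits):
        logits = logits - logits.max(axis=-1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(sim.T))

aligned = itc_loss(np.eye(4), np.eye(4))                     # matched pairs
mismatched = itc_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
print(aligned < mismatched)                                  # True
```

Perfectly aligned features yield a near-zero loss, while mismatched pairs are penalized heavily, which is exactly the behavior that aligns the two unimodal encoders' feature spaces.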

For each downstream task, we finetune a different path through the pre-trained model to achieve a different objective, as shown in the animation below.


Bootstrap captions from noisy image-text pairs

Vision-language pre-training relies on large-scale image-text pairs automatically collected from the web. However, these texts often do not accurately describe the visual content of the images, making them a noisy source of supervision.

To address this, we bootstrap the captions by introducing two modules: a captioner and a filter.

  • The captioner is an image-grounded text decoder. Given the web images, we use the captioner to generate synthetic captions as additional training samples.
  • The filter is an image-grounded text encoder. It removes noisy captions which do not match their corresponding images.
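This bootstrapping loop (called CapFilt in our paper) can be sketched in plain Python. The captioner and filter below are toy stand-ins; in BLIP they are the finetuned image-grounded text decoder and encoder, respectively:

```python
def bootstrap_dataset(web_pairs, captioner, filter_fn):
    """Build a cleaner training set from noisy (image, web_text) pairs.

    captioner(image) -> synthetic caption for the image
    filter_fn(image, text) -> True if the text matches the image
    """
    clean_pairs = []
    for image, web_text in web_pairs:
        # Keep the original web caption only if the filter accepts it.
        if filter_fn(image, web_text):
            clean_pairs.append((image, web_text))
        # Generate a synthetic caption and filter it as well.
        synthetic = captioner(image)
        if filter_fn(image, synthetic):
            clean_pairs.append((image, synthetic))
    return clean_pairs

# Toy stand-ins: the "image" is just a set of objects in the picture.
def toy_captioner(image):
    return " ".join(sorted(image))

def toy_filter(image, text):
    return any(word in image for word in text.split())

web_pairs = [({"cake", "frosting"}, "blue sky bakery in sunset park")]
print(bootstrap_dataset(web_pairs, toy_captioner, toy_filter))
# The noisy web caption is dropped; the synthetic caption is kept.
```

The resulting clean pairs, combined with human-annotated data, form the bootstrapped dataset used for a new round of pre-training.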

Dealing With Noisy Data: Examples

A first example of caption bootstrapping is shown in the figure below. The web caption “blue sky bakery in sunset park” is considered to be noisy and removed from training data, whereas the synthetic caption “chocolate cake with cream frosting and chocolate sprinkles on top” is added to the training data.

Here are more examples of the synthetic captions (green) and the noisy web captions (red):


Performance Results

By bootstrapping the dataset, our pre-trained model achieves substantial performance improvements on the downstream tasks, as shown in the table below.

We also found that a stochastic decoding method (nucleus sampling) works better than beam search for generating synthetic captions, because it produces more diverse captions.
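Nucleus sampling truncates the next-token distribution to the smallest set of tokens whose cumulative probability exceeds a threshold p, then samples from that set; unlike beam search, which deterministically maximizes likelihood, this injects diversity into the generated captions. A minimal NumPy sketch (the top_p value follows common practice, not necessarily BLIP's exact setting):

```python
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample one token index from the top-p 'nucleus' of a distribution."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]            # most probable tokens first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability reaches top_p.
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

rng = np.random.default_rng(0)
probs = np.array([0.6, 0.25, 0.1, 0.05])       # peaked 4-token vocabulary
samples = {int(nucleus_sample(probs, top_p=0.9, rng=rng)) for _ in range(200)}
# The lowest-probability token falls outside the 0.9 nucleus,
# so it can never be drawn.
print(3 not in samples)                        # True
```

Each decoding step draws from the renormalized nucleus, so unlikely tokens are excluded entirely while the remaining tokens keep their relative probabilities, trading a small amount of likelihood for much greater caption diversity.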


Below we show the performance of BLIP on image-text retrieval, where it outperforms the previous state-of-the-art, ALBEF, by +2.7% in average recall@1 while using the same number of pre-training images. (Please see our paper for more results on other tasks.)


The Impact

The BLIP research has benefits for AI and beyond:

  • AI benefits: BLIP’s contributions to Artificial Intelligence include:
    • Produces state-of-the-art vision-language pre-trained models for unified image-grounded text understanding and generation tasks
    • Introduces a new framework for learning from noisy web data
      • Deals with noise by:
        • generating synthetic captions as additional training samples
        • removing noisy captions
  • Wider (general) impact: BLIP can enable a wide range of downstream applications with better vision-language intelligence, such as product recommendation and classification in e-commerce platforms.

The Bottom Line

  • Vision-language research is:
    • A core AI problem because vision and language are two fundamental modalities of information in the world
    • An important applied area because many industrial AI applications are powered by vision-language intelligence.
  • Our framework, called BLIP, introduces:
    • A new model architecture that enables a wider range of downstream tasks than existing methods
    • A new dataset bootstrapping method for learning from noisy web data.
  • The BLIP framework makes valuable contributions to deep learning and AI:
    • Produces state-of-the-art vision-language pre-trained models for unified image-grounded text understanding and generation tasks
    • BLIP’s new framework for learning from noisy web data is valuable because web-gathered image descriptions are often inaccurate (i.e., noisy).
    • Offers simple, flexible, and powerful vision-language models that can be finetuned end-to-end
      • We finetune different paths of the pre-trained model to achieve different objectives on different downstream tasks
    • Achieves state-of-the-art performance on image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog.
    • Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
  • BLIP also offers wider benefits:
    • Enables a wide range of downstream applications with better vision-language intelligence, such as product recommendation and classification in e-commerce platforms.
  • Demo examples show BLIP can:
    • Generate accurate and detailed image captions
    • Generate accurate answers for a diverse set of questions.
  • We have released our code, models, and bootstrapped datasets to facilitate vision-language research and industrial applications.

Explore More

ALBEF (ALign BEfore Fuse)

About the Authors

Junnan Li is a Lead Research Scientist at Salesforce Research. His ultimate research goal is to build generic AI models that can self-learn without human supervision.

Steven C.H. Hoi is currently the Managing Director of Salesforce Research Asia at Salesforce and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.

Donald Rose is a Technical Writer at Salesforce AI Research. He works on writing and editing blog posts, video scripts, media/PR material, and other content, as well as helping researchers transform their work into publications geared towards a wider (less technical) audience.

Appendix: Terms and Definitions

A review of some AI / deep learning terms used in our discussion:

  • Image-text dataset: a collection of data where each item is an image-text pair — that is, a combination of one or more images plus one or more pieces of text description
  • Pre-training: the model has been trained before it is adapted (or fine-tuned) on downstream task data
  • Fine-tuning: further training the pre-trained model using data from target tasks
  • End-to-end: all the parameters of the model can be trained jointly
  • Encoder vision-language model: a type of model that encodes image-text data into feature representations, usually used to perform understanding-based tasks
  • Encoder-Decoder vision-language model: a type of model which first encodes image-text into multimodal features and then decodes the features into text
  • Dataset Bootstrapping: a way of generating additional (synthetic) data for the system to use in building its model.