BLIP-2: Scalable Pre-training of Multimodal Foundation Models for the World's First Open-source Multimodal Chatbot


TL;DR: We propose BLIP-2, a scalable multimodal pre-training method that enables any Large Language Model (LLM) to ingest and understand images, unlocks zero-shot image-to-text generation, and powers the world’s first open-sourced multimodal chatbot prototype.

OpenAI just released GPT-4, a powerful new multimodal AI model whose eye-catching capability is accepting image inputs to generate text. However, this capability is not new: it was already demonstrated by our BLIP-2 models and prototype, released on 30 January 2023. Our novel BLIP-2 method enables us to build the world’s first open-sourced multimodal chatbot prototype. Below we discuss the differences between our BLIP-2 model and OpenAI’s GPT-4.

BLIP-2 vs. GPT-4

  • Generic vs. Specific: BLIP-2 is a novel and generic vision-language pre-training methodology that can enable any family of LLMs to understand images and unlocks zero-shot image-to-text generation capabilities. GPT-4 is a specific pre-trained model, and its technical novelty is unclear (not disclosed).
  • Open-source vs. Closed-source (API-only): The code and models of BLIP-2 are open-sourced in the LAVIS library and also integrated into HuggingFace Transformers. GPT-4 is a closed-source model with a paid API service (text-only API as of now).
  • Fast vs. Slow: BLIP-2 runs much faster than GPT-4. The inference time of BLIP-2 is around 1 second per image on a single GPU. In the GPT-4 livestream, multimodal inference took nearly 40 seconds to process one image.
  • Unsupervised learning vs. (presumably) Supervised learning: BLIP-2 is trained on large amounts of noisy image-text pairs automatically crawled from the Internet. Although the learning paradigm of GPT-4 has not been disclosed, it can reasonably be inferred from ChatGPT that GPT-4 may have used large human-annotated datasets.
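
As a rough illustration of the HuggingFace Transformers integration mentioned in the list above, the sketch below asks a question about a local image. It assumes the `transformers`, `torch` and `Pillow` packages are installed and uses the `Salesforce/blip2-opt-2.7b` checkpoint as an example choice. The helper names `build_prompt` and `answer_about_image` are hypothetical, and the "Question: ... Answer:" prompt style follows the Transformers usage examples rather than anything stated in this post.

```python
def build_prompt(question: str) -> str:
    """Format a question in the 'Question: ... Answer:' style (an assumed prompt format)."""
    return f"Question: {question.strip()} Answer:"


def answer_about_image(image_path: str, question: str,
                       checkpoint: str = "Salesforce/blip2-opt-2.7b") -> str:
    """Load a BLIP-2 checkpoint and generate a text answer about one image."""
    # Heavy dependencies are imported lazily, so build_prompt works without them.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    # Use half precision on a GPU if one is available, full precision otherwise.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(
        checkpoint, torch_dtype=dtype
    ).to(device)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        images=image, text=build_prompt(question), return_tensors="pt"
    ).to(device, dtype)

    generated_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

Calling `answer_about_image("photo.jpg", "what is shown in the picture?")` downloads the checkpoint on first use, so the first call can take a while; `photo.jpg` is a placeholder path.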

BLIP-2 is a scalable multimodal pre-training method that enables any LLM to understand images while keeping its parameters entirely frozen. It is significantly more compute-efficient than existing multimodal pre-training methods. Why? BLIP-2 effectively Bootstraps Language-Image Pre-training with frozen image encoders and frozen LLMs. For example, transforming an existing 11B-parameter LLM into a state-of-the-art multimodal foundation model requires training less than 2% of the parameters (only 188M trainable parameters).
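
The "less than 2%" figure can be checked with simple arithmetic. The sketch below uses only the two numbers quoted above and, as a simplifying assumption, counts only the LLM in the denominator (the frozen image encoder would make the share even smaller).

```python
# Figures quoted in the text; only the LLM is counted in the total (assumption).
LLM_PARAMS = 11_000_000_000      # the "11B" LLM
TRAINABLE_PARAMS = 188_000_000   # the 188M trainable parameters


def trainable_fraction(trainable: int, total: int) -> float:
    """Return the share of parameters that receive gradient updates."""
    if total <= 0:
        raise ValueError("total must be positive")
    return trainable / total


fraction = trainable_fraction(TRAINABLE_PARAMS, LLM_PARAMS)
print(f"{fraction:.2%} of the LLM's parameter count is trained")  # prints 1.71% ...
```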

BLIP-2 is the first to unlock the capability of zero-shot instructed image-to-text generation. Given an input image, BLIP-2 can generate various natural language responses according to the user’s instruction. The following figure shows some examples from BLIP-2.

Example outputs from BLIP-2


Check out our Twitter thread for more interesting examples and use cases.

How does BLIP-2 work? Let’s take a deeper look at our pre-training method.

How BLIP-2 works

For LLMs to understand visual content, the key is to bridge the vision-language modality gap. Since LLMs have not seen any images during their natural language pre-training, it is challenging to bridge the modality gap, especially when the LLMs remain frozen. To this end, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy. As shown in the following figure, after pre-training, the Q-Former can effectively act as a bridge between a frozen image encoder and a frozen LLM, thus closing the modality gap.

Overview of BLIP-2 two-stage pre-training strategy

The first stage is vision-and-language representation learning. In this stage, we connect the Q-Former to a frozen image encoder and pre-train with image-text pairs. The Q-Former learns to extract image features that are most relevant to the corresponding text. We reinvent the pre-training objectives from BLIP for vision-and-language representation learning.

Overview of Q-Former and the first stage of vision-language representation learning in BLIP-2
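
To give a feel for how a querying module can extract features from a frozen image encoder, here is a toy sketch of cross-attention in plain Python. It assumes, as described in the BLIP-2 paper, that the Q-Former holds a fixed set of query vectors that attend to the image features. Everything else is a simplification: there are no learned projections, no self-attention, no text branch and no multiple heads, and the sizes and random values are made up for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_features):
    """Single-head cross-attention: each query becomes a weighted mix of image features."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        scores = [scale * sum(q * f for q, f in zip(query, feat))
                  for feat in image_features]
        weights = softmax(scores)
        mixed = [sum(w * feat[j] for w, feat in zip(weights, image_features))
                 for j in range(dim)]
        outputs.append(mixed)
    return outputs


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # toy sizes, not the real model's


def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_vectors(NUM_QUERIES)   # stand-in for the query vectors
few_patches = random_vectors(16)        # stand-in for frozen encoder output
many_patches = random_vectors(64)

out_few = cross_attend(queries, few_patches)
out_many = cross_attend(queries, many_patches)
print(len(out_few), len(out_many))  # 4 4: output length is set by the queries
```

The point of the sketch is that the number of output vectors depends only on the number of queries, not on how many patch features the image encoder produces.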

The second stage is vision-to-language generative learning. In this stage, we connect the output of Q-Former to a frozen LLM. We pre-train the Q-Former such that its output features can be interpreted by the LLM to generate the corresponding text. We experiment with both decoder-based LLMs (e.g. OPT) and encoder-decoder-based LLMs (e.g. FlanT5).

Overview of the second stage of vision-to-language generative learning in BLIP-2
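
The idea of training a bridge module while the downstream model stays frozen can be shown with a deliberately tiny example. In the sketch below, a single fixed number stands in for the frozen LLM and a single trainable number stands in for the Q-Former; both are hypothetical stand-ins, and the squared-error loss is chosen for simplicity rather than taken from BLIP-2. The gradient flows through the frozen part, but only the bridge weight is updated.

```python
FROZEN_LLM_WEIGHT = 2.0  # stands in for the frozen LLM; never updated


def frozen_llm(z: float) -> float:
    """Fixed downstream function that the bridge must learn to feed."""
    return FROZEN_LLM_WEIGHT * z


def train_bridge(samples, steps: int = 200, lr: float = 0.05) -> float:
    """Fit the single bridge weight by gradient descent on mean squared error."""
    w = 0.0  # the only trainable parameter
    for _ in range(steps):
        grad = 0.0
        for x, y in samples:
            pred = frozen_llm(w * x)
            # Chain rule: d/dw (pred - y)^2 = 2 * (pred - y) * FROZEN_LLM_WEIGHT * x
            grad += 2.0 * (pred - y) * FROZEN_LLM_WEIGHT * x
        w -= lr * grad / len(samples)
    return w


samples = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]  # targets follow y = 3x
bridge_weight = train_bridge(samples)
print(round(bridge_weight, 4))  # 1.5, since 2.0 * 1.5 = 3
```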

During inference, we simply append the text instruction after the Q-Former’s output as input to the LLM. We have experimented with various image encoders and LLMs, and arrived at a promising observation: a stronger image encoder and a stronger LLM both lead to better performance with BLIP-2. This observation indicates that BLIP-2 is a generic vision-language pre-training method that can efficiently harvest the rapid advances in the vision and natural language communities. BLIP-2 is an important, groundbreaking technique towards building a multimodal conversational AI agent.
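
The first sentence of the paragraph above, appending the instruction after the Q-Former’s output, amounts to building one input sequence for the LLM. The sketch below illustrates this with plain lists. The linear projection that maps query outputs to the LLM’s embedding size is an assumption taken from the BLIP-2 paper, and the function names, toy dimensions and values are made up for illustration.

```python
def project(vector, weight):
    """Apply a linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def build_llm_input(query_outputs, weight, instruction_embeddings):
    """Return the visual prefix followed by the instruction's token embeddings."""
    visual_prefix = [project(q, weight) for q in query_outputs]
    return visual_prefix + instruction_embeddings


# Toy sizes: query outputs have 3 dimensions, LLM embeddings have 2.
query_outputs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
projection_weight = [[1.0, 0.0, 0.0],
                     [0.0, 1.0, 1.0]]
instruction_embeddings = [[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]]

llm_input = build_llm_input(query_outputs, projection_weight, instruction_embeddings)
print(len(llm_input))  # 5: two visual positions followed by three text positions
```

Because the visual positions come first, a decoder-style LLM can condition every instruction token, and everything it generates, on the image.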

Community attention and efforts after BLIP-2 was released and open-sourced!

BLIP-2 has been extensively discussed and actively used by the AI communities.

Check out these projects and resources that use BLIP & BLIP-2 for various tasks!

The Bottom Line

We’ve proposed BLIP-2, a novel scalable multimodal pre-training method that transforms any LLM into a multimodal foundation model. Powered by the family of BLIP-2 pre-trained models, we’ve developed and released the world’s first open-sourced multimodal chatbot prototype.

There is still a lot of room to improve BLIP-2. Will BLIP-2 improve with supervised finetuning? How will BLIP-2 be useful for image generation? We look forward to further improving it with community feedback and exploring new use cases. Stay tuned for more exciting research!
