BLIP-2: Scalable Pre-training of Multimodal Foundation Models for the World's First Open-source Multimodal Chatbot

Junnan Li

Dongxu Li

Steven Hoi

TL;DR: We propose BLIP-2, a scalable multimodal pre-training method that enables any Large Language Model (LLM) to ingest and understand images, unlocks zero-shot image-to-text generation, and powers the world’s first open-sourced multimodal chatbot prototype.

OpenAI just released GPT-4, a powerful new multimodal AI model whose eye-catching capability is accepting image inputs to generate text. However, this capability is not new: it was already demonstrated by our BLIP-2 models and prototype, released on 30 January 2023. The BLIP-2 method enabled us to build the world’s first open-sourced multimodal chatbot prototype. Below we discuss the differences between BLIP-2 and OpenAI’s GPT-4.

BLIP-2 vs. GPT-4

Generic vs. Specific: BLIP-2 is a novel and generic vision-language pre-training method that can enable any family of LLMs to understand images and unlocks zero-shot image-to-text generation. GPT-4 is a specific pre-trained model whose technical novelty is unclear (not disclosed).
Open-source vs. Closed-source (API-only): The code and models of BLIP-2 are open-sourced in the LAVIS library (https://github.com/salesforce/LAVIS) and also integrated into HuggingFace Transformers (https://huggingface.co/docs/transformers/main/model_doc/blip-2). GPT-4 is a closed-source model with a paid API service (text-only API as of now).
Fast vs. Slow: BLIP-2 runs much faster than GPT-4. The inference time of BLIP-2 for each image is around 1 second on a single GPU. According to the GPT-4 livestream, GPT-4’s multimodal inference took nearly 40 seconds to process one image.
Unsupervised learning vs. (presumably) Supervised learning: BLIP-2 is trained on large amounts of noisy image-text pairs automatically crawled from the Internet. Although the learning paradigm of GPT-4 has not been disclosed, it can reasonably be inferred from ChatGPT that GPT-4 may have used large human-annotated datasets.
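
Because the models are integrated into HuggingFace Transformers, instructed image-to-text generation can be sketched in a few lines. The following is a minimal sketch rather than an official recipe: the checkpoint name `Salesforce/blip2-opt-2.7b`, the `Question: ... Answer:` prompt template, and the helper names `build_prompt` and `describe_image` are assumptions made for illustration, and the sketch requires `transformers`, `torch`, and `Pillow` to be installed.

```python
def build_prompt(question=None):
    """Return a text prompt for BLIP-2; an empty prompt asks for a plain caption."""
    if not question:
        return ""
    # Assumed prompt template for question answering with OPT-based checkpoints.
    return f"Question: {question.strip()} Answer:"


def describe_image(image_path, question=None,
                   checkpoint="Salesforce/blip2-opt-2.7b", max_new_tokens=30):
    """Caption an image, or answer a question about it, with a BLIP-2 checkpoint."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(question)
    if prompt:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    else:
        inputs = processor(images=image, return_tensors="pt")

    # Generate token ids and decode them back into text.
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

A call such as `describe_image("photo.jpg", "what is shown in the picture?")` (with a hypothetical local file `photo.jpg`) would download the checkpoint on first use, which is several gigabytes, so a GPU and reduced precision are advisable for anything beyond a quick experiment.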

BLIP-2 is a scalable multimodal pre-training method that enables any LLM to understand images while keeping its parameters entirely frozen. It is significantly more compute-efficient than existing multimodal pre-training methods. Why? BLIP-2 effectively Bootstraps Language-Image Pre-training with frozen image encoders and frozen LLMs. For example, transforming an existing 11B-parameter LLM into a state-of-the-art multimodal foundation model requires training less than 2% of the parameters (only 188M trainable parameters).
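
The "less than 2%" figure can be checked with simple arithmetic. The sketch below uses only the two numbers stated above (188M trainable parameters and an 11B-parameter frozen LLM). The size of the frozen image encoder is not given here, so it is left out; including it would only make the trainable share smaller. The function name is an assumption made for illustration.

```python
def trainable_fraction(trainable_params, frozen_params):
    """Fraction of all parameters that receive gradient updates."""
    total = trainable_params + frozen_params
    return trainable_params / total


TRAINABLE = 188_000_000       # trainable parameters, as stated in the text
FROZEN_LLM = 11_000_000_000   # frozen 11B LLM, as stated in the text

fraction = trainable_fraction(TRAINABLE, FROZEN_LLM)
print(f"Trainable share: {fraction:.2%}")  # prints "Trainable share: 1.68%"
```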

BLIP-2 is the first to unlock the capability of zero-shot instructed image-to-text generation. Given an input image, BLIP-2 can generate various natural language responses according to the user’s instruction. The following figure shows some examples from BLIP-2.

Example outputs from BLIP-2

Check out our Twitter thread for more interesting examples and use cases: https://twitter.com/LiJunnan0409/status/1621649677543440384

How does BLIP-2 work? Let’s take a deeper look at our pre-training method.

How BLIP-2 works

For LLMs to understand visual content, the key is to bridge the vision-language modality gap. Since LLMs have not seen any images during their natural language pre-training, it is challenging to bridge the modality gap, especially when the LLMs remain frozen. To this end, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy. As shown in the following figure, after pre-training, the Q-Former can effectively act as a bridge between a frozen image encoder and a frozen LLM, thus closing the modality gap.

Overview of BLIP-2 two-stage pre-training strategy
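
One way to picture the Q-Former's bridging role is as a small set of learned query vectors that attend over the frozen image encoder's output, producing a fixed number of output vectors regardless of how many image features come in. The sketch below is a toy illustration in plain Python, not the actual architecture: it uses a single attention head with no learned projections, and the sizes `DIM` and `NUM_QUERIES` and all function names are assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_features):
    """Let each query vector attend over all image feature vectors.

    Single head, no learned projections: a toy stand-in for the cross-attention
    a Q-Former-style module would use, not the real model.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        # Scaled dot-product score between this query and every image feature.
        scores = [
            sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
            for feat in image_features
        ]
        weights = softmax(scores)
        # Weighted average of the image features.
        outputs.append([
            sum(w * feat[i] for w, feat in zip(weights, image_features))
            for i in range(dim)
        ])
    return outputs


def random_vectors(count, dim, rng):
    """Generate `count` random vectors of width `dim` as stand-in features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8           # toy feature width (assumption)
NUM_QUERIES = 4   # toy number of learned queries (assumption)

queries = random_vectors(NUM_QUERIES, DIM, rng)  # would be trainable in a real model
small_image = random_vectors(16, DIM, rng)       # stand-in for 16 frozen image features
large_image = random_vectors(64, DIM, rng)       # stand-in for 64 frozen image features

out_small = cross_attend(queries, small_image)
out_large = cross_attend(queries, large_image)
print(len(out_small), len(out_large))  # both equal NUM_QUERIES
```

The output length depends only on the number of queries, which is what lets a module of this kind hand a compact, fixed-size visual summary to a frozen LLM.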

The first stage is vision-and-language representation learning. In this stage, we connect the Q-Former to a frozen image encoder and pre-train with image-text pairs. The Q-Former learns to extract image features that are most relevant to the corresponding text. We reinvent the pre-training objectives from BLIP (https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) for vision-and-language representation learning.

Overview of Q-Former and the first stage of vision-language representation learning in BLIP-2
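
This post does not list the individual first-stage objectives. One of those described in the BLIP-2 paper is image-text contrastive learning, which pulls an image's query outputs towards the embedding of its paired text and pushes them away from other texts. The sketch below is a generic, image-to-text-only contrastive loss in plain Python, not the authors' training code: the temperature value, the toy vectors, and the function names are assumptions. The image-text similarity is taken as the highest dot product between any query output and the text embedding.

```python
import math


def image_text_similarity(query_outputs, text_embedding):
    """Highest dot product between any query output and the text embedding."""
    return max(
        sum(q * t for q, t in zip(query, text_embedding))
        for query in query_outputs
    )


def contrastive_loss(batch_query_outputs, batch_text_embeddings, temperature=0.1):
    """Average image-to-text cross-entropy; text i is the positive for image i."""
    losses = []
    for i, query_outputs in enumerate(batch_query_outputs):
        logits = [
            image_text_similarity(query_outputs, text) / temperature
            for text in batch_text_embeddings
        ]
        # Stable log-sum-exp for the softmax normaliser.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two images with two query outputs each, and two text embeddings.
images = [
    [[1.0, 0.0], [0.8, 0.2]],   # queries for image 0 point along the first axis
    [[0.0, 1.0], [0.1, 0.9]],   # queries for image 1 point along the second axis
]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(images, matched_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(loss_matched < loss_swapped)  # True: correctly paired texts give a lower loss
```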

The second stage is vision-to-language generative learning. In this stage, we connect the output of Q-Former to a frozen LLM. We pre-train the Q-Former such that its output features can be interpreted by the LLM to generate the corresponding text. We experiment with both decoder-based LLMs (e.g. OPT) and encoder-decoder-based LLMs (e.g. FlanT5).

Overview of the second stage of vision-to-language generative learning in BLIP-2

During inference, we simply append the text instruction after the Q-Former’s output as input to the LLM. We have experimented with various image encoders and LLMs, and arrived at a promising observation: a stronger image encoder and a stronger LLM both lead to better performance with BLIP-2. This observation indicates that BLIP-2 is a generic vision-language pre-training method that can efficiently harvest the rapid advances in the vision and natural language communities. BLIP-2 is an important step towards building a multimodal conversational AI agent.
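
The inference step described above, in which the text instruction is appended after the Q-Former's output, can be pictured as sequence concatenation in the LLM's embedding space. The sketch below uses toy Python lists and is only an illustration. The linear projection `project` that maps Q-Former outputs to the LLM's embedding width, the dimensions, and all names are assumptions, and the instruction embeddings are stand-ins for what a real LLM's embedding table would return.

```python
def project(vectors, weight):
    """Map each vector to the LLM embedding width with a plain matrix multiply.

    `weight` has one row per input dimension and one column per output dimension.
    """
    return [
        [sum(v * w for v, w in zip(vec, column)) for column in zip(*weight)]
        for vec in vectors
    ]


def build_llm_input(query_outputs, weight, instruction_embeddings):
    """Place projected visual features first, then the instruction's token embeddings."""
    visual_prefix = project(query_outputs, weight)
    return visual_prefix + instruction_embeddings


# Toy sizes (assumptions): 3 query outputs of width 2, LLM embedding width 4.
query_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [
    [1.0, 0.0, 2.0, 0.0],   # row for input dimension 0
    [0.0, 1.0, 0.0, 2.0],   # row for input dimension 1
]
# Stand-in embeddings for a 2-token instruction.
instruction_embeddings = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.2, 0.3, 0.4]]

llm_input = build_llm_input(query_outputs, weight, instruction_embeddings)
print(len(llm_input))   # 5: three visual positions followed by two instruction tokens
print(llm_input[2])     # [1.0, 1.0, 2.0, 2.0], the projection of [1.0, 1.0]
```

Because the visual features occupy ordinary sequence positions, the frozen LLM needs no architectural change to consume them in a sketch like this one.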

Community attention and efforts after BLIP-2 was released and open-sourced!

BLIP-2 has been extensively discussed and actively used by the AI communities.

Check out these projects and resources that use BLIP & BLIP-2 for various tasks!

BLIP-2 + ChatGPT: https://github.com/Vision-CAIR/ChatCaptioner
BLIP + ChatGPT: https://github.com/microsoft/visual-chatgpt
ImageSEO: https://wordlift.io/blog/en/image-seo-using-ai/
BLIP + DreamBooth: https://github.com/KaliYuga-ai/DreamBooth_With_Dataset_Captioning/blob/main/DreamBooth_With_Dataset_Captioning.ipynb
BLIP-2 on HuggingFace: https://huggingface.co/blog/blip-2
BLIP Blog (Previously released model): https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/

The Bottom Line

We’ve proposed BLIP-2, a novel scalable multimodal pre-training method that transforms any LLM into a multimodal foundation model. Powered by the family of BLIP-2 pre-trained models, we’ve developed and released the world’s first open-sourced multimodal chatbot prototype.

There is still a lot of room to improve BLIP-2. Will BLIP-2 improve with supervised finetuning? How will BLIP-2 be useful for image generation? We look forward to further improving it with community feedback and exploring new use cases. Stay tuned for more exciting research!

Explore more

Read more details about our research in our research paper: https://arxiv.org/abs/2301.12597
Code: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
More about LAVIS - a one-stop vision-language library: https://blog.salesforceairesearch.com/lavis-language-vision-library/
Contact: Junnan Li at junnan.li@salesforce.com
Follow us on Twitter: @SFResearch @Salesforce
Visit our main website to learn more about all of the exciting projects Salesforce AI is working on: https://www.salesforceairesearch.com