TL;DR: We propose BLIP-2, a scalable multimodal pre-training method that enables any Large Language Model (LLM) to ingest and understand images, unlocks the capability of zero-shot image-to-text generation, and powers the world’s first open-sourced multimodal chatbot prototype.
OpenAI just released GPT-4, a powerful new multimodal AI model whose eye-catching capability is accepting image inputs to generate text. However, such capability is not new: it was already demonstrated by our BLIP-2 models and prototype, released on 30 January 2023. Our novel BLIP-2 method enables us to build the world’s first open-sourced multimodal chatbot prototype. Below we discuss the differences between our BLIP-2 model and OpenAI’s GPT-4.
BLIP-2 vs. GPT-4
BLIP-2 is a scalable multimodal pre-training method that enables any LLM to understand images while keeping its parameters entirely frozen. It is significantly more compute-efficient than existing multimodal pre-training methods. Why? BLIP-2 effectively Bootstraps Language-Image Pre-training with frozen image encoders and frozen LLMs. For example, transforming an existing 11B-parameter LLM into a state-of-the-art multimodal foundation model requires training less than 2% of the parameters (only 188M trainable parameters).
BLIP-2 is the first to unlock the capability of zero-shot instructed image-to-text generation. Given an input image, BLIP-2 can generate various natural language responses according to the user’s instruction. The following figure shows some examples from BLIP-2.
Example outputs from BLIP-2
Check out our Twitter thread for more interesting examples and use cases: https://twitter.com/LiJunnan0409/status/1621649677543440384
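To try this kind of instructed generation yourself, here is a minimal sketch using the Hugging Face transformers integration of BLIP-2 (assuming transformers v4.27 or later, the Salesforce/blip2-opt-2.7b checkpoint, and a GPU with enough memory; adjust the checkpoint and device to your setup):

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a BLIP-2 checkpoint (frozen image encoder + Q-Former + frozen OPT-2.7B).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Any image works; here we fetch a sample photo from COCO.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Zero-shot instructed generation: the text prompt steers what the model talks about.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

Leaving the prompt empty makes the same model produce a plain caption for the image.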
How does BLIP-2 work? Let’s take a deeper look at our pre-training method.
How BLIP-2 works
For LLMs to understand visual content, the key is to bridge the vision-language modality gap. Since LLMs have not seen any images during their natural language pre-training, it is challenging to bridge the modality gap, especially when the LLMs remain frozen. To this end, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy. As shown in the following figure, after pre-training, the Q-Former can effectively act as a bridge between a frozen image encoder and a frozen LLM, thus closing the modality gap.
Overview of BLIP-2 two-stage pre-training strategy
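For intuition, here is a heavily simplified PyTorch sketch of the core Q-Former idea: a fixed set of learnable query tokens self-attend to each other and cross-attend to the frozen image encoder’s output, distilling a variable-length image into a fixed-size set of visual features. The 32 queries of dimension 768 follow the paper; everything else (depth, initialization, the missing text branch) is an illustrative assumption, not the released implementation.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One simplified block: query self-attention, cross-attention to image features, FFN."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, image_feats):
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q, need_weights=False)[0]
        q = self.norm2(queries)
        # Queries extract visual information from the frozen image features.
        queries = queries + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]
        return queries + self.ffn(self.norm3(queries))

class QFormerSketch(nn.Module):
    """Learnable queries that compress an image into a fixed-size visual representation."""
    def __init__(self, num_queries=32, dim=768, depth=2):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(1, num_queries, dim))
        self.blocks = nn.ModuleList([QFormerBlock(dim) for _ in range(depth)])

    def forward(self, image_feats):  # image_feats: (B, num_patches, dim) from a frozen ViT,
        q = self.queries.expand(image_feats.size(0), -1, -1)  # assumed already projected to width dim
        for blk in self.blocks:
            q = blk(q, image_feats)
        return q  # (B, 32, dim): consumed by the stage-1 objectives or the frozen LLM in stage 2

# Toy usage: a batch of 2 "images", each with 257 patch features of width 768 (hypothetical shape).
qformer = QFormerSketch()
print(qformer(torch.randn(2, 257, 768)).shape)  # torch.Size([2, 32, 768])
```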
The first stage is vision-and-language representation learning. In this stage, we connect the Q-Former to a frozen image encoder and pre-train with image-text pairs. The Q-Former learns to extract the image features that are most relevant to the corresponding text, using pre-training objectives adapted from BLIP (https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/): image-text contrastive learning, image-text matching, and image-grounded text generation.
Overview of Q-Former and the first stage of vision-language representation learning in BLIP-2
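To make the first stage concrete, below is a hedged sketch of just the image-text contrastive objective (the image-text matching and image-grounded generation losses are omitted). As in the paper, the image side is represented by the 32 query outputs and the image-text similarity is taken as the maximum over the queries; the tensor names, shapes, and projection size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(query_feats, text_feats, temperature=0.07):
    """ITC sketch: align Q-Former query outputs with text features from the same pair.

    query_feats: (B, 32, D) projected query outputs, one image per row
    text_feats:  (B, D)     projected [CLS]-style text features
    """
    query_feats = F.normalize(query_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Similarity of every image i to every text t: take the best-matching query.
    sim = torch.einsum("iqd,td->iqt", query_feats, text_feats).max(dim=1).values  # (B, B)
    sim = sim / temperature

    targets = torch.arange(sim.size(0), device=sim.device)  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(sim, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(sim.t(), targets)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random features (batch of 4, projection width 256 as an assumed size).
print(image_text_contrastive_loss(torch.randn(4, 32, 256), torch.randn(4, 256)).item())
```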
The second stage is vision-to-language generative learning. In this stage, we connect the output of the Q-Former to a frozen LLM: a fully-connected layer projects the Q-Former’s output query features into the LLM’s text embedding space, where they act as a soft visual prompt. We pre-train the Q-Former such that its output features can be interpreted by the LLM to generate the corresponding text. We experiment with both decoder-based LLMs (e.g. OPT) and encoder-decoder-based LLMs (e.g. FlanT5).
Overview of the second stage of vision-to-language generative learning in BLIP-2
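The generative stage can be summarized in a few lines: the fully-connected layer projects the Q-Former’s output queries into the LLM’s embedding space, and the projected features are prepended to the text embeddings while the LLM itself stays frozen. The sketch below assumes a Hugging Face-style decoder-only LLM (e.g. OPT); the function and variable names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

def generative_pretraining_step(llm, proj, query_output, input_ids, labels):
    """Stage-2 sketch: condition a frozen causal LLM on projected Q-Former features.

    llm:          a frozen Hugging Face-style causal LM (all parameters requires_grad=False)
    proj:         nn.Linear(q_former_dim, llm_hidden_size), trainable
    query_output: (B, 32, q_former_dim) from the Q-Former
    input_ids:    (B, T) tokenized caption; labels: same ids with padding set to -100
    """
    visual_prefix = proj(query_output)                    # (B, 32, llm_hidden_size)
    text_embeds = llm.get_input_embeddings()(input_ids)   # (B, T, llm_hidden_size)
    inputs_embeds = torch.cat([visual_prefix, text_embeds], dim=1)

    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long,
                                device=inputs_embeds.device)

    # The visual prefix positions carry no language-modeling targets.
    prefix_labels = torch.full(visual_prefix.shape[:2], -100,
                               dtype=labels.dtype, device=labels.device)
    labels = torch.cat([prefix_labels, labels], dim=1)

    # Gradients flow into proj (and the Q-Former upstream), never into the frozen LLM weights.
    return llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels).loss
```

For the encoder-decoder variants (e.g. FlanT5), the text is instead split into a prefix and a suffix: the visual features plus the prefix go to the encoder, and the decoder learns to generate the suffix.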
During inference, we simply append the text instruction after the Q-Former’s output as input to the LLM. We have experimented with various image encoders and LLMs and arrived at a promising observation: a stronger image encoder and a stronger LLM both lead to better performance with BLIP-2. This indicates that BLIP-2 is a generic vision-language pre-training method that can efficiently harvest the rapid advances in the vision and natural language communities, making it an important step towards building a multimodal conversational AI agent.
Community attention and efforts after BLIP-2 was released and open-sourced!
BLIP-2 has been extensively discussed and actively used by the AI communities.
Check out these projects and resources that use BLIP & BLIP-2 for various tasks!
The Bottom Line
We’ve proposed BLIP-2, a novel and scalable multimodal pre-training method that transforms any LLM into a multimodal foundation model. Powered by the family of BLIP-2 pre-trained models, we’ve developed and released the world’s first open-sourced multimodal chatbot prototype.
There is still a lot of room to improve BLIP-2. Will BLIP-2 improve with supervised fine-tuning? How can BLIP-2 be useful for image generation? We look forward to further improving it with community feedback and exploring new use cases. Stay tuned for more exciting research!
Explore more