TL;DR: We propose ALPRO, a new video-and-language representation learning framework which achieves state-of-the-art performance on video-text retrieval and video question answering by learning fine-grained alignment between video regions and textual entities via entity prompts.
For more background (a review of key concepts used in this post), please see the Appendix.
Think about how dynamic and rich real-world human interaction is. From the football commentary you enjoyed with friends over beers, to some puzzling Jeopardy questions about The Matrix, to the untold recipes presented on the Hell’s Kitchen TV show, there appears to be no doubt that we all interact verbally within a dynamic world, where video and language play vital interconnected roles on an ongoing basis.
In other words, in the Digital Age, video and language content have become ubiquitous — all around us, constantly, 24/7. And humans, for the most part, seem to have no problem processing this fire hose of video and text content.
But what about Artificial Intelligence (AI)?
More specifically, since video and language are everywhere in the real world, a fundamental scientific question arises: how can we craft AI systems that jointly comprehend video content and human language?
Now you may be wondering: why is it important to build an AI model that reasons about video and language together?
We felt it was important to build such an AI model because many practical applications require the model to understand both modalities simultaneously. One example is content-based video search, which enables searching of a large volume of online videos, even without textual metadata. Another application is video categorization and recommendation, where the model can look at both video content and textual descriptions to label videos. This will be helpful for customized video search and recommendation.
To tackle this AI challenge, vision-language or video-and-language pre-training (VLP) techniques have recently emerged as an effective approach.
Using VLP methods, one first pre-trains neural networks on a large number of video-text pairs from the web. Although some of this web data may be noisy, it turns out that neural networks can still learn useful representations for downstream applications.
Later, the parameters of the neural networks, obtained after pre-training, are used as the initialization for fine-tuning.
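To make the "pre-train, then fine-tune" hand-off concrete, here is a minimal sketch in plain Python. It is not the ALPRO training code: parameters are stand-in lists of floats, and the names `init_params`, `init_from_pretrained`, and `qa_classifier` are hypothetical, chosen only for illustration. The point is simply that the fine-tuned model starts from the pre-trained weights, and only the task-specific head starts from scratch.

```python
import random


def init_params(names, size=4, seed=0):
    """Create small random parameter vectors (stand-ins for real weight tensors)."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(size)] for name in names}


def init_from_pretrained(pretrained, task_head_names, seed=1):
    """Copy pre-trained weights, then add freshly initialized task-specific heads."""
    params = {name: list(weights) for name, weights in pretrained.items()}
    params.update(init_params(task_head_names, seed=seed))
    return params


# Pretend these weights came out of pre-training on web video-text pairs.
pretrained = init_params(["video_encoder", "text_encoder", "multimodal_encoder"])

# Fine-tuning for a downstream task starts from the pre-trained weights;
# "qa_classifier" is a hypothetical new head that starts from random values.
finetune_params = init_from_pretrained(pretrained, ["qa_classifier"])
print(sorted(finetune_params))
```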
Despite promising progress, current VLP models suffer from several limitations, such as:
Misalignment across modalities: First, the video and text embeddings are not well aligned. The existing literature models cross-modal alignment in two main ways. Some work maximizes the similarity between unimodal embeddings from the same video-text pair, for example by taking the dot product between them. Other work feeds the unimodal embeddings directly into a cross-modal encoder, in the hope that the encoder will capture the alignment automatically. However, because the unimodal video and text embeddings are produced by separate encoder networks, they reside in different feature spaces. As a result, both approaches are quite ineffective at modeling cross-modal alignment.
Lack of fine-grained video information: Second, many visually-grounded pre-training tasks do not explicitly model fine-grained regional visual information, even though this information is critical for understanding video content. Some prior attempts (such as ActBERT) employ object detectors to generate pseudo-labels as supervision. Specifically, they first apply a detector (e.g., Faster-RCNN) to the video frames to produce object labels, then use these labels to supervise the pre-training model. However, if you have ever played with object detectors, you may know that annotated detection datasets cover relatively few object categories. For example, the MSCOCO object detection dataset has fewer than a hundred object classes. This limits the VLP model's ability to learn the abundant variety of object and entity concepts. In short, such VLP models suffer from imprecise detections and a restricted number of object categories.
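To illustrate the first family of approaches (maximizing the dot-product similarity of embeddings from the same video-text pair), here is a small sketch of a generic symmetric contrastive objective in plain Python. It is a common formulation of this idea, not necessarily the exact loss used by any particular model mentioned here. The toy embeddings, the function names, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def similarity_matrix(video_embs, text_embs):
    """sim[i][j] is the dot product of normalized video i and normalized text j."""
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[dot(v, t) for t in texts] for v in videos]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss; matched pairs sit on the diagonal of sim."""
    n = len(sim)

    def directional_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (directional_loss(sim) + directional_loss(columns))


# Toy 2-D embeddings (made up): video i is meant to match text i.
video_embs = [[1.0, 0.1], [0.0, 1.0]]
text_embs_aligned = [[0.9, 0.0], [0.1, 1.2]]
text_embs_swapped = [text_embs_aligned[1], text_embs_aligned[0]]

loss_aligned = contrastive_loss(similarity_matrix(video_embs, text_embs_aligned))
loss_swapped = contrastive_loss(similarity_matrix(video_embs, text_embs_swapped))
print(loss_aligned < loss_swapped)
```

The loss is low when each video is most similar to its own caption and high when the pairs are mismatched, which is what pulls matching videos and texts together in a shared space.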
To address the limitations of existing work, we propose ALign and PROmpt (ALPRO), a new video-and-language representation learning (pre-training) framework.
ALPRO follows the “pre-training-then-finetuning” paradigm used in the VLP techniques mentioned earlier, but addresses the limitations of those methods. Our framework operates on sparsely-sampled video frames and achieves more effective cross-modal alignment, without explicit object detectors.
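As a rough illustration of what "sparsely-sampled video frames" can mean in practice, the sketch below picks a handful of evenly spaced frame indices from a clip instead of processing every frame. This is one simple sampling scheme assumed for illustration. The function name `sample_frame_indices` and the choice of 8 frames are hypothetical, and the sampling strategy ALPRO actually uses may differ.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick indices spread evenly over a clip, one from the middle of each segment."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# A 300-frame clip reduced to 8 frames for the video encoder.
indices = sample_frame_indices(num_frames=300, num_samples=8)
print(indices)
```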
The ultimate goal of our new approach is to improve downstream task performance — for example, on the tasks of video-text retrieval and video question answering (video QA). An improved pre-training strategy, as proposed in ALPRO, gives better video-language representations, which in turn contribute to improved downstream task performance.
The resulting pre-trained model in ALPRO achieves state-of-the-art performance on two classic tasks, video-text retrieval and video QA, across four public datasets. Our approach outperforms prior work by a substantial margin, while being much more label-efficient than other competing methods.
Let’s examine the techniques behind the ALPRO model in greater detail.
The novel ALPRO model (see figure above) consists of two main modules: a video-language pre-training model and a prompter. The prompter generates soft entity labels that supervise the pre-training of the video-language model. Each module contains its own video encoder (TimeSformer) and text encoder (the first 6 layers of BERT) to extract features from video and text inputs, respectively. The pre-training model has an additional multimodal encoder (the last 6 layers of BERT) to further capture the interaction between the two modalities.
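To give a feel for how a prompter could turn entity prompts into soft labels, here is a minimal sketch in plain Python. It assumes that soft labels come from a softmax over similarities between a video-region embedding and the text embeddings of entity prompts, and that the pre-training model is then trained to match those labels with a cross-entropy loss. The entity list, the embeddings, the temperature of 0.1, and all function names are made up for illustration; please refer to the paper for the actual formulation.

```python
import math


def softmax(scores, temperature=1.0):
    logits = [s / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_entity_labels(region_emb, prompt_embs, temperature=0.1):
    """Prompter side: similarity of one video region to each entity prompt."""
    sims = [sum(a * b for a, b in zip(region_emb, p)) for p in prompt_embs]
    return softmax(sims, temperature)


def soft_cross_entropy(pred_logits, target_probs):
    """Pre-training side: match predicted entity scores to the soft labels."""
    pred = softmax(pred_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred))


# Hypothetical entity vocabulary with toy prompt embeddings.
entities = ["dog", "ball", "grass"]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_emb = [0.8, 0.1, 0.3]  # toy embedding of a cropped video region

labels = soft_entity_labels(region_emb, prompt_embs)
good_loss = soft_cross_entropy([5.0, 0.0, 0.0], labels)  # model predicts "dog"
bad_loss = soft_cross_entropy([0.0, 5.0, 0.0], labels)   # model predicts "ball"
print(entities[labels.index(max(labels))], good_loss < bad_loss)
```

Because the labels are a probability distribution rather than a single hard class, the model can be supervised on any entity that can be written as a text prompt, without a fixed detector vocabulary.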
Now let’s take a closer look at two key pre-training tasks performed by ALPRO:
Examples of generated pseudo-labels for the selected video regions are shown in the figure below.
As you can see, the categories contained in these pseudo-labels capture quite a diverse range of different visual concepts. Most of these concepts are not observed in the detection datasets. What’s even cooler: to generate these pseudo-labels, ALPRO does not need any human annotations on either object bounding boxes or categories. All of these are learned in a self-supervised manner.
ALPRO achieves state-of-the-art performance on four common video-language downstream datasets for the video-text retrieval and video QA tasks, as shown in the tables below.
On the widely used video-text retrieval dataset MSRVTT, ALPRO surpasses the previous best retrieval model, FiT, by 5 absolute points in Recall@1 under the zero-shot setup.
On video QA, ALPRO achieves results comparable to VQA-T, which utilizes QA-specific domain pre-training pairs.
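For readers unfamiliar with the metric: Recall@K is the percentage of text queries whose ground-truth video appears among the top K retrieved results, so a 5-point absolute gain in Recall@1 means 5 more queries out of every 100 get the correct video ranked first. The sketch below computes it from a query-by-video similarity matrix. The scores are toy values made up for illustration, not results from the paper.

```python
def recall_at_k(sim, k=1):
    """sim[i][j] scores text query i against video j; video i is the correct match."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity scores for 3 queries and 3 videos.
sim = [
    [0.9, 0.2, 0.1],  # query 0: correct video ranked first
    [0.3, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(round(r1, 1), round(r2, 1))
```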
Note that ALPRO achieves its superior performance using only about 5-10% of the pre-training data used by previous approaches, meaning that ALPRO is much more label-efficient. For more details, please check out our paper.
Results on video-text retrieval datasets MSRVTT (left) and DiDeMo (right).
Results on video QA datasets.
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Align and Prompt: Video-and-Language Pre-training with Entity Prompts. Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi.
PASS: An ImageNet replacement for self-supervised pretraining without humans. Yuki M. Asano, Christian Rupprecht, Andrew Zisserman, Andrea Vedaldi.
Dongxu Li is a Research Scientist at Salesforce Research. His research focuses on multimodal understanding and its applications.
Junnan Li is a Senior Research Manager at Salesforce Research. His current research focuses on vision and language AI. His ultimate research goal is to build generic AI models that can self-learn without human supervision.
Steven C.H. Hoi is Managing Director of Salesforce Research Asia and oversees Salesforce's AI research and development activities in APAC. His research interests include machine learning and a broad range of AI applications.
Donald Rose is a Technical Writer at Salesforce AI Research. Specializing in content creation and editing, Dr. Rose works on multiple projects, including blog posts, video scripts, news articles, media/PR material, social media, writing workshops, and more. He also helps researchers transform their work into publications geared towards a wider audience.