Salesforce Research at CVPR 2022

Conference Overview

The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is the annual conference on computer vision. CVPR comprises the main conference as well as workshops and courses, providing a unique learning experience and networking opportunities in the field of computer vision.

CVPR 2022 will take place in a hybrid format, hosting attendees both virtually and in person in New Orleans, LA, from June 19 to 24, 2022.

Salesforce AI Research Publications at CVPR 2022

Salesforce Research is pleased to announce two accepted papers from our team of leading researchers.

Our accepted authors will present their work at CVPR through pre-recorded talks and in-person poster sessions during the main conference. We look forward to sharing some of our exciting new research, whether virtually or face-to-face in New Orleans!

Salesforce Researchers are shown in bold in the publication descriptions below.

Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework

Shu Zhang, Ran Xu, Caiming Xiong, Chetan Ramaiah

In this paper, we present a hierarchical multi-label representation learning framework that can leverage all available labels and preserve the hierarchical relationship between image classes. We apply a hierarchical penalty to the contrastive loss, and enforce the hierarchy constraint. The loss function is data driven and automatically adapts to arbitrary multi-label structures. Experiments on several datasets show that our relationship-preserving embedding performs well on a variety of tasks, and outperforms the baseline supervised and self-supervised approaches.
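To make the idea of a hierarchy-weighted contrastive loss more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's exact formulation: each label is assumed to be a tuple ordered from coarse to fine, positives at a given level are assumed to be samples that share the label prefix up to that level, and the per-level `penalty` function is a hypothetical stand-in for the hierarchical penalty described above. The function names are invented for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supcon_at_level(embeddings, labels, level, temperature=0.1):
    """Supervised contrastive loss at one hierarchy level.

    Assumption: two samples are positives at `level` when their label
    tuples agree on the first `level` entries.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j][:level] == labels[i][:level]]
        if not positives:
            continue  # this anchor has no positive pair at this level
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature for j in others}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def hierarchical_contrastive_loss(embeddings, labels, penalty=lambda level: float(level)):
    """Sum the per-level losses, each scaled by a hypothetical penalty."""
    depth = len(labels[0])
    return sum(
        penalty(level) * supcon_at_level(embeddings, labels, level)
        for level in range(1, depth + 1)
    )


if __name__ == "__main__":
    labels = [("animal", "dog"), ("animal", "dog"), ("animal", "cat"), ("vehicle", "car")]
    embeddings = [[1.0, 0.0], [0.99, 0.1], [0.8, 0.6], [-1.0, 0.0]]
    print(hierarchical_contrastive_loss(embeddings, labels))
```

In this sketch, embeddings whose geometry agrees with the label tree (the two dogs closest together, the cat nearby, the car far away) receive a lower loss than embeddings that contradict it. The linear penalty is only a placeholder; any other weighting can be passed in through the `penalty` argument.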

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven Hoi

In this paper, we propose ALPRO, a new video-and-language representation learning framework that achieves state-of-the-art performance on video-text retrieval and video question answering. ALPRO learns fine-grained alignment between video regions and textual entities via two novel pre-training objectives: video-text contrastive (VTC) and prompting entity modeling (PEM). The VTC loss focuses on instance-level video-text alignment, while PEM enhances alignment between video regions and text entities. Experiments show that ALPRO outperforms prior work by a significant margin across multiple video-text retrieval and video question answering datasets, while being much more label-efficient.
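The instance-level alignment that a video-text contrastive objective targets can be sketched as a symmetric InfoNCE-style loss over a batch of paired embeddings. The code below is a minimal illustration in plain Python, not ALPRO's implementation: it assumes each video and each caption has already been encoded into a fixed-size vector, that row i of both lists belongs to the same pair, and that the temperature value is arbitrary. The function names are hypothetical, and PEM is not represented.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(sim):
    """Mean cross-entropy where the correct column for row i is column i."""
    loss = 0.0
    for i, row in enumerate(sim):
        row_max = max(row)  # subtract the max for numerical stability
        log_denom = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        loss += log_denom - row[i]
    return loss / len(sim)


def video_text_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (video, text) pairs.

    Assumption: video_embs[i] and text_embs[i] form a matched pair, and every
    other text (or video) in the batch serves as a negative.
    """
    videos = [_normalize(v) for v in video_embs]
    texts = [_normalize(t) for t in text_embs]
    sim = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    video_to_text = _diagonal_cross_entropy(sim)
    text_to_video = _diagonal_cross_entropy(sim_transposed)
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    print(video_text_contrastive_loss(videos, captions))
```

When matched pairs point in the same direction, the loss is close to zero; swapping the captions so that each video is paired with the wrong text drives it up sharply. Averaging the video-to-text and text-to-video terms keeps the objective symmetric across the two modalities.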