ALPRO: Understanding Video and Language by Aligning Visual Regions and Text Entities
TL;DR: We propose ALPRO, a new video-and-language representation learning framework that achieves state-of-the-art performance on video-text retrieval and video question answering. It does so by learning fine-grained alignment between video regions and textual entities via entity prompts. For more background (a review of key concepts used in this post), please see the
31 May 2022 • Dongxu Li • #ALPRO