Align before Fuse (ALBEF): Advancing Vision-language Understanding with Contrastive Learning
> TL;DR: We propose a new vision-language representation learning framework that achieves state-of-the-art performance by first aligning the unimodal representations before fusing them. Vision and language are two of the most fundamental channels for humans to perceive the world. It has been a long-standing goal in AI to build intelligent
19 Jul 2021 • Junnan Li • #vision and language