direct preference optimization

Aligning Diffusion Models to Human Preferences

TLDR: Learning from human preferences, specifically Reinforcement Learning from Human Feedback (RLHF), has been a key recent component in the development of large language models such as ChatGPT or Llama 2. Until recently, the impact of human feedback training on text-to-image models was much more limited. In this work, Diffusion-DPO,

08 Jan 2024 • Bram Wallace • #reinforcement-learning
