HIVE has been accepted to CVPR 2024.
Other authors include: Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong
We have seen the success of ChatGPT, which incorporates human feedback to align text generated by large language models with human preferences. Can human feedback similarly be used to align state-of-the-art instructional image editing models? Salesforce researchers have now developed HIVE, one of the earliest works to fine-tune diffusion-based generative models with human feedback.
Background
Thanks to the impressive generation abilities of text-to-image generative models (e.g., Stable Diffusion), instructional image editing (e.g., InstructPix2Pix) has emerged as one of the most promising application scenarios. Different from traditional image editing, where both the input caption and the edited caption are needed, instructional image editing requires only human-readable instructions. For instance, classic image editing approaches require an input caption “a dog is playing with a ball” and an edited caption “a cat is playing with a ball”. In contrast, instructional image editing needs only an editing instruction such as “change the dog to a cat”. This experience mimics how humans naturally perform image editing.
InstructPix2Pix fine-tunes a pre-trained Stable Diffusion model on curated triplets of original image, editing instruction, and edited image, generated with the help of GPT-3 and Prompt-to-Prompt image editing.
Though it achieves promising results, the training data generation process of InstructPix2Pix lacks explicit alignment between editing instructions and edited images.
Motivation
There is a pressing need to develop a model that aligns human feedback with diffusion-based models for the instructional image editing problem. For large language models such as InstructGPT and ChatGPT, we often first learn a reward function to reflect what humans care about or prefer in the generated text output, and then leverage reinforcement learning (RL) algorithms such as proximal policy optimization (PPO) to fine-tune the models. This process is often referred to as reinforcement learning from human feedback (RLHF).
The core challenge is how to leverage RLHF to fine-tune diffusion-based generative models: applying PPO to maximize rewards during fine-tuning can be prohibitively expensive because each sampled image requires hundreds or thousands of denoising steps. Moreover, even with fast sampling methods, it is still challenging to back-propagate the gradient signal to the parameters of the U-Net.
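To make the alternative concrete, here is a rough sketch in our own notation (not necessarily the exact objective from the paper): the standard instructional diffusion fine-tuning loss can be augmented by scaling each example's denoising loss with an exponentiated reward, so human-preferred edits contribute more to the gradient without any RL rollouts. Here \(c\) is the input image condition, \(I\) the editing instruction, \(r\) a learned reward, and \(\beta\) a temperature; all symbols are our assumptions.

```latex
% Standard instructional diffusion fine-tuning (noise-prediction) loss
\mathcal{L}_{\mathrm{SFT}}(\theta) =
  \mathbb{E}_{(x, c, I),\, t,\, \epsilon}
  \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c, I) \right\|_2^2 \right]

% Reward-weighted variant: edits preferred by humans (high r)
% are up-weighted, so no PPO-style sampling loop is needed.
\mathcal{L}_{\mathrm{RW}}(\theta) =
  \mathbb{E}_{(x, c, I),\, t,\, \epsilon}
  \left[ \exp\!\big(r(x, c, I)/\beta\big)\,
  \left\| \epsilon - \epsilon_\theta(z_t, t, c, I) \right\|_2^2 \right]
```

The key design point is that the reward enters only as a per-example weight on the usual denoising loss, so the training loop stays as cheap as ordinary supervised fine-tuning.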
Method
The proposed HIVE consists of three steps: (1) instructional supervised fine-tuning, where a diffusion model is trained on triplets of original image, editing instruction, and edited image; (2) reward learning, where human annotators rank multiple edited outputs for the same input and instruction, and a reward model is trained to reflect these preferences; and (3) human-feedback-guided fine-tuning, where the learned reward is folded into the diffusion training objective via reward-weighted likelihood, avoiding expensive RL rollouts.
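The reward-weighting idea in the last step can be sketched in a few lines. This is a minimal NumPy illustration under our own assumptions (function name, `beta`, and batch-level normalization are ours, not from the paper); it takes per-example denoising losses and scalar rewards that are assumed to be computed elsewhere.

```python
import numpy as np

def reward_weighted_loss(per_example_losses, rewards, beta=1.0):
    """Weight each example's denoising loss by exp(reward / beta).

    High-reward (human-preferred) edits contribute more to the
    batch loss, steering fine-tuning toward preferred behavior
    without any RL-style sampling loop.
    """
    losses = np.asarray(per_example_losses, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    weights = np.exp(rewards / beta)
    weights = weights / weights.sum()  # normalize within the batch
    return float((weights * losses).sum())
```

For example, with two examples whose losses are 1.0 and 0.0, putting the high reward on the first example yields a much larger batch loss than putting it on the second, which is exactly the asymmetry that pushes the model toward preferred edits.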
Experiments
We open-source all training and evaluation data, and evaluate our method on two datasets: a synthetic dataset with 15K image pairs from InstructPix2Pix and a self-collected dataset of 1K real image-instruction pairs. For the synthetic dataset, we follow InstructPix2Pix and plot the tradeoff between CLIP image similarity and directional CLIP similarity. For the 1K dataset, we conduct a user study decided by majority vote.
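The two automatic metrics are both cosine similarities in CLIP embedding space: CLIP image similarity measures how well the edit preserves the input, while directional CLIP similarity measures whether the image-embedding change matches the caption-embedding change. A minimal NumPy sketch, assuming CLIP embeddings have already been extracted (function names are ours):

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_image_similarity(img_before, img_after):
    """Similarity between original and edited image embeddings;
    high values mean the edit preserved the input image."""
    return _cos(img_before, img_after)

def directional_clip_similarity(img_before, img_after, txt_before, txt_after):
    """Similarity between the image-embedding change and the
    text-embedding change. High values mean the edit moved the
    image in the direction the caption change describes."""
    return _cos(img_after - img_before, txt_after - txt_before)
```

The two metrics pull in opposite directions (a no-op edit maximizes image similarity but scores poorly on the directional metric), which is why results are reported as a tradeoff curve.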
Key results:
We show that HIVE consistently outperforms InstructPix2Pix, both against the official model and against our replicated model trained on our enlarged training dataset.
Explore More
arXiv: https://arxiv.org/abs/2303.09618
Code and data: https://github.com/salesforce/HIVE
Web: https://shugerdou.github.io/hive
Contact: shu.zhang@salesforce.com, x.yang@salesforce.com, yihaofeng@salesforce.com.