GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation


Other authors include: Can Qin, Stefano Ermon, Yun Fu

GlueGen was accepted by ICCV.

Text-to-image synthesis has made remarkable progress in generating lifelike images from textual prompts. However, a significant challenge remains: how can we integrate powerful pre-trained text encoders into existing image generators without time-consuming and resource-hungry retraining? This is where GlueGen, a plug-and-play approach for X-to-image generation, comes in. The paper presents the GlueNet framework, a flexible way to "glue" new components, such as advanced text encoders or even audio models, into existing generators. By bridging the gap between different modalities, GlueNet enables better text understanding, multilingual capabilities, and new input types like sound.


Text-to-image (T2I) synthesis, generating photorealistic images from text prompts, has witnessed a tremendous surge in capabilities recently. Models like Imagen, Stable Diffusion, and DALL-E-3 can produce impressively high-quality and diverse images guided by input descriptions. These breakthroughs are powered by advances in deep generative models, especially diffusion-based approaches. A critical aspect of this progress is conditioning the model on text or other modality inputs at each denoising step to enable control over the generated image content. The text prompt guides the model to reconstruct an image matching the description. This conditional diffusion approach has proven highly effective for text-to-image generation.

Nevertheless, existing models exhibit a high degree of coupling between their encoders and decoders. The text encoder is specifically optimized to produce embeddings suited for the latent representation of that particular model. This makes it quite difficult to improve or modify components of the system. For example, directly replacing the text encoder with a more powerful pre-trained model fails because its representations mismatch the latent space the diffusion decoder expects.

Substantial model retraining would be required to adapt the text and image pathways to the new representation space. But this is time-consuming, data-hungry, GPU-hungry, and not eco-friendly. Retraining the full model end-to-end quickly becomes prohibitively costly. This severely hinders iterating on and enhancing text-to-image models.


The goal is to facilitate seamless plug-in of off-the-shelf pre-trained components like text encoders, audio encoders, point cloud encoders, etc. into existing generators without modification. This would allow for enhancing text with better language understanding, adding multilingual capabilities, and incorporating new modalities like sound. Critically, this needs to be achieved without expensive end-to-end retraining.


(a) Illustration of feature transformation throughout the model translation/alignment. (b) The general pipeline and learning objectives of our proposed GlueNet. (c) Detailed architecture of GlueNet Encoder/Decoder.

The core idea is to insert an alignment module between the new component and the generator to map representations into a shared space. The encoder portion transforms features from the new model into the latent space expected by the generator. It minimizes both element-wise and distribution-level differences between the new features and the generator's latent space. This enables the conditioned diffusion decoder to understand the new representations without any parameter changes. Critically, a decoder module then reconstructs the original features from the aligned space. Without this reconstruction constraint, alignment could degrade the representations; with it, the full information and semantics captured by the new component's embeddings are preserved.
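To make the three quantities concrete, here is a minimal NumPy sketch of how the element-wise, distribution-level, and reconstruction terms could be computed for one batch. Everything in it is an assumption made for illustration: the linear `w_enc`/`w_dec` stand-ins, the feature sizes, and the synthetic data are hypothetical, and simple moment matching is swapped in for the adversarial distribution-level loss rather than reproducing it.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def glue_losses(x_new, y_gen, w_enc, w_dec):
    """Return (alignment, distribution, reconstruction) losses for one batch.

    x_new: features from the new encoder, shape (batch, d_new)
    y_gen: features the generator was trained with, shape (batch, d_gen)
    w_enc, w_dec: hypothetical linear stand-ins for the GlueNet encoder/decoder
    """
    z = x_new @ w_enc                      # map into the generator's conditioning space
    align = mse(z, y_gen)                  # element-wise difference
    # Assumption: moment matching stands in for the adversarial, distribution-level term.
    dist = mse(z.mean(axis=0), y_gen.mean(axis=0)) + mse(z.std(axis=0), y_gen.std(axis=0))
    recon = mse(z @ w_dec, x_new)          # the decoder must recover the original features
    return align, dist, recon

rng = np.random.default_rng(0)
x_new = rng.normal(size=(32, 8))           # synthetic "new encoder" features
y_gen = rng.normal(size=(32, 4))           # synthetic "original encoder" features
w_enc = rng.normal(scale=0.1, size=(8, 4))
w_dec = rng.normal(scale=0.1, size=(4, 8))

align, dist, recon = glue_losses(x_new, y_gen, w_enc, w_dec)
print(f"align={align:.3f} dist={dist:.3f} recon={recon:.3f}")
```

The reconstruction term is what ties the aligned features back to the new encoder's output: driving the alignment terms down alone would say nothing about how much of the original embedding survives the mapping.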

GlueNet is trained solely on readily available parallel data, such as parallel text pairs, audio-label pairs, or other paired inputs. No conditional image data is needed. The objectives are the reconstruction loss and the adversarial alignment loss, measured directly between the parallel samples. Only GlueNet's parameters are updated; the generator stays fixed.
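The training setup can be sketched with a toy example. This is a simplified illustration under stated assumptions, not the paper's implementation: the encoder and decoder are hypothetical linear maps, the parallel features are synthetic, plain gradient descent with hand-written gradients replaces a real optimizer, and a mean-squared alignment term stands in for the adversarial loss. The generator never appears in the loop, which is the point: only the two GlueNet matrices change.

```python
import numpy as np

def train_gluenet(x_new, y_gen, steps=500, lr=0.1, seed=0):
    """Fit linear stand-ins for the GlueNet encoder/decoder on parallel features."""
    rng = np.random.default_rng(seed)
    d_new, d_gen = x_new.shape[1], y_gen.shape[1]
    w_enc = rng.normal(scale=0.1, size=(d_new, d_gen))
    w_dec = rng.normal(scale=0.1, size=(d_gen, d_new))
    history = []
    for _ in range(steps):
        z = x_new @ w_enc                      # aligned features
        align_err = z - y_gen                  # mismatch with the generator's features
        recon_err = z @ w_dec - x_new          # mismatch after decoding back
        history.append(float(np.mean(align_err ** 2) + np.mean(recon_err ** 2)))
        # Gradients of the two mean-squared-error terms, derived by hand.
        g_align = 2.0 * align_err / align_err.size
        g_recon = 2.0 * recon_err / recon_err.size
        grad_dec = z.T @ g_recon
        grad_enc = x_new.T @ (g_align + g_recon @ w_dec.T)
        # Only the GlueNet parameters are updated.
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return w_enc, w_dec, history

rng = np.random.default_rng(1)
x_new = rng.normal(size=(256, 4))                        # synthetic new-encoder features
y_gen = x_new @ rng.normal(scale=0.5, size=(4, 6))       # synthetic paired target features
w_enc, w_dec, history = train_gluenet(x_new, y_gen)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the data are just paired feature vectors, no images are involved at any point in this loop.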

During inference, the new component's output feeds into the GlueNet encoder, which aligns it to the latent space, then into the unchanged generator for conditional image synthesis. This approach neatly circumvents the necessity for any generator retraining.
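The inference path is a straight composition of three stages, sketched below. All names, sizes, and weights here are hypothetical stand-ins: `new_encoder` produces a deterministic pseudo-embedding from the prompt bytes, and `frozen_generator` is a single fixed projection rather than an iterative diffusion sampler. The sketch only shows the data flow and that the generator's weights are read, never written.

```python
import numpy as np

rng = np.random.default_rng(0)
D_NEW, D_GEN, IMG_SHAPE = 8, 4, (16, 16)   # hypothetical sizes

W_GLUE = rng.normal(scale=0.1, size=(D_NEW, D_GEN))                    # trained GlueNet encoder (stand-in)
W_GENERATOR = rng.normal(size=(D_GEN, IMG_SHAPE[0] * IMG_SHAPE[1]))    # frozen generator weights (stand-in)

def new_encoder(prompt: str) -> np.ndarray:
    """Hypothetical new encoder: a deterministic pseudo-embedding of the prompt."""
    seed = sum(prompt.encode("utf-8"))
    return np.random.default_rng(seed).normal(size=D_NEW)

def glue_encode(features: np.ndarray) -> np.ndarray:
    """Map new-encoder features into the space the generator expects."""
    return features @ W_GLUE

def frozen_generator(condition: np.ndarray) -> np.ndarray:
    """Stand-in for the unchanged generator: reads W_GENERATOR, never modifies it."""
    return np.tanh(condition @ W_GENERATOR).reshape(IMG_SHAPE)

def generate(prompt: str) -> np.ndarray:
    """New encoder -> GlueNet encoder -> unchanged generator."""
    return frozen_generator(glue_encode(new_encoder(prompt)))

image = generate("a dog barking in the rain")
print(image.shape)
```

Note that the GlueNet decoder does not appear here; in this sketch it is only needed during training, where it enforces the reconstruction constraint.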


The researchers comprehensively validated GlueNet's capabilities across multiple experiments:

1. For text-to-image generation, they upgraded the Latent Diffusion Model by replacing its standard text encoder with the much larger T5-3B model. GlueNet successfully aligned the representations, improving image quality and controllability without any finetuning. Further finetuning on image-text pairs provided additional gains.

2. For multilingual-to-image synthesis, they aligned the XLM-RoBERTa multilingual text encoder using GlueNet. This enabled generating images from prompts in Chinese, French, Spanish, Italian and Japanese without any model retraining. Performance exceeded translation baselines while requiring far less training data.

3. They enabled direct sound-to-image generation by bridging the AudioCLIP audio encoder with Stable Diffusion's text pathway using GlueNet. This allowed plausible image generation from sound inputs without any finetuning. It significantly outperformed a baseline retrieving images with audio labels.

4. Experiments also demonstrated that GlueNet can blend modalities as input guidance, enabling text-audio mixtures to guide image generation. Initial results also successfully incorporated point cloud networks, showing the flexibility of the approach.
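One simple way to picture the text-audio mixtures from the last experiment is a weighted average of two conditioning vectors that have already been aligned to the same space. The article does not spell out the mixing rule, so the convex combination below, along with the function name and the `audio_weight` parameter, is purely an assumption for illustration.

```python
import numpy as np

def blend_conditions(text_cond, audio_cond, audio_weight=0.5):
    """Convex combination of two aligned conditioning vectors (assumed mixing rule)."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    text_cond = np.asarray(text_cond, dtype=float)
    audio_cond = np.asarray(audio_cond, dtype=float)
    if text_cond.shape != audio_cond.shape:
        raise ValueError("both conditions must live in the same aligned space")
    return (1.0 - audio_weight) * text_cond + audio_weight * audio_cond

# Toy vectors standing in for aligned text and audio conditions.
mixed = blend_conditions([0.0, 2.0], [2.0, 0.0], audio_weight=0.5)
print(mixed)
```

A rule like this only makes sense after alignment, since both inputs must already be expressed in the generator's conditioning space.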

Bottom Line

GlueNet offers an exciting path forward for controllable X-to-image generation. The plug-and-play approach substantially reduces the barriers to enhancing existing models and building more powerful generative systems. With capabilities rapidly improving, alignment techniques will likely be essential for experimenting with and efficiently integrating new state-of-the-art components. This work provides a strong foundation, demonstrating that GlueNet can blend models across multiple modalities without retraining the underlying models.
