VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff

Published 2022 in European Conference on Computer Vision

ABSTRACT

Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training by using a multimodal encoder to guide image generations. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches like DALL-E [38], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
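
As a rough illustration of the idea in the abstract, the sketch below optimizes only a latent vector so that the encoded output moves toward a fixed text embedding, while the generator and the encoder stay frozen. It is a toy under stated assumptions, not the authors' implementation: the tiny matrices DECODER and ENCODER stand in for VQGAN's decoder and CLIP's image encoder, TEXT_EMB stands in for a CLIP text embedding, and the name guided_generation and all numeric values are hypothetical.

```python
import math

# Hypothetical stand-ins (assumptions, not the real models):
# DECODER plays the role of a frozen image generator (latent -> "image"),
# ENCODER plays the role of a frozen image encoder ("image" -> embedding),
# TEXT_EMB plays the role of the embedding of a text prompt.
DECODER = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2 matrix
ENCODER = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 2x3 matrix
TEXT_EMB = [0.0, 1.0]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in vector))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def guided_generation(decoder, encoder, text_emb, latent, steps=200, lr=0.1):
    """Gradient ascent on cosine similarity with respect to the latent only.

    The decoder and encoder weights are never updated, which mirrors the
    "no training" property described in the abstract.
    """
    decoder_t, encoder_t = transpose(decoder), transpose(encoder)
    history = []
    for _ in range(steps):
        image = matvec(decoder, latent)   # latent -> "image"
        emb = matvec(encoder, image)      # "image" -> embedding
        sim = cosine(emb, text_emb)
        history.append(sim)
        n_e, n_t = norm(emb), norm(text_emb)
        # Analytic gradient of cosine similarity with respect to emb.
        grad_emb = [t / (n_e * n_t) - sim * e / (n_e ** 2)
                    for e, t in zip(emb, text_emb)]
        # Chain rule back through the two frozen linear maps.
        grad_image = matvec(encoder_t, grad_emb)
        grad_latent = matvec(decoder_t, grad_image)
        latent = [z + lr * g for z, g in zip(latent, grad_latent)]
    return latent, history


latent, history = guided_generation(DECODER, ENCODER, TEXT_EMB, [1.0, 0.2])
print(f"similarity: {history[0]:.3f} -> {history[-1]:.3f}")
```

With real networks the gradient would typically come from automatic differentiation rather than a hand-written chain rule; the manual version here only keeps the example free of dependencies.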

PUBLICATION RECORD

Publication year: 2022
Venue: European Conference on Computer Vision
Publication date: 2022-04-18
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2204.08583; arXiv 2204.08583
Source metadata: Semantic Scholar

REFERENCES

Music2Video: Automatic Generation of Music Video with fusion of audio and text
2022, cited by this paper
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
2022, cited by this paper
FlexIT: Towards Flexible Semantic Image Translation
2022, cited by this paper
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
2022, cited by this paper
The Dawn of the Human-Machine Era: A forecast of new and emerging language technologies
2021, cited by this paper
Multimodal Few-Shot Learning with Frozen Language Models
2021, cited by this paper
The Values Encoded in Machine Learning Research
2021, cited by this paper
Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts
2021, cited by this paper
AffectGAN: Affect-Based Generative Art Driven by Semantics
2021, cited by this paper
Words to Matter: De novo Architected Materials Design Using Transformer Neural Networks
2021, cited by this paper
Fast Model Editing at Scale
2021, cited by this paper
Wav2CLIP: Learning Robust Audio Representations from CLIP
2021, cited by this paper
WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model
2021, cited by this paper
Telling Creative Stories Using Generative Visual Aids
2021, cited by this paper
Merging Models with Fisher-Weighted Averaging
2021, cited by this paper
Blended Diffusion for Text-driven Editing of Natural Images
2021, cited by this paper
Vector Quantized Diffusion Model for Text-to-Image Synthesis
2021, cited by this paper
CLIPstyler: Image Style Transfer with a Single Text Condition
2021, cited by this paper
FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization
2021, influential reference
Text2Mesh: Text-Driven Neural Stylization for Meshes
2021, cited by this paper
MAGMA - Multimodal Augmentation of Generative Models through Adapter-based Finetuning
2021, cited by this paper
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
2021, influential reference
diffvg+CLIP: Generating Painting Trajectories from Text
2021, influential reference
Learning Transferable Visual Models From Natural Language Supervision
2021, cited by this paper
Insiders and Outsiders in Research on Machine Learning and Society
2021, cited by this paper
Zero-Shot Text-to-Image Generation
2021, influential reference
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
2021, cited by this paper
Editing Factual Knowledge in Language Models
2021, cited by this paper
The Power of Scale for Parameter-Efficient Prompt Tuning
2021, cited by this paper
CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
2021, influential reference
Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
2020, influential reference
MUSE: Textual Attributes Guided Portrait Painting Generation
2020, cited by this paper
Taming Transformers for High-Resolution Image Synthesis
2020, cited by this paper
SESAME: Semantic Editing of Scenes by Adding, Manipulating or Erasing Objects
2020, cited by this paper
The Cost of Training NLP Models: A Concise Overview
2020, cited by this paper
Rewriting a Deep Generative Model
2020, cited by this paper
Semantic Pyramid for Image Generation
2020, cited by this paper
Parameter-Efficient Transfer Learning for NLP
2019, cited by this paper
Kornia: an Open Source Differentiable Computer Vision Library for PyTorch
2019, cited by this paper
ManiGAN: Text-Guided Image Manipulation
2019, cited by this paper
Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language
2018, cited by this paper
Semantic Image Synthesis via Adversarial Learning
2017, cited by this paper
Neural Discrete Representation Learning
2017, cited by this paper
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
2016, cited by this paper
Adam: A Method for Stochastic Optimization
2014, cited by this paper
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
2013, cited by this paper
ImageNet: A large-scale hierarchical image database
2009, cited by this paper
Understanding Neural Networks
1980, cited by this paper

CITED BY

A Difference-in-Difference Approach to Detecting AI-Generated Images
2026, cites this paper
Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing
2026, cites this paper
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
2026, cites this paper
Quality Evaluation of AI-Generated Images: Subjective Study and Objective Methodology
2026, cites this paper
A review of instruction-guided image editing
2026, cites this paper
FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation
2026, cites this paper
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
2026, cites this paper
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
2026, cites this paper
Generate individual spatiotemporal activity sequences from population synthesis via deep learning approaches
2026, cites this paper
Improving Zero-Shot Generalization for CLIP With Prompt Ensemble Self-Distillation
2026, cites this paper
Draw What You Hear: High-Fidelity Image Generation and Manipulation via SoundAdapter
2025, cites this paper
Attribute-Enhanced Fine Tuning for Subject-Driven Generation
2025, cites this paper
A multi-task joint learning-based algorithm for small object detection in maritime UAV imagery under adverse weather conditions
2025, cites this paper
Cultural Bias in Text-to-Image Models: A Systematic Review of Bias Identification, Evaluation, and Mitigation Strategies
2025, cites this paper
SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
2025, cites this paper
Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
2025, cites this paper
Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement
2025, cites this paper
Concentration bounds on response-based vector embeddings of black-box generative models
2025, cites this paper
Automating the Search for Artificial Life With Foundation Models
2025, cites this paper
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
2025, cites this paper
Exploring the Potential and Challenges of Generative AI in Assistive Technology Design and Fabrication: Insights from Occupational Therapists
2025, cites this paper
Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
2025, cites this paper
Video Deepfake Abuse: How Company Choices Predictably Shape Misuse Patterns
2025, cites this paper
AutoConcept: Unsupervised Extraction of Constituent Concepts from Single Image
2025, cites this paper
An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity
2025, cites this paper
Culturally Grounded Text-to-Image Gen-AI: LoRa Fine-Tuned Stable Diffusion Models to Generate Cartoon Characters of San Learners
2025, cites this paper
Design of personalized creation model for cultural and creative products based on evolutionary adaptive network
2025, cites this paper
Artistic turing test: The challenge of differentiating human and AI-generated art
2025, cites this paper
Model Diagnosis and Correction via Linguistic and Implicit Attribute Editing
2025, cites this paper
Geodesic feature augmentation for zero-shot text-guided diffusion style transfer
2025, cites this paper
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
2025, cites this paper
Highly Compressed Tokenizer Can Generate Without Training
2025, cites this paper
EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation
2025, cites this paper
Evaluating Large Language Models: Challenges, Limitations, and Future Directions
2025, cites this paper
CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion
2025, cites this paper
Mixture of prompts learning for vision-language models
2025, cites this paper
MLKD-CLIP: Multi-layer Feature Knowledge Distillation of CLIP for Open-vocabulary Action Recognition
2025, cites this paper
Perceptual image compression with textual side information
2025, cites this paper
Style Transfer: A Decade Survey
2025, cites this paper
Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
2025, cites this paper
SATURN: Autoregressive Image Generation Guided by Scene Graphs
2025, cites this paper
Fast Aesthetic Image Generation for Paper-Cutting Style
2025, cites this paper
Definition of the architectural style metric: An approach to quantitative analysis of design using language-image model
2025, cites this paper
A Survey on Proactive Deepfake Defense: Disruption and Watermarking
2025, cites this paper
Generative Semantic Probing for Vision-Language Models via Hierarchical Feature Optimization
2025, influential citation
MPPR: Memory-Prior-based Prompt Refinement in Continuous Space for Advanced Text-to-Image Generation
2025, cites this paper
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
2025, cites this paper
TG-TSGNet: A Text-Guided Arbitrary-Resolution Terrain Scene Generation Network
2025, cites this paper
InstructVEdit: A Holistic Approach for Instructional Video Editing
2025, cites this paper
TGDrag: Adding Semantic Control into Point-based Image Editing via Text Guidance
2025, cites this paper
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
2025, cites this paper
InterLCM: Low-Quality Images as Intermediate States of Latent Consistency Models for Effective Blind Face Restoration
2025, cites this paper
UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation
2025, cites this paper
Learning to Generalize without Bias for Open-Vocabulary Action Recognition
2025, cites this paper
DA2Diff: Exploring Degradation-aware Adaptive Diffusion Priors for All-in-One Weather Restoration
2025, cites this paper
Exploring AI-Fabrication in Shaping the Future of DIY-AT Design: Insights from Makers
2025, cites this paper
DisenStyler: Text-driven fast image stylization using content disentanglement and style adaptive matching
2025, cites this paper
PromptMap: Supporting Exploratory Text-to-Image Generation
2025, cites this paper
TrafficCLIP: A lightweight cross-modal framework for network traffic classification
2025, cites this paper
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
2025, influential citation
FROM SLICES TO SPACES: Design ideation on architectural models through AI-generated image sequences
2025, cites this paper
Moodifier: MLLM-Enhanced Emotion-Driven Image Editing
2025, cites this paper
RAPID: Retrieval and Predictability for Improved Stable Diffusion
2025, cites this paper
Religious Bias Landscape in Language and Text-to-Image Models: Analysis, Detection, and Debiasing Strategies
2025, cites this paper
Text2Avatar: Articulated 3D Avatar Creation With Text Instructions
2025, cites this paper
ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition
2025, cites this paper
Text-Guided Editable 3D City Scene Generation
2025, cites this paper
DiffDesign: A diffusion model using garment Knowledge-Enhanced for Fashion Design Synthesis
2025, cites this paper
Diffuse Your Data Blues: Augmenting Low-Resource Datasets via User-Assisted Diffusion
2025, cites this paper
Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
2025, cites this paper
Advanced Product Personalization in Blockchain-Enabled Metaverse: A Diffusion Model for Automatic Style Generation
2025, cites this paper
Integrating Speech-to-Text for Image Generation Using Generative Adversarial Networks
2025, influential citation
StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
2025, cites this paper
CTD-inpainting: Towards the Coherence of Text-driven Inpainting with Blended Diffusion
2025, cites this paper
Perpetuating Misogyny with Generative AI: How Model Personalization Normalizes Gendered Harm
2025, cites this paper
Learning Graph Representation of Agent Diffusers
2025, influential citation
Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions
2025, cites this paper
Rethinking Image Generation From Scene Graphs With Attention Mechanism
2024, cites this paper
Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
2024, cites this paper
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
2024, cites this paper
Mixture of Prompt Learning for Vision Language Models
2024, cites this paper
360PanT: Training-Free Text-Driven 360-Degree Panorama-to-Panorama Translation
2024, cites this paper
InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models
2024, cites this paper
DAP-LED: Learning Degradation-Aware Priors with Clip for Joint Low-Light Enhancement and Deblurring
2024, cites this paper
A survey of multimodal composite editing and retrieval
2024, cites this paper
Text-to-Image Generation Via Energy-Based CLIP
2024, cites this paper
VISA: Video Interactive Search with Advanced Visual Programming
2024, cites this paper
Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models
2024, cites this paper
FAGStyle: Feature Augmentation on Geodesic Surface for Zero-shot Text-guided Diffusion Image Style Transfer
2024, cites this paper
CLIP-Flow: Decoding images encoded in CLIP space
2024, cites this paper
CDM: Text-Driven Image Editing with Composable Diffusion Models
2024, cites this paper
Lightweight dual-path octave generative adversarial networks for few-shot image generation
2024, cites this paper
FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation
2024, cites this paper
Few-shot Defect Image Generation based on Consistency Modeling
2024, cites this paper
Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI
2024, cites this paper
Text2LiDAR: Text-guided LiDAR Point Cloud Generation via Equirectangular Transformer
2024, cites this paper
Diffusion Feedback Helps CLIP See Better
2024, cites this paper
Ownership Authentication and Integrity Verification of Digital Images Using Generative Models and Custom Signature
2024, influential citation
Understanding Fashion Designers’ Behavior Using Generative AI for Early-Stage Concept Ideation and Revision
2024, cites this paper
An Initial Exploration of Employing Large Multimodal Models in Defending Against Autonomous Vehicles Attacks
2024, cites this paper