Modulating early visual processing by language

H. de Vries, Florian Strub, Jérémie Mary, H. Larochelle, O. Pietquin, Aaron C. Courville

Published 2017 in Neural Information Processing Systems

ABSTRACT

It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet (MODERN), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial.
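
The conditioning step described in the abstract can be illustrated with a minimal sketch. It assumes that the language embedding is mapped by a single linear layer to per-channel offsets that are added to the pretrained batch normalization scale and shift; that parameterization, the function names (predict_deltas, conditional_batch_norm), and the toy values are illustrative assumptions, not details taken from this record. Spatial dimensions are omitted, so each example is just a vector of channel activations.

```python
import math


def predict_deltas(embedding, w_gamma, w_beta):
    """Hypothetical linear predictor: language embedding -> per-channel offsets."""
    d_gamma = [sum(w * e for w, e in zip(row, embedding)) for row in w_gamma]
    d_beta = [sum(w * e for w, e in zip(row, embedding)) for row in w_beta]
    return d_gamma, d_beta


def conditional_batch_norm(features, embeddings, gamma, beta, w_gamma, w_beta, eps=1e-5):
    """Batch-normalize features ([batch][channels]), then scale and shift each
    example with the pretrained gamma/beta plus offsets predicted from that
    example's language embedding (assumed parameterization)."""
    batch, channels = len(features), len(gamma)
    # Per-channel statistics computed over the batch, as in ordinary batch norm.
    means = [sum(x[c] for x in features) / batch for c in range(channels)]
    variances = [sum((x[c] - means[c]) ** 2 for x in features) / batch
                 for c in range(channels)]
    out = []
    for x, emb in zip(features, embeddings):
        d_gamma, d_beta = predict_deltas(emb, w_gamma, w_beta)
        out.append([
            (gamma[c] + d_gamma[c]) * (x[c] - means[c]) / math.sqrt(variances[c] + eps)
            + beta[c] + d_beta[c]
            for c in range(channels)
        ])
    return out


# Toy data: 2 examples, 2 channels, 2-dimensional language embeddings.
features = [[1.0, 2.0], [3.0, 6.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]   # stand-ins for pretrained parameters
zeros = [[0.0, 0.0], [0.0, 0.0]]       # zero predictor weights: no modulation

plain = conditional_batch_norm(features, embeddings, gamma, beta, zeros, zeros)

# A nonzero predictor shifts channel 0 only for examples whose embedding
# activates the first embedding dimension.
w_beta = [[0.5, 0.0], [0.0, 0.0]]
modulated = conditional_batch_norm(features, embeddings, gamma, beta, zeros, w_beta)
```

With zero predictor weights the layer reduces to ordinary batch normalization, so in this sketch the pretrained network's behavior is the starting point and the language input only perturbs it.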

PUBLICATION RECORD

Publication year
2017
Venue
Neural Information Processing Systems
Publication date
2017-07-02
Fields of study
Linguistics, Computer Science
Identifiers
arXiv 1707.00683
Source metadata
Semantic Scholar

CITED BY

Visual-to-Tactile Cross-Modal Generation Using a Class-Conditional GAN with Multi-Scale Discriminator and Hybrid Loss
2026cites this paper
Normalized clipping: A privacy-enhanced method in differentially private GANs
2026cites this paper
Self-supervised AI for decoding and designing disordered metamaterials
2026cites this paper
Time–Frequency Respiratory Impedance Maps Enable Within-Breath Deep Learning for Small Airway Dysfunction Identification
2026cites this paper
Multimodal Brownian bridge diffusion model for controllable synthetic medical image generation
2026cites this paper
State-wise linear modulation (SLim): A novel approach for steering large language models.
2026cites this paper
Decoupling Return-to-Go for Efficient Decision Transformer
2026cites this paper
A review of instruction-guided image editing
2026cites this paper
High-Resolution Range Profile Classifiers Require Aspect-Angle Awareness
2026cites this paper
FiLM-DiffRec: Lightweight Feature-wise Modulation for Enhanced Timestep Conditioning in Diffusion Recommender Systems
2026cites this paper
Patient-Conditioned Adaptive Offsets for Reliable Diagnosis across Subgroups
2026cites this paper
LAMSNN: Learnable adaptive modulation for artifact suppression in spiking underwater image enhancement networks
2025cites this paper
DiffCNBP: Lightweight Diffusion Model for IoMT-Based Continuous Cuffless Blood Pressure Waveform Monitoring Using PPG
2025cites this paper
A Text-Guided Query Adaptive Vision Transformer for Pansharpening
2025cites this paper
Towards biologically plausible DNN optimization: Replacing backpropagation and loss functions with a top-down credit assignment network
2025cites this paper
Hadamard Product in Deep Learning: Introduction, Advances and Challenges
2025cites this paper
Edge-aware baselines for ogbn-proteins in PyTorch Geometric: species-wise normalization, post-hoc calibration, and cost-accuracy trade-offs
2025cites this paper
Multimodal sarcasm detection based on sentiment-clue inconsistency global detection fusion network
2025cites this paper
Learn depth space from light field via a distance-constraint query mechanism
2025cites this paper
Modality Plug-and-Play: Runtime Modality Adaptation in LLM-Driven Autonomous Mobile Systems
2025cites this paper
FiLM-SimVP: Scalable Uncertainty Quantification in Spatiotemporal Forecasting
2025cites this paper
Diffusion Models vs. DCGANs for Class-Imbalanced Lung Cancer CT Classification: A Comparative Study
2025cites this paper
Towards Depth-Continuous Scene Representation With a Displacement Field for Robust Light Field Depth Estimation
2025cites this paper
Challenges and Limitations of Generative AI in Synthesizing Wearable Sensor Data
2025cites this paper
VLCIM: A Vision-Language Cyclic Interaction Model for Industrial Defect Detection
2025cites this paper
Stacked deep fusion GAN for enhanced text-to-image generation
2025cites this paper
Semantic Anchors Facilitate Task Encoding in Continual Learning
2025cites this paper
Plain Transformers Can be Powerful Graph Learners
2025cites this paper
CLIP-GAN: Stacking CLIPs and GAN for Efficient and Controllable Text-to-Image Synthesis
2025cites this paper
Ara-RATGAN for Arabic Text to Image Synthesis
2025cites this paper
EMF-GAN: Efficient Multilayer Fusion GAN for text-to-image synthesis
2025cites this paper
HingeRLC-GAN: Combatting Mode Collapse with Hinge Loss and RLC Regularization
2025cites this paper
Adversarial Learning for Text Image Semantic Consistency Using Deep Fusion (DF-GAN)
2025cites this paper
Transformer-based short-term memory attention for enhanced multimodal sentiment analysis
2025cites this paper
Parameter-Efficient Fine-Tuning With Frequency Adapter for Enhanced Sea–Land Segmentation
2025cites this paper
Simplifying Graph Transformers
2025cites this paper
FuncGNN: Learning Functional Semantics of Logic Circuits with Graph Neural Networks
2025cites this paper
Walking the Weight Manifold: a Topological Approach to Conditioning Inspired by Neuromodulation
2025cites this paper
IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning
2025cites this paper
Legilimens: Performant Video Analytics on the System-on-Chip Edge
2025cites this paper
Using Test-Time Data Augmentation for Cross-Domain Atrial Fibrillation Detection from ECG Signals
2025cites this paper
Enhancing Text-to-Image Synthesis with Higher Fusion Powers in Deep Fusion GAN
2024cites this paper
High-Resolution Remote Sensing Image Segmentation With Global-Guided Normalization and Local Affinity Distillation
2024cites this paper
Utilizing data imbalance to enhance compound-protein interaction prediction models
2024cites this paper
PVContext: Hybrid Context Model for Point Cloud Compression
2024cites this paper
IdenBAT: Disentangled Representation Learning for Identity-Preserved Brain Age Transformation
2024cites this paper
Batch Normalization Alleviates the Spectral Bias in Coordinate Networks
2024cites this paper
ART-InvRec: An adversarial framework for rotation-invariant 3D object reconstruction
2024cites this paper
Predicting Late Gadolinium Enhancement of Acute Myocardial Infarction in Contrast-Free Cardiac Cine MRI Using Deep Generative Learning
2024cites this paper
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
2024cites this paper
GANs Conditioning Methods: A Survey
2024cites this paper
Towards the Spectral bias Alleviation by Normalizations in Coordinate Networks
2024cites this paper
ETBHD‐HMF: A Hierarchical Multimodal Fusion Architecture for Enhanced Text‐Based Hair Design
2024cites this paper
View-Guided Cost Volume for Light Field Arbitrary-View Disparity Estimation
2024cites this paper
Language-vision matching for text-to-image synthesis with context-aware GAN
2024cites this paper
Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images
2024cites this paper
Parameter Efficient Self-Supervised Geospatial Domain Adaptation
2024cites this paper
Subject Conditioning for Motor Imagery Using Attention Mechanism
2024cites this paper
Generative adversarial networks for handwriting image generation: a review
2024cites this paper
FlexLoc: Conditional Neural Networks for Zero-Shot Sensor Perspective Invariance in Object Localization with Distributed Multimodal Sensors
2024cites this paper
Meta-Learning Neural Procedural Biases
2024cites this paper
Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design
2024cites this paper
Class‐Relation Reasoning with Knowledge‐Transfer for Few‐Shot Object Detection
2024cites this paper
On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization
2024cites this paper
UniQRNet: Unifying Referring Expression Grounding and Segmentation with QRNet
2024cites this paper
A review of multimodal learning for text to images
2024cites this paper
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
2024cites this paper
Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
2024cites this paper
Learning emotional prompt features with multiple views for visual emotion analysis
2024cites this paper
I2DFormer+: Learning Image to Document Summary Attention for Zero-Shot Image Classification
2024cites this paper
Reimagining Anomalies: What If Anomalies Were Normal?
2024cites this paper
Flexible image denoising model with multi-layer conditional feature modulation
2024cites this paper
Diffusion Model Conditioning on Gaussian Mixture Model and Negative Gaussian Mixture Gradient
2024cites this paper
Learning feature alignment across attribute domains for improving facial beauty prediction
2024cites this paper
Syntactic-Semantic Graph Fusion Generative Adversarial Network: SSGF-GAN
2024cites this paper
Multi-Feature Non-Intrusive Load Monitoring Method Based on Conditional Batch Normalization Combined with ResNet-KAN
2024cites this paper
RII-GAN: Multi-scaled Aligning-Based Reversed Image Interaction Network for Text-to-Image Synthesis
2024cites this paper
Reconstructing Regularly Missing Seismic Traces With a Classifier-Guided Diffusion Model
2024cites this paper
ERF-FEM: The Feature Extraction Module based on Effective Receptive Field used for Image Inpainting
2024cites this paper
Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions
2024cites this paper
Human–machine co-creation: a complementary cognitive approach to creative character design process using GANs
2024cites this paper
X2V: 3D Organ Volume Reconstruction From a Planar X-Ray Image With Neural Implicit Methods
2024cites this paper
ESTGN: Enhanced Self-Mined Text Guided Super-Resolution Network for Superior Image Super Resolution
2024cites this paper
Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension
2024cites this paper
CViT: Continuous Vision Transformer for Operator Learning
2024cites this paper
GarmentDreamer: 3DGS Guided Garment Synthesis with Diverse Geometry and Texture Details
2024cites this paper
HyperCLIP: Adapting Vision-Language models with Hypernetworks
2024cites this paper
Writer adaptation for offline text recognition: An exploration of neural network-based methods
2023cites this paper
AIGC for Various Data Modalities: A Survey
2023cites this paper
Learning Dense Correspondences between Photos and Sketches
2023cites this paper
Surf-D: High-Quality Surface Generation for Arbitrary Topologies using Diffusion Models
2023cites this paper
One-shot generative distribution matching for augmented RF-based UAV identification
2023cites this paper
Feature Attention as a Control Mechanism for the Balance of Speed and Accuracy in Visual Search
2023cites this paper
Closing the Gap Between Theory and Practice During Alternating Optimization for GANs
2023cites this paper
Virtual high-resolution MR angiography from non-angiographic multi-contrast MRIs: synthetic vascular model populations for in-silico trials
2023cites this paper
Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
2023cites this paper
Fundus Image-Label Pairs Synthesis and Retinopathy Screening via GANs With Class-Imbalanced Semi-Supervised Learning
2023cites this paper
Continuous conditional generative adversarial networks for data-driven modelling of geologic CO₂ storage and plu
2023cites this paper
Deep Video Codec Control
2023cites this paper
Leveraging Bioclimatic Context for Supervised and Self-supervised Land Cover Classification
2023influential citation