Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, S. Ullman
Published 2015 in Conference on Empirical Methods in Natural Language Processing
ABSTRACT
Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene that depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic, and discourse ambiguities, coupled with videos that visualize the different interpretations of each sentence. We address this task by extending a vision model that determines whether a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing us to disambiguate sentences in a unified fashion across the different ambiguity types.
PUBLICATION RECORD
- Publication year: 2015
- Venue: Conference on Empirical Methods in Natural Language Processing
- Publication date: 2015-09-01
- Fields of study: Linguistics, Computer Science
- Source metadata: Semantic Scholar