Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings

Spandana Gella, Mirella Lapata, Frank Keller

Published 2016 in North American Chapter of the Association for Computational Linguistics

ABSTRACT

We introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce VerSe, a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, or multimodal embeddings. We find that textual embeddings perform well when gold-standard textual annotations (object labels and image descriptions) are available, while multimodal embeddings perform well on unannotated images. We also verify our findings by using the textual and multimodal embeddings as features in a supervised setting and analyse performance on the visual sense disambiguation task. VerSe is publicly available for download.
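
The Lesk-style idea in the abstract can be illustrated roughly as follows: embed the image and each candidate verb sense in a shared vector space, then pick the sense most similar to the image. This is only a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not name a similarity measure, so cosine similarity is assumed here, and the function names `cosine` and `disambiguate`, the sense labels, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(image_vec, sense_vecs):
    """Return the sense label whose embedding is most similar to the image embedding.

    image_vec  -- embedding of the image (textual, visual, or multimodal)
    sense_vecs -- dict mapping a sense label to its embedding
    """
    if not sense_vecs:
        raise ValueError("at least one candidate sense is required")
    return max(sense_vecs, key=lambda label: cosine(image_vec, sense_vecs[label]))


# Toy, made-up vectors for two hypothetical senses of the verb "play".
senses = {
    "play_instrument": [0.9, 0.1, 0.0],
    "play_sport": [0.1, 0.8, 0.3],
}
image = [0.2, 0.7, 0.4]
best = disambiguate(image, senses)
```

In this toy example the image vector lies closest to the hypothetical `play_sport` sense, so that label is selected. In practice the vectors would come from whichever textual, visual, or multimodal embedding model is being used.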

PUBLICATION RECORD

Publication year: 2016
Venue: North American Chapter of the Association for Computational Linguistics
Publication date: 2016-03-30
Fields of study: Computer Science
Identifiers: DOI 10.18653/v1/N16-1022; arXiv 1603.09188
Source metadata: Semantic Scholar

REFERENCES

AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes (2015, influential reference)
Describing Common Human Visual Actions in Images (2015, cited by this paper)
Sense discovery via co-clustering on images and text (2015, influential reference)
HICO: A Benchmark for Recognizing Human-Object Interactions in Images (2015, influential reference)
Ontologically Grounded Multi-sense Representation Learning for Semantic Vector Space Models (2015, cited by this paper)
VQA: Visual Question Answering (2015, cited by this paper)
Deep correlation for matching images and text (2015, cited by this paper)
On Deep Multi-View Representation Learning (2015, influential reference)
Very Deep Convolutional Networks for Large-Scale Image Recognition (2014, cited by this paper)
Show and tell: A neural image caption generator (2014, cited by this paper)
The Pascal Visual Object Classes Challenge: A Retrospective (2014, cited by this paper)
Caffe: Convolutional Architecture for Fast Feature Embedding (2014, cited by this paper)
GloVe: Global Vectors for Word Representation (2014, cited by this paper)
Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections (2014, cited by this paper)
Deep visual-semantic alignments for generating image descriptions (2014, cited by this paper)
From captions to visual concepts and back (2014, cited by this paper)
TUHOI: Trento Universal Human Object Interaction Dataset (2014, cited by this paper)
Efficient Estimation of Word Representations in Vector Space (2013, cited by this paper)
Deep Canonical Correlation Analysis (2013, influential reference)
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (2013, cited by this paper)
Exploiting language models to recognize unseen actions (2013, cited by this paper)
Human action recognition by learning bases of action attributes and parts (2011, cited by this paper)
It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text (2010, cited by this paper)
Grouplet: A structured image representation for recognizing human and object interactions (2010, cited by this paper)
Word sense disambiguation: A survey (2009, cited by this paper)
ImageNet: A large-scale hierarchical image database (2009, influential reference)
Good Neighbors Make Good Senses: Exploiting Distributional Similarity for Unsupervised WSD (2008, cited by this paper)
Unsupervised Learning of Visual Sense Models for Polysemous Words (2008, cited by this paper)
OntoNotes: The 90% Solution (2006, influential reference)
Discriminating Image Senses by Clustering with Multimodal Features (2006, cited by this paper)
Finding Predominant Word Senses in Untagged Text (2004, cited by this paper)
Canonical Correlation Analysis: An Overview with Application to Learning Methods (2004, cited by this paper)
Word Sense Disambiguation with Pictures (2003, cited by this paper)
Proceedings of the (1999, cited by this paper)
SENSEVAL: an exercise in evaluating word sense disambiguation programs (1998, cited by this paper)
Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity (1997, cited by this paper)
English Verb Classes and Alternations: A Preliminary Investigation (1993, cited by this paper)
Introduction to WordNet: An On-line Lexical Database (1990, cited by this paper)
Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone (1986, cited by this paper)

CITED BY

Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation (2026, influential citation)
Framed Multi30K: A Frame-Based Multimodal-Multilingual Dataset (2024, influential citation)
HiFormer: Hierarchical transformer for grounded situation recognition (2024, cites this paper)
PolCLIP: A Unified Image-Text Word Sense Disambiguation Model via Generating Multimodal Complementary Representations (2024, cites this paper)
SUT at SemEval-2023 Task 1: Prompt Generation for Visual Word Sense Disambiguation (2023, cites this paper)
Rutgers Multimedia Image Processing Lab at SemEval-2023 Task-1: Text-Augmentation-based Approach for Visual Word Sense Disambiguation (2023, cites this paper)
From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding (2023, influential citation)
Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information (2023, cites this paper)
Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories (2023, cites this paper)
HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance (2023, cites this paper)
GPL at SemEval-2023 Task 1: WordNet and CLIP to Disambiguate Images (2023, cites this paper)
Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects (2022, cites this paper)
An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition (2022, cites this paper)
The Case for Perspective in Multimodal Datasets (2022, cites this paper)
Predicting emergent linguistic compositions through time: Syntactic frame extension via multimodal chaining (2021, cites this paper)
MultiSubs: A Large-scale Multimodal and Multilingual Dataset (2021, cites this paper)
Grounded language interpretation of robotic commands through structured learning (2020, cites this paper)
Generating need-adapted multimodal fragments (2020, cites this paper)
Grounded Situation Recognition (2020, cites this paper)
Activities of Daily Living Monitoring via a Wearable Camera: Toward Real-World Applications (2020, cites this paper)
Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts (2020, cites this paper)
Transductive Visual Verb Sense Disambiguation (2020, cites this paper)
Uni- and Multimodal and Structured Representations for Modeling Frame Semantics (2019, cites this paper)
Learning Visual Actions Using Multiple Verb-Only Labels (2019, cites this paper)
Visual context for verb sense disambiguation and multilingual representation learning (2019, influential citation)
Report of 2017 NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps (2019, cites this paper)
Cross-lingual Visual Verb Sense Disambiguation (2019, cites this paper)
Dynamically Visual Disambiguation of Keyword-based Image Search (2019, influential citation)
Grounded Word Sense Translation (2019, cites this paper)
Zero-Shot Video Retrieval from a Query Phrase Including Multiple Concepts —Efforts and Challenges in TRECVID AVS Task— (2018, cites this paper)
Visual Choice of Plausible Alternatives: An Evaluation of Image-based Commonsense Causal Reasoning (2018, cites this paper)
Multimodal Lexical Translation (2018, cites this paper)
Bridging Languages through Images with Deep Partial Canonical Correlation Analysis (2018, cites this paper)
Visual Relationship Detection With Deep Structural Ranking (2018, cites this paper)
From image to language and back again (2018, cites this paper)
Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices (2018, cites this paper)
Towards an Unequivocal Representation of Actions (2018, cites this paper)
Action Categorisation in Multimodal Instructions (2018, cites this paper)
Learning Language-Independent Representations of Verbs and Adjectives from Multimodal Retrieval (2018, cites this paper)
An Evaluation of Image-Based Verb Prediction Models against Human Eye-Tracking Data (2018, cites this paper)
Estudio de métodos semisupervisados para la desambiguación de sentidos verbales del español (2018, cites this paper)
Multimodal Frame Identification with Multilingual Evaluation (2018, cites this paper)
An Analysis of Action Recognition Datasets for Language and Vision Tasks (2017, influential citation)
Resolving vision and language ambiguities together: Joint segmentation & prepositional attachment resolution in captioned scenes (2017, cites this paper)
A Survey of Machine Learning for Big Code and Naturalness (2017, cites this paper)
Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description (2017, cites this paper)
Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings (2016, cites this paper)
Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes (2016, cites this paper)
The Development of Multimodal Lexical Resources (2016, cites this paper)
Stacking With Auxiliary Features: Improved Ensembling for Natural Language and Vision (2016, cites this paper)
Annotation Methodologies for Vision and Language Dataset Creation (2016, cites this paper)
Learning Actions from Events Using Agent Motions (year unknown, cites this paper)