Deep Reinforcement Learning-Based Image Captioning with Embedding Reward

Zhou Ren,Xiaoyu Wang,Ning Zhang,Xutao Lv,Li-Jia Li

Published 2017 in Computer Vision and Pattern Recognition

ABSTRACT

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a policy network and a value network to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

PUBLICATION RECORD

Publication year
2017
Venue
Computer Vision and Pattern Recognition
Publication date
2017-04-12
Fields of study
Computer Science
Identifiers
DOI 10.1109/CVPR.2017.128 arXiv 1704.03899
External record
Open on Semantic Scholar
Source metadata
Semantic Scholar

CITATION MAP

EXTRACTION MAP

CLAIMS

No claims are published for this paper.

CONCEPTS

No concepts are published for this paper.

REFERENCES

Visual Question Answering
2017cited by this paper
Dense Captioning with Joint Inference and Visual Context
2016cited by this paper
Target-driven visual navigation in indoor scenes using deep reinforcement learning
2016cited by this paper
Automatic Annotation of Structured Facts in Images
2016cited by this paper
Modeling Context in Referring Expressions
2016cited by this paper
Mastering the game of Go with deep neural networks and tree search
2016cited by this paper
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
2016cited by this paper
Joint Image-Text Representation by Gaussian Visual-Semantic Embedding
2016cited by this paper
Image Captioning with Semantic Attention
2016influential reference
Asynchronous Methods for Deep Reinforcement Learning
2016cited by this paper
Face Detection with the Faster R-CNN
2016cited by this paper
Modeling Context Between Objects for Referring Expression Understanding
2016cited by this paper
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
2016cited by this paper
Microsoft COCO Captions: Data Collection and Evaluation Server
2015cited by this paper
Human-level control through deep reinforcement learning
2015cited by this paper
Simple Image Description Generator via a Linear Phrase-based Model
2015cited by this paper
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2015cited by this paper
Fast R-CNN
2015cited by this paper
Spatial Transformer Networks
2015cited by this paper
You Only Look Once: Unified, Real-Time Object Detection
2015cited by this paper
Guiding Long-Short Term Memory for Image Caption Generation
2015cited by this paper
Mind's eye: A recurrent visual representation for image caption generation
2015cited by this paper
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
2015cited by this paper
Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks
2015cited by this paper
RECURRENT NEURAL NETWORKS
2015influential reference
Aligning where to see and what to tell: image caption with region-based attention and scene factorization
2015cited by this paper
Guiding the Long-Short Term Memory Model for Image Caption Generation
2015cited by this paper
Instance-Aware Semantic Segmentation via Multi-task Network Cascades
2015cited by this paper
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
2015cited by this paper
Generation and Comprehension of Unambiguous Object Descriptions
2015cited by this paper
Deep Residual Learning for Image Recognition
2015cited by this paper
SSD: Single Shot MultiBox Detector
2015cited by this paper
Natural Language Object Retrieval
2015cited by this paper
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
2015cited by this paper
Jointly Modeling Embedding and Translation to Bridge Video and Language
2015cited by this paper
Multi-Instance Visual-Semantic Embedding
2015cited by this paper
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2015cited by this paper
Adam: A Method for Stochastic Optimization
2014cited by this paper
Show and tell: A neural image caption generator
2014influential reference
Fully convolutional networks for semantic segmentation
2014cited by this paper
Caffe: Convolutional Architecture for Fast Feature Embedding
2014cited by this paper
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
2014cited by this paper
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
2014cited by this paper
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
2014cited by this paper
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
2014cited by this paper
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014cited by this paper
Long-term recurrent convolutional networks for visual recognition and description
2014cited by this paper
Simple Image Description Generator via a Linear Phrase-Based Approach
2014cited by this paper
The Role of Context for Object Detection and Semantic Segmentation in the Wild
2014cited by this paper
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
2014influential reference
Deep visual-semantic alignments for generating image descriptions
2014influential reference
CIDEr: Consensus-based image description evaluation
2014cited by this paper
From captions to visual concepts and back
2014cited by this paper
Microsoft COCO: Common Objects in Context
2014cited by this paper
Very Deep Convolutional Networks for Large-Scale Image Recognition
2014influential reference
Going deeper with convolutions
2014cited by this paper
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
2013cited by this paper
Image Description using Visual Dependency Representations
2013cited by this paper
DeViSE: A Deep Visual-Semantic Embedding Model
2013cited by this paper
BabyTalk: Understanding and Generating Simple Image Descriptions
2013cited by this paper
Collective Generation of Natural Image Descriptions
2012cited by this paper
ImageNet classification with deep convolutional neural networks
2012cited by this paper
Corpus-Guided Sentence Generation of Natural Images
2011cited by this paper
Composing Simple Image Descriptions using Web-scale N-grams
2011cited by this paper
Generating Text with Recurrent Neural Networks
2011cited by this paper
Baby Talk: Understanding and Generating Image Descriptions
2011cited by this paper
Every Picture Tells a Story: Generating Sentences from Images
2010cited by this paper
Recurrent neural network based language model
2010cited by this paper
ImageNet: A large-scale hierarchical image database
2009cited by this paper
An empirical study of context in object detection
2009cited by this paper
The Meteor metric for automatic evaluation of machine translation
2009cited by this paper
Curriculum learning
2009cited by this paper
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
2005cited by this paper
ROUGE: A Package for Automatic Evaluation of Summaries
2004cited by this paper
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
2004influential reference
A neural probabilistic language model
2003cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002cited by this paper
Actor-Critic Algorithms
1999cited by this paper
Policy Gradient Methods for Reinforcement Learning with Function Approximation
1999cited by this paper
Gradient-based learning applied to document recognition
1998cited by this paper
Long Short-Term Memory
1997cited by this paper
Generalization of backpropagation with application to a recurrent gas market model
1988cited by this paper
Shifts in selective visual attention: towards the underlying neural circuitry.
1985cited by this paper

CITED BY

CLIP-RL: Closed-Loop Video Inpainting with Detection-Guided Reinforcement Learning
2026cites this paper
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
2026cites this paper
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
2026cites this paper
From Pixels to Words: A Comprehensive Study of CNN-RNN Models for Image Captioning
2025cites this paper
PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
2025cites this paper
MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories
2025cites this paper
EKCA-Cap: Improving dense captioning via external knowledge and context awareness
2025cites this paper
Optimizing CLAP Reward with LLM Feedback for Semantically Aligned and Diverse Automated Audio Captioning
2025cites this paper
Promoting cooperation in the voluntary prisoner's dilemma game via reinforcement learning.
2025cites this paper
Innovative Dual-Objective Framework for Image Captioning: Harmonizing Visual Analysis and Linguistic Precision
2025influential citation
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends, and Metrics Analysis
2024cites this paper
Fluent and Accurate Image Captioning with a Self-Trained Reward Model
2024cites this paper
From RAG to Riches: Retrieval Interlaced with Sequence Generation
2024cites this paper
Agentic AI: The Era of Semantic Decoding
2024cites this paper
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method
2024cites this paper
Explicit Image Caption Reasoning: Generating Accurate and Informative Captions for Complex Scenes with LMM
2024cites this paper
Self-Supervised Learning-Based Algorithm for Chinese Image Caption Generation in Electric Robot Inspection Processing
2024influential citation
Image Captioning Using Vision Encoder Decoder Model
2024cites this paper
Automated Image Annotation with Voice Synthesis Using Machine Learning
2024cites this paper
From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation
2024cites this paper
Consensus-Agent Deep Reinforcement Learning for Face Aging
2024cites this paper
Visual contextual relationship augmented transformer for image captioning
2024cites this paper
Combinatorial Analysis of Deep Learning and Machine Learning Video Captioning Studies: A Systematic Literature Review
2024cites this paper
A Survey on Large Language Models from Concept to Implementation
2024cites this paper
An Online Defense against Object-based LiDAR Attacks in Autonomous Driving
2024cites this paper
Language-Informed Beam Search Decoding for Multilingual Machine Translation
2024cites this paper
Video captioning – a survey
2024cites this paper
Reinforcement Learning for Generative AI: A Survey
2023cites this paper
Introduction of Virtual Environment with Learning Personalized Content Recommendation for Realistic Avatar Creation in Online Platform using Machine Learning
2023cites this paper
A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues
2023cites this paper
A comprehensive survey on deep-learning-based visual captioning
2023influential citation
Collaborative Multi-Agent Video Fast-Forwarding
2023cites this paper
Neural Image Caption Generation with Visual Attention : Enabling Image Accessibility for the Visually Impaired
2023cites this paper
Storytelling with Image Data: A Systematic Review and Comparative Analysis of Methods and Tools
2023cites this paper
Learning to Generate Better Than Your LLM
2023cites this paper
A Survey on Learning Objects' Relationship for Image Captioning
2023cites this paper
Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning
2023cites this paper
Multi-modal reward for visual relationships-based image captioning
2023influential citation
Context-Adaptive-Based Image Captioning by Bi-CARU
2023cites this paper
Supervised Deep Learning Techniques for Image Description: A Systematic Review
2023cites this paper
A Survey on Attention-Based Models for Image Captioning
2023cites this paper
A Review on Automatic Image Caption Generation for Various Deep Learning Approaches
2023cites this paper
Exploring Deep Learning Approaches for Video Captioning: A Comprehensive Review
2023cites this paper
Joint Embedding of Deep Visual and Semantic Features for Medical Image Report Generation
2023cites this paper
Intensive vision-guided network for radiology report generation
2023cites this paper
Exploring a Spectrum of Deep Learning Models for Automated Image Captioning: A Comprehensive Survey
2023cites this paper
Disentangled Representation Learning for Controllable Person Image Generation
2023cites this paper
Image Caption Generation Based on Image-Text Matching Schema in Deep Reinforcement Learning
2023influential citation
A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation
2023cites this paper
From Templates to Transformers: A survey of Multimodal Image Captioning Decoders
2023cites this paper
ℓ Gym: Natural Language Visual Reasoning with Reinforcement Learning
2022cites this paper
ALCAP: Alignment-Augmented Music Captioner
2022cites this paper
AFSN: Adaptive Frame Selection Network for Efficient Human Synthesis on Smartphones
2022cites this paper
RECAP: Retrieval Augmented Music Captioner
2022cites this paper
Parallel Pathway Dense Video Captioning With Deformable Transformer
2022cites this paper
Modeling graph-structured contexts for image captioning
2022cites this paper
CAVAN: Commonsense Knowledge Anchored Video Captioning
2022cites this paper
Choisir le bon co-équipier pour la génération coopérative de texte (Choosing The Right Teammate For Cooperative Text Generation)
2022cites this paper
Double-Stream Position Learning Transformer Network for Image Captioning
2022cites this paper
Semantic Communications for Future Internet: Fundamentals, Applications, and Challenges
2022cites this paper
Discrete space reinforcement learning algorithm based on twin support vector machine classification
2022cites this paper
Semantic Communications for 6G Future Internet: Fundamentals, Applications, and Challenges
2022cites this paper
Image captioning with residual swin transformer and Actor-Critic
2022cites this paper
Explicit Image Caption Editing
2022cites this paper
Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention
2022cites this paper
Query and Attention Augmentation for Knowledge-Based Explainable Reasoning
2022cites this paper
Efficient Modeling of Future Context for Image Captioning
2022cites this paper
Language Model Decoding as Likelihood–Utility Alignment
2022cites this paper
Are metrics measuring what they should? An evaluation of image captioning task metrics
2022cites this paper
Which Discriminator for Cooperative Text Generation?
2022cites this paper
Video Captioning Using Global-Local Representation
2022cites this paper
CADxReport: Chest x-ray report generation using co-attention mechanism and reinforcement learning
2022cites this paper
NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning
2022cites this paper
Learning cross-modality features for image caption generation
2022cites this paper
A Deep Neural Framework for Image Caption Generation Using GRU-Based Attention Mechanism
2022cites this paper
Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study
2022cites this paper
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
2022cites this paper
A Metaverse: Taxonomy, Components, Applications, and Open Challenges
2022cites this paper
Prophet Attention: Predicting Attention with Future Attention
2022cites this paper
Modeling of Hyperparameter Tuned Deep Learning Model for Automated Image Captioning
2022cites this paper
Belief Revision Based Caption Re-ranker with Visual Semantic Information
2022cites this paper
From Coarse to Fine: Hierarchical Structure-aware Video Summarization
2022cites this paper
Customer Experience towards the Product during a Coronavirus Outbreak
2022cites this paper
Policy and Value Deep RL for Temporal Language-Agnostic Street Image Captioning
2021influential citation
Weakly Supervised Reinforced Multi-Operator Image Retargeting
2021cites this paper
Image Captioning through Cognitive IOT and Machine-Learning Approaches
2021cites this paper
Deep Reinforcement Learning: A New Frontier in Computer Vision Research
2021cites this paper
GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval
2021cites this paper
Reinforcement Learning-powered Semantic Communication via Semantic Similarity
2021cites this paper
From Show to Tell: A Survey on Image Captioning
2021cites this paper
A Comparison of Domain-Specific and Open-Domain Image Caption Generation
2021cites this paper
A Survey on Image Captioning datasets and Evaluation Metrics
2021cites this paper
Image Captioning using Artificial Intelligence
2021cites this paper
A Survey on Image Encoders and Language Models for Image Captioning
2021cites this paper
Consensus Graph Representation Learning for Better Grounded Image Captioning
2021cites this paper
Past is important: Improved image captioning by looking back in time
2021cites this paper
Chinese description generation of dual attention images based on multi-modal fusion
2021cites this paper
A detailed review of prevailing image captioning methods using deep learning techniques
2021cites this paper
Generation and Evaluation of Hindi Image Captions of Visual Genome
2021cites this paper
Features to Text: A Comprehensive Survey of Deep Learning on Semantic Segmentation and Image Captioning
2021cites this paper