Vision-to-Voice: Enhanced CNN-LSTM-Based Image Captioning with Assistive Text-to-Speech

Arunya Paul, Tejaswini Kar, S. Pahadsingh, Alokita Paul, Shruti

Published 2025 in 2025 IEEE 2nd International Conference on Green Industrial Electronics and Sustainable Technologies (GIEST)

ABSTRACT

Image captioning is a multidisciplinary task that combines computer vision and natural language processing to generate descriptive text for images automatically. This paper presents an approach that uses Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks for enhanced image captioning, trained on the Flickr8k dataset, which provides diverse real-world images, each with multiple human-annotated captions. Our work differentiates itself by optimizing feature extraction and sequence generation to improve contextual accuracy and fluency. A CNN extracts the essential visual features, and an LSTM-based decoder processes them to generate coherent, meaningful captions while retaining contextual information. Additionally, we introduce an assistive text-to-voice feature that reads the generated captions aloud, making the system more accessible to visually impaired users. Experimental results demonstrate improved caption quality compared with existing approaches. The framework has broad applications, from assistive technologies to multimedia content enrichment, and further advances semantic understanding and human-computer interaction.
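The pipeline described in the abstract (image features, then word-by-word caption generation, then speech output) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the decoding strategy, so greedy decoding is used here, and `toy_encoder` and `toy_next_word_scores` are hypothetical stand-ins for the trained CNN and LSTM rather than the authors' models. The `speak` argument is any callable that accepts a string, so a text-to-speech engine could be plugged in.

```python
from typing import Callable, Dict, List, Sequence

START, END = "<start>", "<end>"

# A scoring function maps (image features, words so far) to next-word scores.
Scorer = Callable[[Sequence[float], List[str]], Dict[str, float]]


def greedy_caption(features: Sequence[float], next_word_scores: Scorer,
                   max_len: int = 20) -> str:
    """Build a caption word by word, always taking the top-scoring word."""
    words = [START]
    for _ in range(max_len):
        scores = next_word_scores(features, words)
        best = max(scores, key=scores.get)
        if best == END:  # the decoder signals that the caption is complete
            break
        words.append(best)
    return " ".join(words[1:])  # drop the start marker


def toy_encoder(image: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for a CNN: one feature, the mean pixel value."""
    flat = [pixel for row in image for pixel in row]
    return [sum(flat) / len(flat)]


def toy_next_word_scores(features: Sequence[float],
                         words: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for an LSTM step, conditioned on the features."""
    subject = "dog" if features[0] > 0.5 else "cat"
    table = {
        START: {"a": 1.0},
        "a": {subject: 1.0},
        subject: {"runs": 0.7, END: 0.3},
        "runs": {END: 1.0},
    }
    return table[words[-1]]


def describe_aloud(image: Sequence[Sequence[float]],
                   speak: Callable[[str], None]) -> str:
    """Caption an image, pass the caption to a speech callback, and return it."""
    caption = greedy_caption(toy_encoder(image), toy_next_word_scores)
    speak(caption)
    return caption


if __name__ == "__main__":
    # print stands in for a text-to-speech engine in this sketch.
    describe_aloud([[0.9, 0.8], [0.7, 0.6]], print)
```

In a real system the encoder would be a pretrained CNN, the scoring function would run one LSTM step over a learned vocabulary, and `speak` would wrap a speech synthesis engine. Beam search is a common alternative to the greedy choice shown here.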

PUBLICATION RECORD

Publication year
2025
Venue
2025 IEEE 2nd International Conference on Green Industrial Electronics and Sustainable Technologies (GIEST)
Publication date
2025-10-11
Fields of study
Not labeled
Identifiers
DOI 10.1109/GIEST66547.2025.11387720
Source metadata
Semantic Scholar

REFERENCES

Image caption generator using CNN & LSTM (2023)
Domain-Specific Image Caption Generator with Semantic Ontology (2020)
Towards Personalized Image Captioning via Multimodal Memory Networks (2019)
Automatic Image Captioning Using Convolution Neural Networks and LSTM (2019)
Image Captioning: Transforming Objects into Words (2019)
A survey on automatic image caption generation (2018)
Regularizing RNNs for Caption Generation by Reconstructing the Past with the Present (2018)
SemStyle: Learning to Generate Stylised Image Captions Using Unaligned Text (2018)
Convolutional Image Captioning (2017)
Deep learning in big data Analytics: A comparative study (2017)
Boosting Image Captioning with Attributes (2016)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
Show and tell: A neural image caption generator (2014)
Neural Machine Translation by Jointly Learning to Align and Translate (2014)
Novel neural modulators (2003)