An Effective Data Augmentation Method by Asking Questions about Scene Text Images

Published 2026 in Unknown venue

ABSTRACT

Scene text recognition (STR) and handwritten text recognition (HTR) both struggle to transcribe textual content from images accurately into machine-readable form. Conventional OCR models usually predict the transcription directly, which limits fine-grained reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions that probe character-level attributes such as presence, position, and frequency, with answers derived from the ground-truth text. These auxiliary tasks encourage finer-grained reasoning: the OCR model aligns visual features with textual queries and reasons jointly over images and questions. Experiments on the WordArt and Esposalles datasets show consistent improvements over baseline models, with significant reductions in both character error rate (CER) and word error rate (WER). Our code is publicly available at https://github.com/xuyaooo/DataAugOCR.
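
The question-generation step described in the abstract can be illustrated with a small sketch. The abstract names only the three attribute types (presence, position, frequency); the question templates, the function names `ordinal` and `generate_questions`, and the `absent_chars` parameter below are hypothetical choices made for illustration and are not taken from the paper or its repository.

```python
from collections import Counter
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def ordinal(n: int) -> str:
    """Return an English ordinal such as '1st', '2nd', '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def generate_questions(text: str, absent_chars: str = "qz") -> List[QAPair]:
    """Build character-level QA pairs from a ground-truth transcription.

    The templates are illustrative assumptions, not the paper's wording.
    `absent_chars` supplies candidate characters for negative presence
    questions; only those that do not occur in `text` are used.
    """
    counts = Counter(text)
    pairs: List[QAPair] = []

    # Presence: positive questions for characters that occur in the text.
    for ch in sorted(counts):
        pairs.append((f"Is the character '{ch}' present in the text?", "yes"))
    # Presence: negative questions for candidates missing from the text.
    for ch in absent_chars:
        if ch not in counts:
            pairs.append((f"Is the character '{ch}' present in the text?", "no"))

    # Position: ask for the character at each 1-based index.
    for i, ch in enumerate(text, start=1):
        pairs.append((f"What is the {ordinal(i)} character in the text?", ch))

    # Frequency: ask how often each distinct character occurs.
    for ch in sorted(counts):
        pairs.append(
            (f"How many times does the character '{ch}' appear in the text?",
             str(counts[ch]))
        )
    return pairs


if __name__ == "__main__":
    for question, answer in generate_questions("moon"):
        print(f"{question} -> {answer}")
```

Every answer here is computed from the ground-truth string alone, so in this sketch the extra supervision requires no additional annotation beyond the existing image-text pairs.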

PUBLICATION RECORD

Publication year: 2026
Venue: Unknown venue
Publication date: 2026-03-03
Fields of study: Computer Science
Identifiers: arXiv 2603.03580
Source metadata: Semantic Scholar

REFERENCES

Instruction-Guided Scene Text Recognition (2024)
Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition (2022)
BEiT: BERT Pre-Training of Image Transformers (2021)
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models (2021)
Data Augmentation for Scene Text Recognition (2021)
Towards VQA Models That Can Read (2019)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)
Randaugment: Practical automated data augmentation with a reduced search space (2019)
OCR-VQA: Visual Question Answering by Reading Text in Images (2019)
Data Augmentation for Recognition of Handwritten Words and Lines Using a CNN-LSTM Network (2017)
Attention is All you Need (2017)
Decoupled Weight Decay Regularization (2017, influential reference)
ICDAR2017 Competition on Information Extraction in Historical Handwritten Records (2017)
Automatic differentiation in PyTorch (2017)
CNN-N-Gram for Handwriting Word Recognition (2016)
VQA: Visual Question Answering (2015)
A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input (2014)
