Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks

Xiao Yang, Ersin Yumer, P. Asente, Mike Kraley, Daniel Kifer, C. Lee Giles

Published 2017 in Computer Vision and Pattern Recognition

ABSTRACT

We present an end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images. We consider document semantic structure extraction as a pixel-wise segmentation task, and propose a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text. Moreover, we propose an efficient synthetic document generation process that we use to generate pretraining data for our network. Once the network is trained on a large set of synthetic documents, we fine-tune the network on unlabeled real documents using a semi-supervised approach. We systematically study the optimum network architecture and show that both our multimodal approach and the synthetic data pretraining significantly boost the performance.

PUBLICATION RECORD

Publication year: 2017
Venue: Computer Vision and Pattern Recognition
Publication date: 2017-06-07
Fields of study: Computer Science
Identifiers: DOI 10.1109/CVPR.2017.462; arXiv 1706.02337
External record: Semantic Scholar
Source metadata: Semantic Scholar

REFERENCES

Synthesized Classifiers for Zero-Shot Learning
2016, cited by this paper
Dense prediction for text line segmentation in handwritten document images
2016, cited by this paper
A Discriminative Feature Learning Approach for Deep Face Recognition
2016, cited by this paper
Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification
2016, cited by this paper
Learning to Refine Object Segments
2016, cited by this paper
Enriching Word Vectors with Subword Information
2016, cited by this paper
Bag of Tricks for Efficient Text Classification
2016, cited by this paper
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
2015, cited by this paper
Learning Deconvolution Network for Semantic Segmentation
2015, cited by this paper
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
2015, cited by this paper
Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation
2015, cited by this paper
VQA: Visual Question Answering
2015, cited by this paper
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2015, cited by this paper
Weakly supervised semantic segmentation for social images
2015, cited by this paper
Stacked What-Where Auto-encoders
2015, cited by this paper
Page segmentation of historical document images with convolutional autoencoders
2015, cited by this paper
Multi-Scale Context Aggregation by Dilated Convolutions
2015, cited by this paper
Deep Residual Learning for Image Recognition
2015, cited by this paper
ICDAR2015 competition on recognition of documents with complex layouts - RDCL2015
2015, cited by this paper
Deep Unordered Composition Rivals Syntactic Methods for Text Classification
2015, cited by this paper
Deep visual-semantic alignments for generating image descriptions
2014, cited by this paper
Long-term recurrent convolutional networks for visual recognition and description
2014, cited by this paper
Fully convolutional networks for semantic segmentation
2014, cited by this paper
Learning from Weak and Noisy Labels for Semantic Segmentation
2014, cited by this paper
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
2014, cited by this paper
Going deeper with convolutions
2014, cited by this paper
Microsoft COCO: Common Objects in Context
2014, cited by this paper
Efficient Estimation of Word Representations in Vector Space
2013, cited by this paper
Distributed Representations of Words and Phrases and their Compositionality
2013, cited by this paper
DeViSE: A Deep Visual-Semantic Embedding Model
2013, cited by this paper
Document Image Applications
2013, cited by this paper
Zero-Shot Learning Through Cross-Modal Transfer
2013, cited by this paper
ADADELTA: An Adaptive Learning Rate Method
2012, cited by this paper
Document segmentation using Relative Location Features
2012, cited by this paper
Improved document image segmentation algorithm using multiresolution morphology
2011, cited by this paper
Logical Structure Recovery in Scholarly Articles with Rich Document Features
2010, cited by this paper
Extracting and composing robust features with denoising autoencoders
2008, cited by this paper
Et al
2008, cited by this paper
Supervised Dictionary Learning
2008, cited by this paper
Fast Document Segmentation Using Contour and X-Y Cut Technique
2007, cited by this paper
An Overview of the Tesseract OCR Engine
2007, cited by this paper
Learning nongenerative grammatical models for document analysis
2005, cited by this paper
The UvA color document dataset
2005, cited by this paper
Document structure analysis algorithms: a literature survey
2003, cited by this paper
Page Segmentation and Classification Utilizing Bottom-Up Approach
2001, cited by this paper
UW-ISL document image analysis toolbox: an experimental environment
1997, cited by this paper
A Fast Algorithm for Bottom-Up Document Layout Analysis
1997, cited by this paper
Document page decomposition by the bounding-box project
1995, cited by this paper
Support-Vector Networks
1995, cited by this paper
Page grammars and page parsing. A syntactic approach to document layout recognition
1993, cited by this paper
Syntactic Segmentation and Labeling of Digitized Pages from Technical Journals
1993, cited by this paper
A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents
1992, cited by this paper
A Realistic Dataset for Performance Evaluation of Document Layout Analysis (2009 10th International Conference on Document Analysis and Recognition)
year unknown, influential reference

CITED BY

Beyond Human Annotation: Recent Advances in Data Generation Methods for Document Intelligence
2026, cites this paper
A high-throughput ResNet CNN approach for automated grapevine leaf hair quantification
2025, cites this paper
Class-Agnostic Region-of-Interest Matching in Document Images
2025, cites this paper
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
2025, cites this paper
Topologically Consistent Prototype Network for Incomplete Multimodal Learning
2025, cites this paper
Multimodal Framework for PDF Structure with Heading, Table and Caption Tasks
2025, cites this paper
Boosting Document Image Translation via Layout-Aware Semantic Paragraph Clustering
2025, cites this paper
OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning
2025, cites this paper
FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
2025, cites this paper
STRAS: a semantic textual-cues leveraged Rule-based approach for article separation in historical newspapers
2025, cites this paper
DREAM: Document Reconstruction via End-to-end Autoregressive Model
2025, cites this paper
DeepArabicDoc: A Hybrid CNN-LSTM Framework for Multi-Scale Enhancement and Semantic Analysis of Historical Arabic Manuscripts
2025, cites this paper
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
2024, cites this paper
HTR-VT: Handwritten text recognition with vision transformer
2024, cites this paper
A machine learning driven automated system for safety data sheet indexing
2024, cites this paper
DocMamba: Efficient Document Pre-training with State Space Model
2024, cites this paper
Deep Learning based Visually Rich Document Content Understanding: A Survey
2024, cites this paper
Document Image Layout Analysis via MASK Constraint
2024, influential citation
LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding
2024, cites this paper
Approximate ground truth generation for semantic labeling of historical documents with minimal human effort
2024, cites this paper
Detection and Recognition of Table structures from Unstructured Documents
2024, cites this paper
SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding
2024, cites this paper
Application of large language models based on knowledge graphs in question-answering systems: A review
2024, cites this paper
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
2024, cites this paper
Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis
2024, cites this paper
End-to-end semi-supervised approach with modulated object queries for table detection in documents
2024, cites this paper
Information Extraction from Scanned Invoice Documents Using Deep Learning Methods
2024, cites this paper
FRFTLR: Layout Analysis for Medical Laboratory Sheet based on Regional Features and Table Lines
2024, cites this paper
Template-based text field segmentation for ID documents using dynamic squeezeboxes packing
2024, cites this paper
Fine-Grained, Accurate Data Generation and Multimodal Layout Analysis for Academic Papers
2024, influential citation
RobustLayoutLM: Leveraging Optimized Layout with Additional Modalities for Improved Document Understanding
2024, cites this paper
M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis
2024, cites this paper
UnSupDLA: Towards Unsupervised Document Layout Analysis
2024, cites this paper
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
2024, cites this paper
Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer
2024, cites this paper
SRRV: A Novel Document Object Detector Based on Spatial-Related Relation and Vision
2023, cites this paper
Syntactic Generation of Research Thesis Sketches Across Disciplines Using Formal Grammars
2023, cites this paper
Line extraction in handwritten documents via instance segmentation
2023, cites this paper
LayerDoc: Layer-wise Extraction of Spatial Hierarchical Structure in Visually-Rich Documents
2023, influential citation
Vision Grid Transformer for Document Layout Analysis
2023, influential citation
ICDAR 2023 Competition on Hierarchical Text Detection and Recognition
2023, cites this paper
Handwritten Paragraph Recognition Using Spatial Information on Russian Notebooks Dataset
2023, cites this paper
Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis
2023, cites this paper
Language Independent Neuro-Symbolic Semantic Parsing for Form Understanding
2023, cites this paper
ITEACH-Net: Inverted Teacher-studEnt seArCH Network for Emotion Recognition in Conversation
2023, cites this paper
microConceptBERT: Concept-Relation Based Document Information Extraction Framework
2023, cites this paper
NBID Dataset: Towards Robust Information Extraction in Official Documents
2023, cites this paper
A Novel Approach for Extracting Key Information from Vietnamese Prescription Images
2023, cites this paper
Integrated document segmentation and region identification: textual, equation and graphical
2023, cites this paper
Hierarchical Braille Layout Detection Model Based on Semantic Assistance
2023, cites this paper
Image Layer Modeling for Complex Document Layout Generation
2023, influential citation
Artificial Intelligence in Optical Character Recognition: Technological Aspects and Practical Implementation for Invoice Processing
2023, cites this paper
Similarity learning of product descriptions and images using multimodal neural networks
2023, cites this paper
TDeLTA: A Light-weight and Robust Table Detection Method based on Learning Text Arrangement
2023, cites this paper
Layout Representation Learning with Spatial and Structural Hierarchies
2023, cites this paper
Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images
2023, cites this paper
DocTr: Document Transformer for Structured Information Extraction in Documents
2023, cites this paper
LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding
2023, cites this paper
Generalizability in Document Layout Analysis for Scientific Article Figure & Caption Extraction
2023, cites this paper
M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis
2023, influential citation
Line Graphics Digitization: A Step Towards Full Automation
2023, cites this paper
DocAligner: Annotating Real-world Photographic Document Images by Simply Taking Pictures
2023, cites this paper
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
2023, cites this paper
HiM: hierarchical multimodal network for document layout analysis
2023, influential citation
Scientific document processing: challenges for modern learning methods
2023, cites this paper
TRACE: Table Reconstruction Aligned to Corner and Edges
2023, cites this paper
Entry Separation using a Mixed Visual and Textual Language Model: Application to 19th century French Trade Directories
2023, cites this paper
The digitization of historical astrophysical literature with highly localized figures and figure captions
2023, influential citation
Towards End-to-End Semi-Supervised Table Detection with Deformable Transformer
2023, cites this paper
Multimodal sentiment analysis based on fusion methods: A survey
2023, cites this paper
Document Layout Analysis
2023, cites this paper
Detecting Drift in Deep Learning: A Methodology Primer
2022, cites this paper
Augmentation-based Pseudo-Ground truth Generation for Deep Learning in Historical Document Segmentation for Greater Levels of Archival Description and Access
2022, influential citation
Multi-modal text recognition and encryption in scanned document images
2022, cites this paper
Development and Evaluation of a Tool for Assisting Content Creators in Making PDF Files More Accessible
2022, cites this paper
GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation
2022, cites this paper
Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches
2022, cites this paper
Transformer-Based Approach for Document Layout Understanding
2022, cites this paper
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding
2022, cites this paper
Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering
2022, cites this paper
A survey of graph neural networks in various learning paradigms: methods, applications, and challenges
2022, cites this paper
ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding
2022, cites this paper
Segmentation for document layout analysis: not dead yet
2022, influential citation
CALM: Commen-Sense Knowledge Augmentation for Document Image Understanding
2022, cites this paper
Cross-domain document layout analysis using document style guide
2022, influential citation
ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding
2022, cites this paper
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
2022, influential citation
Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features
2022, cites this paper
ScannerNet: A Deep Network for Scanner-Quality Document Images under Complex Illumination
2022, cites this paper
WebFormer: The Web-page Transformer for Structure Information Extraction
2022, cites this paper
One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text
2022, cites this paper
FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction
2022, cites this paper
Document Layout Analysis Via Positional Encoding
2022, cites this paper
STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents
2022, cites this paper
DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding
2022, cites this paper
Classroom Slide Narration System
2022, cites this paper
TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents
2022, cites this paper
Bi-VLDoc: bidirectional vision-language modeling for visually-rich document understanding
2022, cites this paper
DistillAdapt: Source-Free Active Visual Domain Adaptation
2022, influential citation
mmLayout: Multi-grained MultiModal Transformer for Document Understanding
2022, cites this paper