A Collaborative Framework for Structure Identification over Print Documents

Published 2019 in HILDA@SIGMOD

ABSTRACT

We describe Texture, a framework for data extraction over print documents that allows end-users to construct data extraction rules over an inferred document structure. To effectively infer this structure, we enable developers to contribute multiple heuristics that identify different structures in English print documents, crowd-workers and annotators to manually label these structures, and end-users to search and decide which heuristics to apply and how to boost their performance with the help of ground-truth data collected from crowd-workers and annotators. Texture's design supports each of these different user groups through a suite of tools. We demonstrate that even with a handful of student-developed heuristics, we can achieve reasonable precision and recall when identifying structures across different document collections.
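As an illustration of the selection step the abstract describes, the following minimal sketch scores candidate heuristics against ground-truth labels by precision and recall. It is not Texture's implementation: the line-index representation of a document, the two toy heuristics, and every function name here are assumptions made only for this example.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical representation: a document is a list of text lines, and a
# heuristic returns the indices of the lines it believes are headings.
Heuristic = Callable[[List[str]], Set[int]]


def precision_recall(predicted: Set[int], truth: Set[int]) -> Tuple[float, float]:
    """Compare predicted line indices with ground-truth line indices."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def all_caps_heading(lines: List[str]) -> Set[int]:
    """Toy heuristic: a non-empty line written entirely in capitals."""
    return {i for i, line in enumerate(lines) if line.strip().isupper()}


def short_line_heading(lines: List[str]) -> Set[int]:
    """Toy heuristic: a line of one to three words."""
    return {i for i, line in enumerate(lines) if 0 < len(line.split()) <= 3}


def rank_heuristics(
    heuristics: Dict[str, Heuristic], lines: List[str], truth: Set[int]
) -> List[Tuple[str, float, float]]:
    """Score each heuristic and sort by precision, then recall, best first."""
    scored = []
    for name, heuristic in heuristics.items():
        precision, recall = precision_recall(heuristic(lines), truth)
        scored.append((name, precision, recall))
    return sorted(scored, key=lambda s: (s[1], s[2]), reverse=True)


if __name__ == "__main__":
    lines = [
        "INTRODUCTION",
        "This report covers crop yields.",
        "METHODS",
        "Field surveys",
        "We sampled ten farms in 2018.",
    ]
    truth = {0, 2}  # heading lines as labeled by a (hypothetical) annotator
    candidates = {"all_caps": all_caps_heading, "short_line": short_line_heading}
    for name, precision, recall in rank_heuristics(candidates, lines, truth):
        print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Ranking by precision first is an arbitrary choice for the sketch; an end-user who cares more about coverage could sort by recall or by a combined score instead.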

PUBLICATION RECORD

Publication year
2019
Venue
HILDA@SIGMOD
Publication date
2019-07-05
Fields of study
Computer Science
Identifiers
DOI 10.1145/3328519.3329131
Source metadata
Semantic Scholar
