AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

Wenhao Huang, Zhouhong Gu, C.A.I. Peng, Jiaqing Liang, Zhixu Li, Yanghua Xiao, Liqian Wen, Zulong Chen

Published 2024 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Among existing methods, wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. In addition, we propose a new executability metric for better measuring performance on web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Our work is now open-source.

PUBLICATION RECORD

Publication year
2024
Venue
Conference on Empirical Methods in Natural Language Processing
Publication date
2024-04-19
Fields of study
Computer Science
Identifiers
DOI: 10.18653/v1/2024.emnlp-main.141
arXiv: 2404.12753
Source metadata
Semantic Scholar

REFERENCES

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
2024, cited by this paper
HeaP: Hierarchical Policies for Web Actions using LLMs
2023, cited by this paper
LASER: LLM Agent with State-Space Exploration for Web Navigation
2023, cited by this paper
Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages
2023, cited by this paper
Code Llama: Open Foundation Models for Code
2023, cited by this paper
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
2023, cited by this paper
Llama 2: Open Foundation and Fine-Tuned Chat Models
2023, cited by this paper
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
2023, cited by this paper
Hierarchical Prompting Assists Large Language Model on Web Navigation
2023, cited by this paper
WebIE: Faithful and Robust Information Extraction on the Web
2023, cited by this paper
Reflexion: language agents with verbal reinforcement learning
2023, influential reference
Self-Refine: Iterative Refinement with Self-Feedback
2023, cited by this paper
Teaching Large Language Models to Self-Debug
2023, cited by this paper
WebFormer: The Web-page Transformer for Structure Information Extraction
2022, influential reference
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
2022, cited by this paper
MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding
2021, influential reference
SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL’s Weather Data
2021, cited by this paper
Simplified DOM Trees for Transferable Attribute Extraction from the Web
2021, influential reference
FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents
2020, influential reference
OpenCeres: When Open Information Extraction Meets the Semi-Structured Web
2019, influential reference
CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web
2018, cited by this paper
World of Bits: An Open-Domain Platform for Web-Based Agents
2017, cited by this paper
Synthesis of Forgiving Data Extractors
2017, cited by this paper
Extraction and Integration of Partially Overlapping Web Sources
2013, cited by this paper
From one tree to a forest: a unified solution for structured web data extraction
2011, influential reference
Automatic Wrappers for Large Scale Web Extraction
2011, cited by this paper
Web-scale information extraction with vertex
2011, cited by this paper
URL Rule Based Focused Crawler
2008, cited by this paper
Wrapper Induction for Information Extraction
1997, cited by this paper

CITED BY

The AI Committee: A Multi-Agent Framework for Automated Validation and Remediation of Web-Sourced Data
2025, cites this paper
Disrupting Large Language Models with Hidden Prompt Injection Attacks Embedded in HTML Pages
2025, cites this paper
Throttling Web Agents Using Reasoning Gates
2025, cites this paper
Agente Autônomo Guiado por LLM para Extração de Notícias
2025, cites this paper
Symbiotic Cooperation for Web Agents: Harnessing Complementary Strengths of Large and Small LLMs
2025, cites this paper
Beyond Text: Characterizing Domain Expert Needs in Document Research
2025, cites this paper
AutoKB: Automated Creation of Structured Knowledge Bases for Domain-Specific Support
2025, cites this paper
Web Page Classification using LLMs for Crawling Support
2025, cites this paper
AutoData: A Multi-Agent System for Open Web Data Collection
2025, influential citation
R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory
2025, cites this paper
Automatic XPath generation agents for vertical websites by LLMs
2025, influential citation
Automating XPath Query Generation Using NLP for Streamlined Web Crawling and GUI Testing
2025, cites this paper
Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops
2025, cites this paper
Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents
2024, influential citation
Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction
2024, cites this paper
XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler
2024, cites this paper