Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

Published 2021 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

An essential operation in web corpus construction consists in retaining the desired content while discarding the rest. Another challenge is finding one’s way through websites. This article introduces a text discovery and extraction tool published under an open-source license. Its installation and use are straightforward, notably from Python and on the command line. The software allows for main text, comments and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data also shows its interest as well as the performance of other available solutions. The contributions of this paper are threefold: it references the software, features a benchmark, and provides a meaningful baseline for similar tasks. The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
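
The idea of retaining the desired content while discarding the rest can be illustrated with a deliberately small sketch. It is not the tool's algorithm or API: it uses only the Python standard library, and the names `SKIP_TAGS`, `MainTextParser`, and `extract_main_text`, as well as the rule "keep paragraph text outside navigation-like elements", are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose contents count as boilerplate in this toy example (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextParser(HTMLParser):
    """Collect text inside <p> elements, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0       # > 0 while inside a boilerplate element
        self.in_paragraph = False
        self.current = []         # text fragments of the paragraph being read
        self.paragraphs = []      # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join fragments and normalize whitespace before storing.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.current.append(data)


def extract_main_text(html):
    """Return the kept paragraphs, one per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


SAMPLE = """<html><body>
<nav><p>Home | About</p></nav>
<article><p>Web corpora need <b>clean</b> text.</p><p>Boilerplate is discarded.</p></article>
<footer><p>Copyright notice</p></footer>
</body></html>"""

print(extract_main_text(SAMPLE))
```

On the sample page, the two article paragraphs are kept and the navigation and footer text is dropped. Real pages are far less regular, which is why a fixed tag list like this one is only a starting point for the extraction problem the abstract describes.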

PUBLICATION RECORD

Publication year: 2021
Venue: Annual Meeting of the Association for Computational Linguistics
Publication date: 2021-08-01
Fields of study: Computer Science
DOI: 10.18653/v1/2021.acl-demo.15


