Machine learning in automated text categorization

Published 2001 in CSUR

ABSTRACT

The automated categorization (or classification) of texts into predefined categories has witnessed booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
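
The three problems named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the toy documents, the category names, and all function names are hypothetical, and multinomial naive Bayes with add-one smoothing is chosen only because it is short, not because the abstract prescribes it. Documents are represented as bags of lowercase words, a classifier is learned from preclassified documents, and the result is evaluated with per-category precision, recall, and F1.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a bag of lowercase, whitespace-separated terms."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Classifier construction: collect the counts needed by multinomial naive Bayes."""
    class_doc_counts = Counter()          # documents seen per category
    term_counts = defaultdict(Counter)    # term frequencies per category
    vocabulary = set()
    for text, category in labeled_docs:
        class_doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocabulary.add(term)
    return class_doc_counts, term_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-posterior score."""
    class_doc_counts, term_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[category].values())
        for term in tokenize(text):
            if term in vocabulary:  # terms unseen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                smoothed = (term_counts[category][term] + 1) / (total_terms + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def precision_recall_f1(true_labels, predicted_labels, category):
    """Classifier evaluation for one category, treated as a binary decision."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data; a real experiment would use a standard test collection.
TRAIN = [
    ("wheat corn harvest prices", "grain"),
    ("corn exports rise", "grain"),
    ("bank raises interest rates", "finance"),
    ("interest rates and bank loans", "finance"),
]
TEST = [("wheat harvest exports", "grain"), ("bank loans interest", "finance")]

model = train_naive_bayes(TRAIN)
predictions = [classify(model, text) for text, _ in TEST]
true_labels = [label for _, label in TEST]
print(predictions, precision_recall_f1(true_labels, predictions, "grain"))
```

The evaluation function scores each category as its own binary decision, so per-category figures can be averaged afterwards in whichever way an experiment requires.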

PUBLICATION RECORD

Publication year
2001
Venue
CSUR
Publication date
2001-10-26
Fields of study
Computer Science
Identifiers
DOI: 10.1145/505282.505283; arXiv: cs/0110053
Source metadata
Semantic Scholar

REFERENCES

Storage and Retrieval
2002cited by this paper
A Study of Approaches to Hypertext Categorization
2002cited by this paper
Guest Editors' Introduction to the Special Issue on Automated Text Categorization
2002cited by this paper
Hidden Markov Models for Text Categorization in Multi-Page Documents
2002cited by this paper
A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization
2001cited by this paper
Foundations of Statistical Natural Language Processing
2001cited by this paper
Categorization for Multi-page Documents : A Hybrid Naive Bayes HMM Approach
2001cited by this paper
HMM-based passage models for document classification and ranking
2001cited by this paper
An improved boosting algorithm and its application to text categorization
2000cited by this paper
A practical hypertext categorization method using links and incrementally available class information
2000cited by this paper
Text filtering by boosting naive Bayes classifiers
2000cited by this paper
Boosting Applied to Word Sense Disambiguation
2000cited by this paper
Text Classification from Labeled and Unlabeled Documents using EM
2000cited by this paper
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
2000cited by this paper
Hierarchical classification of Web content
2000cited by this paper
A Comparative Study of Classification Based Personal E-mail Filtering
2000cited by this paper
BoosTexter: A Boosting-based System for Text Categorization
2000cited by this paper
Text-based approaches for non-topical image categorization
2000cited by this paper
Detecting Concept Drift with Support Vector Machines
2000cited by this paper
Boosting for document routing
2000cited by this paper
Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization
2000cited by this paper
A Boosting Approach to Topic Spotting on Subdialogues
2000cited by this paper
Support vector machines for spam categorization
1999cited by this paper
Text-Based Approaches for the Categorization of Images
1999cited by this paper
A patent search and classification system
1999cited by this paper
Feature reduction for neural network based text categorization
1999cited by this paper
Machine Learning in Automated Text Categorisation
1999cited by this paper
Context-sensitive learning methods for text categorization
1999cited by this paper
Text classification using ESC-based stochastic decision lists
1999cited by this paper
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
1999cited by this paper
Automatic Web Page Categorization by Link and Context Analysis
1999cited by this paper
Exploiting Hierarchy in Text Categorization
1999cited by this paper
A re-examination of text categorization methods
1999cited by this paper
Transductive Inference for Text Classification using Support Vector Machines
1999cited by this paper
Automatic Text Categorization and Its Application to Text Retrieval
1999cited by this paper
Feature Selection in SVM Text Categorization
1999cited by this paper
A probabilistic description-oriented approach for categorizing web documents
1999cited by this paper
Maximizing Text-Mining Performance
1999cited by this paper
Hierarchical neural networks for text categorization (poster abstract)
1999influential reference
Text mining: finding nuggets in mountains of textual data
1999cited by this paper
NEW DIRECTIONS IN TEXT CATEGORIZATION
1999cited by this paper
ATTICS: A Software Platform for Online Text Classification (poster abstract).
1999cited by this paper
A survey of probabilistic models in Information Retrieval
1999cited by this paper
Feature Engineering for Text Classification
1999cited by this paper
An Evaluation of Statistical Approaches to Text Categorization
1999cited by this paper
Mining online text
1999cited by this paper
Hierarchical neural networks for text categorization
1999cited by this paper
Exploiting Structural Information for Text Classification on the WWW
1999cited by this paper
Learnable visual keywords for image classification
1999cited by this paper
Learning to Resolve Natural Language Ambiguities: A Unified Approach
1998cited by this paper
Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies
1998cited by this paper
Learning to Classify Text from Labeled and Unlabeled Documents
1998cited by this paper
Categorisation by Context
1998cited by this paper
Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text
1998cited by this paper
Distributional clustering of words for text classification
1998influential reference
A new on-line learning algorithm for adaptive text filtering
1998cited by this paper
Boosting and Rocchio applied to text filtering
1998cited by this paper
Automatic Word Sense Discrimination
1998cited by this paper
Classification of text documents
1998cited by this paper
Improving Text Classification by Shrinkage in a Hierarchy of Classes
1998cited by this paper
Joins that Generalize: Text Classification Using WHIRL
1998cited by this paper
Four text classification algorithms compared on a Dutch corpus
1998cited by this paper
Document classification using multiword features
1998cited by this paper
Feature Subset Selection in Text-Learning
1998cited by this paper
Employing EM and Pool-Based Active Learning for Text Classification
1998cited by this paper
Word sequences as features in text-learning
1998cited by this paper
Automatic essay grading using text categorization techniques
1998cited by this paper
Using a generalized instance set for automatic text categorization
1998cited by this paper
A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation
1998cited by this paper
Text classification with self-organizing maps: Some lessons learned
1998cited by this paper
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval
1998cited by this paper
An experimental evaluation of OCR text representations for learning document classifiers
1998cited by this paper
Integrating linguistic resources in a uniform way for Text classification tasks
1998cited by this paper
Readings in information retrieval.
1998cited by this paper
The TREC-7 Filtering Track: Description and Analysis
1998cited by this paper
Enhanced hypertext categorization using hyperlinks
1998cited by this paper
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss
1997cited by this paper
Hierarchically Classifying Documents Using Very Few Words
1997cited by this paper
Automatic Detection of Text Genre
1997cited by this paper
Feature selection, perceptron learning, and a usability case study for text categorization
1997cited by this paper
A Comparative Study on Feature Selection in Text Categorization
1997cited by this paper
Autonomous document classification for business
1997cited by this paper
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
1997cited by this paper
Information Extraction
1997cited by this paper
Active Learning with Committees for Text Categorization
1997cited by this paper
Using WordNet to Complement Training Information in Text Categorization
1997cited by this paper
Exploiting Thesaurus Knowledge in Rule Induction for Text Classification
1997cited by this paper
Using a Bayesian Network Induction Approach for Text Categorization
1997cited by this paper
Mistake-Driven Learning in Text Categorization
1997cited by this paper
Learning routing queries in a query zone
1997cited by this paper
The text categorization system TEKLIS at TREC-6
1997cited by this paper
RELEVANCE: A review of and a framework for the thinking on the notion in information science
1997cited by this paper
Context-sensitive Learning Methods for Text Categorization
1996cited by this paper
Learning Rules that Classify E-Mail
1996cited by this paper
ACTION: automatic classification for full-text documents
1996cited by this paper
Error Correlation and Error Reduction in Ensemble Classifiers
1996cited by this paper
Combining classifiers in text categorization
1996cited by this paper
Training algorithms for linear text classifiers
1996cited by this paper

CITED BY

Operationalization of Machine Learning with Serverless Architecture: An Industrial Implementation for Harmonized System Code Prediction
2026cites this paper
Economic Profile Generation from Textual Data Using an Algorithm Framework
2026cites this paper
The Emergence and Trajectories of the Glocalization Concept (1990–2025)
2026cites this paper
Multi-Class News Classification with BERT, DistilBERT, RoBERTa, and ELECTRA Natural Language Processing Models
2026cites this paper
Multimodal AI Application for Vietnamese Digital Learning Material Classification
2026cites this paper
A comparative evaluation of transformer models for medical abstract classification
2026cites this paper
Automatic classification method of e-commerce commodity raw materials through the introduction of self-supervised concepts and the construction of domain ontology
2026cites this paper
Sustainability-Oriented Institutions and the Success of Green Reward-Based Crowdfunding Campaigns
2026cites this paper
Bridging the research–practice gap in construction contract management with NLP and LLMs
2026cites this paper
OpenITI Tabanlı RAG Sistemi ile Doğal Dil İşleme [Natural Language Processing with an OpenITI-Based RAG System]
2026cites this paper
Exploring the role of emerging technologies in advancing sustainable development goals (SDGs) in developing countries: a comprehensive review
2026cites this paper
Automating the Classification of Economic Activities in Official Statistics: A Comparative Study of Neural Networks and Transformers
2026cites this paper
Taxonomical modeling and classification in space hardware failure reporting
2026cites this paper
Text-Driven Early Warning of Supply Chain Risks: A Hybrid Machine- and Deep-Learning Framework for the New Energy Vehicle (NEV) Industry
2026cites this paper
MER-CAPF: Audio-Text Emotion Recognition through Cross-Attention Mechanism and Multi-Granularity Pooling Strategy
2026cites this paper
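Sorgulama Topluluğu çerçevesinde metin sınıflandırması: Yapay sinir ağı temelli bir çalışma [Text classification within the Community of Inquiry framework: An artificial neural network-based study]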
2026cites this paper
Document modelling using topics based on meaningful-interesting patterns for information filtering
2026cites this paper
Fuzzy Improved Distributions for Exceedance Counts in Order Statistic Intervals
2026cites this paper
Cataloguing, Metadata, and Generative AI. Early Experiences and Future Perspectives
2026cites this paper
Learning-based approaches for wireless PHY layers from the perspective of conventional machine learning to foundation models: A comprehensive survey
2026cites this paper
A Survey of Six Classical Classifiers, Including Algorithms, Methodological Characteristics, Foundational Variants, and Recent Advances
2026cites this paper
Multimodal Sentiment Analysis based on Multi-channel and Symmetric Mutual Promotion Feature Fusion
2026cites this paper
Rethinking Scaling Up Content Analysis: A Reappraisal of Justifications and Practices for Large-Scale Content Analysis
2025cites this paper
AI language model applications for early diagnosis of childhood epilepsy based on unstructured first‐visit patient narratives: A cohort study
2025cites this paper
UCOM_UNAM_PLN at CheckThat! 2025: Evaluating LLMs in a two-Step Architecture for Numerical Fact Checking
2025cites this paper
Deep Learning–Based Sentiment and Topic Analysis of Turkish Football Fans on X Platform
2025cites this paper
An efficient network intrusion detection model based on beta mixture models
2025cites this paper
An artificial intelligence assistant to reader response theory: Pioneering novel analysis in the digital age
2025cites this paper
Prediction of drug-induced nephrotoxicity based on deep learning algorithm and molecular fingerprints.
2025cites this paper
Automating classification of veterinary biosecurity recommendations using machine learning.
2025cites this paper
ChatGPT in the Working World
2025cites this paper
Transformer and statistical models for LCSH assignment: a comparative study in digital libraries
2025cites this paper
Vector Similarity-Assisted ChatGPT Approach for Text Classification
2025cites this paper
A Hybrid Framework for Subject Analysis: Integrating Embedding‐Based Regression Models with Large Language Models
2025cites this paper
Multiobjective Project Clustering Optimization for Enhancing Highway Contract Bundling Decisions
2025cites this paper
Scale-Free Characteristics of Multilingual Legal Texts and the Limitations of LLMs
2025cites this paper
Multi-instance multi-label position-aware doubly graph convolutional networks
2025cites this paper
What One Million Prompts Tells Us About AI Usage, Topics, and Preferences
2025cites this paper
Developing and Evaluating a Classification Model for Construction Defect Control: A Text Mining and Ensemble Learning Approach
2025cites this paper
Text Classification with Deep Learning and Transfer Learning: A Survey
2025cites this paper
Unique bioaccumulation and biosynthesis of arsenobetaine in marine fish.
2025cites this paper
A Review of Document Classification Techniques Using Machine Learning and Deep Learning
2025cites this paper
Improving Text Classification of Imbalanced Call Center Conversations Through Data Cleansing, Augmentation, and NER Metadata
2025cites this paper
Detection of fake reviews in social media with F score using novel support vector machine over K-nearest neighbor
2025cites this paper
A Novel Graph Convolutional Text Classification Based on Token-Task Learning
2025cites this paper
Text Classification Using Enhanced Binary Wind Driven Optimization Algorithm
2025cites this paper
A systematic review via text mining approaches of human and veterinary applications of photobiomodulation: focus on multiwave locked system laser therapy
2025cites this paper
Enhancing Requirements Classification Using Machine Learning Techniques
2025cites this paper
Modeling and clustering of heterogeneous multivariate categorical sequences
2025cites this paper
Few-Shot and Zero-Shot Classification with Large Language Models: A Comparative Study
2025cites this paper
QIDLearningLib: A Python library for quasi-identifier recognition and evaluation
2025cites this paper
Video Segmentation and Tokenization for Model-Based Video Scene Classification
2025cites this paper
Optimization of Arabic text classification using SVM integrated with word embedding models on a novel dataset
2025cites this paper
Machine Learning Based Identification of LLM Generated Scientific Research Article Abstracts
2025cites this paper
Spam Detection Using an Advanced Hybrid Model
2025cites this paper
Language as Data: A Survey of Natural Language Processing for Economics and Finance
2025cites this paper
Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
2025cites this paper
Comparison of Machine Learning Models to Classify Documents on Digital Development
2025cites this paper
RADAr: A Transformer-Based Autoregressive Decoder Architecture for Hierarchical Text Classification
2025influential citation
Ant colony optimization enhanced with ensemble of pheromone vectors using multi-criteria decision making: A case study in multi-label text feature selection
2025cites this paper
Adapting in times of crisis: how social media marketing of gambling changed in response to major shifts in the gambling landscape
2025cites this paper
Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization
2025cites this paper
SNERC: Enhancing Knowledge Management with Named Entity Recognition and Document Classification for Apply Gaming
2025cites this paper
Mapping the “X” Debate: Water Fluoridation Sentiment Analysis With Advanced Machine Learning
2025cites this paper
BERT-SVM: A hybrid BERT and SVM method for semantic similarity matching evaluation of paired short texts in English teaching
2025cites this paper
A label-guided contrastive capsule network for few-shot text classification
2025cites this paper
Balanced Knowledge Transfer in MTTL-ClinicalBERT: A Symmetrical Multi-Task Learning Framework for Clinical Text Classification
2025cites this paper
Semi-Automated Training of AI Vision Models
2025cites this paper
How Effective are Generative Large Language Models in Performing Requirements Classification?
2025cites this paper
LEUSIM: A Lightweight Extendable User SIMulator for Testing Commercial Task-Oriented Dialogue System
2025cites this paper
Machine learning assessment of zoonotic potential in avian influenza viruses using PB2 segment
2025cites this paper
Dual-granularity multi-instance multi-label learning with variational autoencoder
2025cites this paper
Cyberbullying detection and blocking using machine learning
2025cites this paper
Text mining and topic modeling insights on fish welfare and antimicrobial use in aquaculture
2025cites this paper
Comparing large Language models and human annotators in latent content analysis of sentiment, political leaning, emotional intensity and sarcasm
2025cites this paper
IoT-AID: Leveraging XAI for Conversational Recommendations in Cyber-Physical Systems
2025cites this paper
Perception and argumentation in the LK-99 superconductivity controversy: a sentiment and argument mining analysis
2025cites this paper
Dual-channel knowledge attention for traditional Chinese medicine syndrome differentiation
2025cites this paper
Hierarchical deep learning for multi-label imbalanced text classification of economic literature
2025cites this paper
The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?
2025cites this paper
Development of machine learning-based mpox surveillance models in a learning health system
2025cites this paper
Efficient multimodal learning for corporate credit risk prediction with an extended deep belief network
2025cites this paper
Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues
2025cites this paper
Exploring the development of Slovenian sociological science: ontology analysis of scientific bibliographical data
2025cites this paper
An NLP Approach to Efficient Duplicate Question Detection using Neural Networks and TF-IDF
2025cites this paper
EDCIN: enhanced dual-channel interaction network for multi-label text classification
2025cites this paper
MysticCIOL@DravidianLangTech 2025: A Hybrid Framework for Sentiment Analysis in Tamil and Tulu Using Fine-Tuned SBERT Embeddings and Custom MLP Architectures
2025cites this paper
Machine learning-enabled optoelectronic material discovery: a comprehensive review
2025cites this paper
Enhancing Twitter Sentiment Classification with a Hybrid Bio-Inspired Feature Selection Approach
2025cites this paper
A transfer learning-enhanced deep learning framework for efficient and interpretable soil heavy metal pollution prediction under data scarcity and spatial heterogeneity.
2025cites this paper
A Hybrid Approach for Predicting Depression Using Deep Learning
2025cites this paper
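Komplo Teorilerinin Etki Potansiyelini Anlama: Texe Marrs’ın Illuminati Örneği [Understanding the Impact Potential of Conspiracy Theories: The Example of Texe Marrs's Illuminati]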
2025cites this paper
Identifying female body shapes and key measurements using supervised machine learning algorithms
2025cites this paper
Multimodal Approach for Lung Disease Classification: Fusing Chest X-Ray Images and Clinical Texts
2025cites this paper
Multilabel-Thai Text Classification with Transformer-Rnn in Thai Banking Classification
2025cites this paper
Automatic Subject Descriptions of Polish Library and Information Science Articles: A Comparison of DESCRIPTOR E-Service and GPT-4o
2025cites this paper
Comparative Analysis of BERT and GPT for Classifying Crisis News with Sudan Conflict as an Example
2025cites this paper
SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps
2025cites this paper
AI-supported cataloger: a deep dive into intelligent document classification
2025cites this paper
Leveraging LLaMA2 for improved document classification in English
2025cites this paper