Distributed Representations of Sentences and Documents

Published 2014 in International Conference on Machine Learning

ABSTRACT

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
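
The two bag-of-words weaknesses named in the abstract, and the idea of a dense document vector trained to predict the document's words, can be illustrated with a small Python sketch. This is not the authors' implementation: the function names (`bag_of_words`, `train_doc_vectors`), the toy vocabulary and documents, the full softmax, and all hyperparameters are assumptions chosen for readability, and the training loop is only a simplified stand-in for the approach the abstract describes.

```python
import math
import random
from collections import Counter


def bag_of_words(text, vocab):
    """Count vector over a fixed vocabulary; word order is discarded."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vocabulary, chosen only for this illustration.
vocab = ["dog", "bites", "man", "powerful", "strong", "paris"]

# Weakness 1: two sentences with different word order get the same vector.
same_vector = bag_of_words("dog bites man", vocab) == bag_of_words("man bites dog", vocab)

# Weakness 2: every pair of distinct single words is equally far apart.
d_strong = euclidean(bag_of_words("powerful", vocab), bag_of_words("strong", vocab))
d_paris = euclidean(bag_of_words("powerful", vocab), bag_of_words("paris", vocab))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_doc_vectors(docs, dim=8, epochs=200, lr=0.1, seed=0):
    """Toy sketch: one dense vector per document, trained by SGD so that
    a softmax over the vocabulary predicts the words of that document."""
    rng = random.Random(seed)
    words = sorted({w for doc in docs for w in doc.split()})
    index = {w: i for i, w in enumerate(words)}
    # D[d] is the vector of document d; U[j] is the output vector of word j.
    D = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in docs]
    U = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in words]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for d, doc in enumerate(docs):
            for w in doc.split():
                target = index[w]
                logits = [sum(x * u for x, u in zip(D[d], U[j])) for j in range(len(words))]
                probs = softmax(logits)
                total -= math.log(probs[target])
                # Gradient of the cross-entropy loss with respect to the logits.
                err = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
                # Compute the document gradient before the word vectors change.
                grad_d = [sum(err[j] * U[j][k] for j in range(len(words))) for k in range(dim)]
                for j in range(len(words)):
                    for k in range(dim):
                        U[j][k] -= lr * err[j] * D[d][k]
                for k in range(dim):
                    D[d][k] -= lr * grad_d[k]
        losses.append(total)
    return D, losses


# Hypothetical two-document corpus.
doc_vectors, losses = train_doc_vectors(["the movie was powerful", "the trip to paris was long"])
```

With these toy inputs, "dog bites man" and "man bites dog" share one count vector, and "powerful" is exactly as far from "strong" as from "paris". The learned `doc_vectors` are dense, fixed-length lists regardless of document length, which is the property the abstract emphasizes.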

PUBLICATION RECORD

Publication year: 2014
Venue: International Conference on Machine Learning
Publication date: 2014-05-16
Fields of study: Computer Science
Identifiers: arXiv 1405.4053
