SPECTER: Document-level Representation Learning using Citation-informed Transformers
Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel S. Weld
Published 2020 in Annual Meeting of the Association for Computational Linguistics
ABSTRACT
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose SPECTER, a new method to generate document-level embeddings of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
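The abstract describes pretraining a Transformer on citation-graph relatedness; the full paper formulates this as a triplet margin objective in which a query paper's embedding is pulled toward a paper it cites and pushed away from a paper it does not cite. Below is a minimal sketch of that idea, assuming a SciBERT initialization and the Hugging Face `transformers` API; the checkpoint name, margin value, toy papers, and the `embed` helper are illustrative rather than the authors' exact configuration.

```python
# Hedged sketch of citation-informed triplet pretraining in the spirit of SPECTER.
# Assumes `torch` and `transformers`; the checkpoint, margin, and toy data are
# illustrative, not the authors' exact training setup.
from torch.nn import TripletMarginLoss
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(papers):
    """Encode title + abstract and take the [CLS] vector as the paper embedding."""
    texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

# One training triple: a query paper, a paper it cites (positive),
# and a paper it does not cite (negative).
query = [{"title": "Query paper", "abstract": "Citation-informed document embeddings."}]
positive = [{"title": "Cited paper", "abstract": "A related scientific document."}]
negative = [{"title": "Uncited paper", "abstract": "An unrelated scientific document."}]

loss_fn = TripletMarginLoss(margin=1.0)  # margin chosen for illustration
loss = loss_fn(embed(query), embed(positive), embed(negative))
loss.backward()  # in practice, this step runs inside a full optimizer loop
```

At inference time the same encoding step can produce a fixed embedding for any paper from its title and abstract, which is how the learned representations can serve downstream tasks such as classification or recommendation without task-specific fine-tuning.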
PUBLICATION RECORD
- Publication year: 2020
- Venue: Annual Meeting of the Association for Computational Linguistics
- Publication date: 2020-04-15
- Fields of study: Computer Science
- Source: Semantic Scholar