Benchmarking DNA Foundation Models for Genomic Sequence Classification

Haonan Feng, Lang Wu, Bingxin Zhao, Chad Huff, Jianjun Zhang, Jia Wu, Lifeng Lin, Peng Wei, Chong Wu

Published 2024 in bioRxiv

ABSTRACT

The rapid evolution of DNA foundation models promises to revolutionize genomics, yet comprehensive evaluations are lacking. Here, we present a comprehensive, unbiased benchmark of five models (DNABERT-2, Nucleotide Transformer V2, HyenaDNA, Caduceus-Ph, and GROVER) across diverse genomic and genetic tasks including sequence classification, gene expression prediction, variant effect quantification, and topologically associating domain (TAD) region recognition, using zero-shot embeddings. Our analysis reveals that mean token embedding consistently and significantly improves sequence classification performance, outperforming other pooling strategies. Model performance varies among tasks and datasets; while general-purpose DNA foundation models showed competitive performance in pathogenic variant identification, they were less effective in predicting gene expression and identifying putative causal QTLs compared to specialized models. Our findings offer a framework for model selection, highlighting the impact of architecture, pre-training data, and embedding strategies on performance in genomic and genetic tasks.
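
As a rough illustration of the pooling step the abstract refers to, the sketch below collapses per-token embeddings into one fixed-length sequence vector. It is not the authors' code: the function names, the toy 2-dimensional embeddings, and the padding mask are all hypothetical, and the two alternatives shown (max pooling and first-token pooling) are assumed examples, since the abstract does not list which other strategies were compared.

```python
from typing import List, Sequence

Vector = List[float]


def _real_tokens(token_embeddings: Sequence[Sequence[float]],
                 mask: Sequence[int]) -> List[Sequence[float]]:
    """Keep only embeddings whose mask entry is non-zero (i.e. drop padding)."""
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    return kept


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              mask: Sequence[int]) -> Vector:
    """Mean token embedding: average each dimension over the real tokens."""
    kept = _real_tokens(token_embeddings, mask)
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def max_pool(token_embeddings: Sequence[Sequence[float]],
             mask: Sequence[int]) -> Vector:
    """Max pooling: take the largest value in each dimension over the real tokens."""
    kept = _real_tokens(token_embeddings, mask)
    dim = len(kept[0])
    return [max(vec[d] for vec in kept) for d in range(dim)]


def first_token_pool(token_embeddings: Sequence[Sequence[float]]) -> Vector:
    """First-token pooling: use one summary position (assumed to be index 0)."""
    return list(token_embeddings[0])


# Hypothetical output of a frozen model: 4 token positions, 2 dimensions each.
# The last position is padding and is excluded by the mask.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(tokens, mask)    # [3.0, 2.0]
max_vec = max_pool(tokens, mask)      # [5.0, 4.0]
first_vec = first_token_pool(tokens)  # [1.0, 0.0]
```

In a zero-shot setup of this kind, the pooled vector would typically be passed unchanged to a separate downstream classifier while the foundation model's weights stay frozen. Excluding padding positions matters for mean pooling in particular, because padded tokens would otherwise pull the average toward an arbitrary value.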

PUBLICATION RECORD

Publication year
2024
Venue
bioRxiv
Publication date
2024-08-18
Fields of study
Biology, Medicine, Computer Science
Identifiers
DOI 10.1101/2024.08.16.608288; PMID 39185205; PMCID 11343214
Source metadata
Semantic Scholar, PubMed

REFERENCES

Qwen3 Technical Report
2025, cited by this paper
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
2024, cited by this paper
scGPT: toward building a foundation model for single-cell multi-omics using generative AI
2024, cited by this paper
Identification of DNase I hypersensitive sites in the human genome by multiple sequence descriptors
2024, cited by this paper
Survey of transformers and towards ensemble learning using transformers for natural language processing
2024, cited by this paper
The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
2024, cited by this paper
Nucleotide Transformer: building and evaluating robust foundation models for human genomics
2024, cited by this paper
DNA language model GROVER learns sequence context in the human genome
2024, cited by this paper
BEND: Benchmarking DNA Language Models on biologically meaningful tasks
2023, cited by this paper
Mistral 7B
2023, cited by this paper
Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation
2023, cited by this paper
Llama 2: Open Foundation and Fine-Tuned Chat Models
2023, cited by this paper
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
2023, influential reference
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
2023, influential reference
GPT-4 Technical Report
2023, cited by this paper
Large language models generate functional protein sequences across diverse families
2023, cited by this paper
Why do tree-based models still outperform deep learning on typical tabular data?
2022, cited by this paper
iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations
2022, cited by this paper
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
2022, cited by this paper
Genomic benchmarks: a collection of datasets for genomic sequence classification
2022, cited by this paper
iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species
2022, influential reference
Parameter-Efficient Tuning with Special Token Adaptation
2022, cited by this paper
A sequence-based global map of regulatory activity for deciphering human genetics
2021, cited by this paper
High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios
2021, cited by this paper
Effective gene expression prediction from sequence by integrating long-range interactions
2021, cited by this paper
The Power of Scale for Parameter-Efficient Prompt Tuning
2021, cited by this paper
Epigenetic Patterns in a Complete Human Genome
2021, cited by this paper
Evaluating Large Language Models Trained on Code
2021, cited by this paper
iDNA-ABT: advanced deep learning model for detecting DNA methylation with adaptive features and transductive information maximization
2021, cited by this paper
Understand It in 5 Minutes!? A Quick Skim of Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2020, influential reference
Overview of the Transformer-based Models for NLP Tasks
2020, cited by this paper
Deep4mC: systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning
2020, cited by this paper
iDNA-MS: An Integrated Computational Tool for Detecting DNA Modification Sites in Multiple Genomes
2020, cited by this paper
The GTEx Consortium atlas of genetic regulatory effects across human tissues
2019, cited by this paper
Deep learning for DNase I hypersensitive sites identification
2018, cited by this paper
iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework
2016, cited by this paper
Understanding Transcription Factor Regulation by Integrating Gene Expression and DNase I Hypersensitive Sites
2015, cited by this paper
On comparing partitions
2015, cited by this paper
Genome Reference Consortium
2013, cited by this paper
Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance
2010, cited by this paper
DNA variants in the dihydrofolate reductase gene and outcome in childhood ALL
2008, cited by this paper
Predicting the in vivo signature of human gene regulatory sequence
2005, cited by this paper
The tetraspanin superfamily member CD151 regulates outside-in integrin αIIbβ3 signaling and platelet function
2004, cited by this paper
Random Forests
2001, cited by this paper
Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach
1988, cited by this paper
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
1987, cited by this paper
Comparison of the predicted and observed secondary structure of T4 phage lysozyme
1975, cited by this paper

CITED BY

The DNA dialect: a comprehensive guide to pretrained genomic language models
2026, cites this paper
Fast and alignment-free flavivirus classification from low-coverage genomes
2026, cites this paper
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
2025, cites this paper
In silico prediction of variant effects: promises and limitations for precision plant breeding
2025, cites this paper
Unsupervised evaluation of pre-trained DNA language model embeddings
2025, cites this paper
NextVir: Enabling classification of tumor-causing viruses with genomic foundation models
2025, cites this paper
Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics
2025, cites this paper
Digital to Biological Translation: How the Algorithmic Data-Driven Design Reshapes Synthetic Biology
2025, cites this paper
Improving DNA Modeling with WaveDNA: Enhancing Speed, Generalizability, and Interpretability through Wavelet Transformation
2025, cites this paper
Prediction of DNA Methylation With Long-Range State-Space Models
2025, cites this paper
vir2vec: A Viral Genome-Wide Viral Embedding
2025, cites this paper
Disease-Specific Prediction of Missense Variant Pathogenicity with DNA Language Models and Graph Neural Networks
2025, cites this paper
Beating Transformers using Synthetic Cognition
2025, cites this paper