Open-Source Sequence Clustering Methods Improve the State Of the Art

Evguenia Kopylova, Jose A Navas-Molina, C. Mercier, Z. Xu, F. Mahé, Yan He, Hong-Wei Zhou, Torbjørn Rognes, J. Caporaso, R. Knight

Published 2016 in mSystems

ABSTRACT

Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.

IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
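As an illustration of what clustering reads into OTUs means in practice, the following is a minimal sketch of greedy, centroid-based clustering at a fixed identity threshold. It is not the algorithm of any tool named in the abstract; the function names, the edit-distance-based identity measure, and the 0.97 threshold are assumptions chosen for the example.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Hypothetical identity measure: 1 - edit distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def greedy_cluster(reads, threshold=0.97):
    """Assign each distinct read to the first centroid it matches, else open a new OTU.

    Distinct reads are visited from most to least abundant (ties broken
    alphabetically), so abundant sequences become centroids first.
    Returns a dict mapping centroid sequence -> total read count.
    """
    counts = Counter(reads)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    otus = {}
    for seq in ordered:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += counts[seq]
                break
        else:
            # No existing centroid is similar enough: this read seeds a new OTU.
            otus[seq] = counts[seq]
    return otus


if __name__ == "__main__":
    base = "ACGTACGTAC" * 4          # 40-nt toy read
    variant = base[:-1] + "G"        # one mismatch, identity 0.975
    other = "T" * 40                 # unrelated toy read
    print(greedy_cluster([base, base, base, variant, other, other]))
```

In this toy input the single-mismatch variant is absorbed into the abundant centroid, while the unrelated read forms its own OTU. A read that arises from sequencing error but falls below the threshold would instead seed a spurious OTU, which is the kind of outcome the benchmark counts.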

PUBLICATION RECORD

Publication year
2016
Venue
mSystems
Publication date
2016-02-09
Fields of study
Biology, Medicine, Computer Science, Environmental Science
Identifiers
DOI 10.1128/mSystems.00003-15; PMID 27822515; PMCID 5069751
Source metadata
Semantic Scholar, PubMed

REFERENCES

The human microbiome
2017cited by this paper
MICCA: a complete and accurate software for taxonomic profiling of metagenomic data
2015cited by this paper
Swarm v2: highly-scalable and high-resolution amplicon clustering
2015influential reference
De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units
2015cited by this paper
Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences
2014cited by this paper
Biogeographic patterns in below-ground diversity in New York City's Central Park are similar to those observed globally
2014cited by this paper
Conditionally Rare Taxa Disproportionately Contribute to Temporal Changes in Microbial Diversity
2014cited by this paper
Development of a Dual-Index Sequencing Strategy and Curation Pipeline for Analyzing Amplicon Sequence Data on the MiSeq Illumina Sequencing Platform
2013cited by this paper
UPARSE: highly accurate OTU sequences from microbial amplicon reads
2013cited by this paper
Advancing our understanding of the human microbiome using QIIME.
2013cited by this paper
Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing
2012cited by this paper
Structure, Function and Diversity of the Healthy Human Microbiome
2012cited by this paper
The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome
2012cited by this paper
A framework for human microbiome research
2012cited by this paper
SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data
2012cited by this paper
ART: a next-generation sequencing read simulator
2012cited by this paper
Human gut microbiome viewed across age and geography
2012cited by this paper
An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea
2011influential reference
Open resource metagenomics: a model for sharing metagenomic libraries
2011cited by this paper
UCHIME improves sensitivity and speed of chimera detection
2011cited by this paper
PrimerProspector: de novo design and taxonomic analysis of barcoded polymerase chain reaction primers
2011cited by this paper
UniFrac: an effective distance metric for microbial community comparison
2011influential reference
Reducing the Effects of PCR Amplification and Sequencing Artifacts on 16S rRNA-Based Studies
2011cited by this paper
QIIME allows analysis of high-throughput community sequencing data
2010cited by this paper
Evaluating high‐throughput sequencing as a method for metagenomic analysis of nematode diversity
2009cited by this paper
Fast UniFrac: Facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data
2009influential reference
Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities
2009cited by this paper
Bacterial Community Variation in Human Body Habitats Across Space and Time
2009cited by this paper
SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB
2007influential reference
Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy
2007cited by this paper
The Human Microbiome Project
2007cited by this paper
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
2006cited by this paper
Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB
2006influential reference
Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness
2005influential reference
Tolerating some redundancy significantly speeds up clustering of large protein databases
2002cited by this paper
Clustering of highly homologous sequences to reduce the size of large protein databases
2001cited by this paper
Selection of representative protein data sets
1992cited by this paper
Conservation evaluation and phylogenetic diversity
1992cited by this paper
Basic local alignment search tool.
1990influential reference
Generalized procrustes analysis
1975cited by this paper
Swarm: Robust and Fast Clustering Method for Amplicon-based Studies
year unknowninfluential reference

CITED BY

Amplicon sequencing reveals the cryptic diversity in the dicyemid parasites of coleoid cephalopods sampled from the Atlantic and Pacific Oceans
2026cites this paper
A Tutorial Toolbox to Simplify Bioinformatics and Biostatistics Analyses of Microbial Omics Data in an Island Context
2025cites this paper
Benchmarking the Taxonomic Resolution of Fish eDNA Metabarcodes Against COI Barcodes
2025cites this paper
Assessing the value of bacteria, plants, fungi and arthropods characterized via DNA metabarcoding for separation of forensic-like surface soils at varied spatial scales.
2025cites this paper
Maximizing Identification Precision of Hymenoptera and Brachycera (Diptera) With a Non‐Destructive DNA Metabarcoding Approach
2025cites this paper
Disease-resistant watermelon variety against Fusarium wilt by remodeling rhizosphere soil microenvironment
2025influential citation
clustur: an R package for clustering features using sparse distance matrices
2025cites this paper
Microbial communities in the rhizosphere of tropical soils cultivated with maize as a function of nitrogen and phosphorus fertilizers
2025cites this paper
Metatranscriptomic analysis reveals gut microbiome bacterial genes in pyruvate and amino acid metabolism associated with hyperuricemia and gout in humans
2025cites this paper
Insights into human respiratory microbiome under dysbiosis and its analysis tool
2025cites this paper
Can Widely Used Methods Be Turned Into eDNA Samplers for Ground‐Dwelling Arthropods? Insights From Two Pilot Studies in West European Salt Marshes
2025cites this paper
Target-driven optimization of feature representation and model selection for microbiome sequencing data with ritme
2025cites this paper
Bacterial community in biological soil crusts from a Brazilian semiarid region under desertification process
2024cites this paper
Analyzing microbial community and volatile compound profiles in the fermentation of cigar tobacco leaves
2024cites this paper
Spatial and seasonal biodiversity variation in a large Mediterranean lagoon using environmental DNA metabarcoding through sponge tissue collection
2024cites this paper
Experimental evaluation of genetic variability based on DNA metabarcoding from the aquatic environment: Insights from the Leray COI fragment
2024cites this paper
Population-based nanopore sequencing of the HIV-1 pangenome to identify drug resistance mutations
2024cites this paper
A Custom Regional DNA Barcode Reference Library for Lichen-Forming Fungi of the Intermountain West, USA, Increases Successful Specimen Identification
2023cites this paper
Effects of black soldier fly meal feeding on rainbow trout gut microbiota, immune-related gene expression, and Lactococcus petauri resistance
2023cites this paper
Very early life microbiome and metabolome correlates with primary vaccination variability in children
2023influential citation
The Colorectal Cancer Microbiota Alter Their Transcriptome To Adapt to the Acidity, Reactive Oxygen Species, and Metabolite Availability of Gut Microenvironments
2023cites this paper
The Colorectal Cancer Gut Environment Regulates Activity of the Microbiome and Promotes the Multidrug Resistant Phenotype of ESKAPE and Other Pathogens
2023cites this paper
Comparison of Methods for Biological Sequence Clustering
2023cites this paper
Increased prokaryotic diversity in the Red Sea deep scattering layer
2023cites this paper
Construction of a synthetic microbial community based on multiomics linkage technology and analysis of the mechanism of lignocellulose degradation.
2023cites this paper
Microbial Community Analysis and Food Safety Practice Survey-Based Hazard Identification and Risk Assessment for Controlled Environment Hydroponic/Aquaponic Farming Systems
2022cites this paper
Land degradation affects the microbial communities in the Brazilian Caatinga biome
2022cites this paper
Topographic Attributes Override Impacts of Agronomic Practices on Prokaryotic Community Structure
2022cites this paper
Microbial communities in the rhizosphere of maize and cowpea respond differently to chromium contamination.
2022cites this paper
Designing a surveillance program for early detection of alien plants and insects in Norway
2022cites this paper
Combination of Whole Genome Sequencing and Metagenomics for Microbiological Diagnostics
2022cites this paper
Comparison of destructive and nondestructive DNA extraction methods for the metabarcoding of arthropod bulk samples
2022cites this paper
Metagenomic profiling pipelines improve taxonomic classification for 16S amplicon sequencing data
2022cites this paper
Biomonitoring of Fungal and Oomycete Plant Pathogens by Using Metabarcoding.
2022cites this paper
Deep Learning Encoding for Rapid Sequence Identification on Microbiome Data
2022cites this paper
Diversity Patterns of Protists Are Highly Affected by Methods Disentangling Biological Variants: A Case Study in Oligotrich (s.l.) Ciliates
2022cites this paper
High serum granulocyte-colony stimulating factor characterises neutrophilic COPD exacerbations associated with dysbiosis
2021cites this paper
Sea urchin microbiomes vary with habitat and resource availability
2021cites this paper
Dataset for "To denoise or to cluster? That is not the question. Optimizing pipelines for COI metabarcoding and metaphylogeography"
2021cites this paper
Analysis of endometrial microbiota in intrauterine adhesion by high-throughput sequencing
2021cites this paper
To denoise or to cluster, that is not the question: optimizing pipelines for COI metabarcoding and metaphylogeography
2021cites this paper
An Insight into Vaginal Microbiome Techniques
2021cites this paper
Microbial community structure and metabolic potential in the coastal sediments around the Yellow River Estuary.
2021cites this paper
Diversity Patterns of Protists are Highly Affected by Methods Disentangling Inter-specific Variants: A Case Study in Oligotrich (s.l.) Ciliates
2021cites this paper
Application of artificial intelligence in microbiome study promotes precision medicine for gastric cancer
2021cites this paper
Optimal sequence similarity thresholds for clustering of molecular operational taxonomic units in DNA metabarcoding studies
2021cites this paper
MCRL: using a reference library to compress a metagenome into a non-redundant list of sequences, considering viruses as a case study
2021cites this paper
Updating Urinary Microbiome Analyses to Enhance Biologic Interpretation
2021cites this paper
Dietary and Pharmacologic Manipulations of Host Lipids and Their Interaction With the Gut Microbiome in Non-human Primates
2021cites this paper
Assessing the Relationship Between Nitrate-Reducing Capacity of the Oral Microbiome and Systemic Outcomes.
2021cites this paper
Metagenomic Approach in Relation to Plant–Microbe and Microbe–Microbe Interactions
2021cites this paper
Identifying biases and their potential solutions in human microbiome studies
2021cites this paper
Effect of heavy metal-induced stress on two extremophilic microbial communities from Caviahue-Copahue, Argentina.
2020cites this paper
Vitamin D supplementation in pregnancy and early infancy in relation to gut microbiota composition and C. difficile colonization: implications for viral respiratory infections
2020cites this paper
Changes in Vaginal Microbiome in Pregnant and Nonpregnant Women with Bacterial Vaginosis: Toward Microbiome Diagnostics?
2020cites this paper
Biological observations in microbiota analysis are robust to the choice of 16S rRNA gene sequencing processing algorithm: case study on human milk microbiota
2020cites this paper
Spider phylosymbiosis: divergence of widow spider species and their tissues’ microbiomes
2020cites this paper
Phyllosphere bacterial assembly in citrus crop under conventional and ecological management
2020cites this paper
Microdiversity and phylogeographic diversification of bacterioplankton in pelagic freshwater systems revealed through long-read amplicon sequencing
2020cites this paper
The promise and challenge of cancer microbiome research
2020cites this paper
Community members in activated sludge as determined by molecular probe technology.
2020cites this paper
AmpliCI: a high-resolution model-based approach for denoising Illumina amplicon data
2020cites this paper
A framework for assessing 16S rRNA marker-gene survey data analysis methods using mixtures.
2020cites this paper
High-throughput sequencing and food microbiology.
2020cites this paper
Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL)
2020cites this paper
Grazing exclusion regulates bacterial community in highly degraded semiarid soils from the Brazilian Caatinga biome
2020cites this paper
Revisiting Plant–Microbe Interactions and Microbial Consortia Application for Enhancing Sustainable Agriculture: A Review
2020cites this paper
Mechanisms governing avian phylosymbiosis: Genetic dissimilarity based on neutral and MHC regions exhibits little relationship with gut microbiome distributions of Galápagos mockingbirds
2020influential citation
Cascabel: A Scalable and Versatile Amplicon Sequence Data Analysis Pipeline Delivering Reproducible and Documented Results
2020influential citation
Influence of Acacia mangium on Soil Fertility and Bacterial Community in Eucalyptus Plantations in the Congolese Coastal Plains
2020cites this paper
Targeted Informatics for Optimal Detection, Characterization, and Quantification of FLT3 Internal Tandem Duplications Across Multiple Next-Generation Sequencing Platforms
2020cites this paper
Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline
2020cites this paper
Experimental old nest material predicts hoopoe Upupa epops eggshell and uropygial gland microbiota
2019cites this paper
A review of methods and databases for metagenomic classification and assembly
2019cites this paper
Are We Overestimating Protistan Diversity in Nature?
2019cites this paper
Soil Bacterial Community Associated With High Potato Production and Minimal Water Use
2019cites this paper
Rates and Pathways of N2 Production in a Persistently Anoxic Fjord: Saanich Inlet, British Columbia
2019cites this paper
Performance of Microbiome Sequence Inference Methods in Environments with Varying Biomass
2019cites this paper
Understanding and overcoming the pitfalls and biases of next-generation sequencing (NGS) methods for use in the routine clinical microbiological diagnostic laboratory
2019cites this paper
Diversity and shifts of the bacterial community associated with Baikal sponge mass mortalities
2019cites this paper
Mixed Eucalyptus plantations induce changes in microbial communities and increase biological functions in the soil and litter layers
2019cites this paper
Intra-species diversity ensures the maintenance of functional microbial communities under changing environmental conditions
2019cites this paper
RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets
2019cites this paper
Stability of temperate coral Astrangia poculata microbiome is reflected across different sequencing methodologies
2019cites this paper
A Risky Business? Habitat and Social Behavior Impact Skin and Gut Microbiomes in Caribbean Cleaning Gobies
2019cites this paper
What is new and relevant for sequencing-based microbiome research? A mini-review
2019cites this paper
Analysis of oral bacterial communities: comparison of HOMINGS with a tree-based approach implemented in QIIME
2019influential citation
ANCHOR: a 16S rRNA gene amplicon pipeline for microbial analysis of multiple environmental samples
2019influential citation
FIGARO: An efficient and objective tool for optimizing microbiome rRNA gene trimming parameters
2019cites this paper
Aedes albopictus mosquitoes host a locally structured mycobiota with evidence of reduced fungal diversity in invasive populations
2019cites this paper
Integrating geochemical and microbiological information for better modeling of the N-cycle – past and present
2019cites this paper
The Contribution of Genomics to Bird Conservation
2019cites this paper
Influence of 16S rRNA variable region on perceived diversity of marine microbial communities of the Northern North Atlantic
2019cites this paper
Fecal microbiome and metabolome of infants fed bovine MFGM supplemented formula or standard formula with breast-fed infants as reference: a randomized controlled trial
2019cites this paper
Insight Into the Microbial Co-occurrence and Diversity of 73 Grapevine (Vitis vinifera) Crown Galls Collected Across the Northern Hemisphere
2019cites this paper
Interaction between high-fat diet and ethanol intake leads to changes on the fecal microbiome.
2019cites this paper
Cascabel: a flexible, scalable and easy-to-use amplicon sequence data analysis pipeline
2019cites this paper
Microbiome and imputed metagenome study of crude and refined petroleum-oil-contaminated soils: Potential for hydrocarbon degradation and plant-growth promotion
2019cites this paper
Repeated mild traumatic brain injury affects microbial diversity in rat jejunum
2019cites this paper
A Bioinformatics Guide to Plant Microbiome Analysis
2019cites this paper