Performance comparisons between clustering models for reconstructing NGS results from technical replicates

Yue Zhai, C. Bardel, M. Vallée, J. Iwaz, P. Roy

Published 2023 in Frontiers in Genetics

ABSTRACT

To improve the performance of individual DNA sequencing results, researchers often use replicates from the same individual and various statistical clustering models to reconstruct a high-performance callset. Here, three technical replicates of genome NA12878 were considered and five model types were compared (consensus, latent class, Gaussian mixture, Kamila-adapted k-means, and random forest) regarding four performance indicators: sensitivity, precision, accuracy, and F1-score. In comparison with no use of a combination model, i) the consensus model improved precision by 0.1%; ii) the latent class model brought a 1% precision improvement (from 97% to 98%) without compromising sensitivity (98.9%); iii) the Gaussian mixture model and random forest provided callsets with higher precisions (both >99%) but lower sensitivities; iv) Kamila increased precision (>99%) and kept a high sensitivity (98.8%); it showed the best overall performance. According to the precision and F1-score indicators, the compared non-supervised clustering models that combine multiple callsets are able to improve sequencing performance vs. previously used supervised models. Among the models compared, the Gaussian mixture model and Kamila offered non-negligible precision and F1-score improvements. These models may thus be recommended for callset reconstruction (from either biological or technical replicates) for diagnostic or precision medicine purposes.
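
As an illustration of the simplest model named in the abstract, the following is a minimal sketch of a consensus (majority-vote) callset built from three replicates and scored with the four indicators. It is not the authors' implementation: the function names, the toy variants, the 2-of-3 vote threshold, and the use of the union of all candidate sites as the universe for counting true negatives are all assumptions made for the example.

```python
from collections import Counter


def consensus_callset(replicates, min_votes=2):
    """Keep variants called in at least `min_votes` replicate callsets.

    The 2-of-3 threshold is an assumption for this sketch.
    """
    votes = Counter(v for calls in replicates for v in set(calls))
    return {v for v, n in votes.items() if n >= min_votes}


def performance(calls, truth, universe):
    """Sensitivity, precision, accuracy and F1 over a fixed set of candidate sites.

    `universe` must contain every site in `calls` and `truth`; it is needed
    only to count true negatives for accuracy.
    """
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    tn = len(universe - calls - truth)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(universe) if universe else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical toy variants, written as (chromosome, position, ref, alt).
A, B, C = ("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 300, "G", "A")
D, E = ("2", 400, "T", "C"), ("3", 500, "A", "T")
X, Y = ("3", 600, "G", "C"), ("4", 700, "C", "A")

rep1 = {A, B, C, X}
rep2 = {A, B, D}
rep3 = {A, C, D, Y}
truth = {A, B, C, E}  # stand-in for a benchmark callset
universe = rep1 | rep2 | rep3 | truth

consensus = consensus_callset([rep1, rep2, rep3])
metrics = performance(consensus, truth, universe)
print(sorted(consensus))
print(metrics)
```

In this toy case the vote drops the two singleton calls (X and Y), which is how a consensus model can raise precision; it also keeps D, a false positive shared by two replicates, and cannot recover E, which no replicate called.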

PUBLICATION RECORD

Publication year: 2023
Venue: Frontiers in Genetics
Publication date: 2023-03-16
Fields of study: Biology, Medicine, Computer Science
Identifiers: DOI 10.3389/fgene.2023.1148147; PMID 37007945; PMCID 10060969
Source metadata: Semantic Scholar, PubMed
