Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly

Katrin Sameith, Juliana G. Roscito, M. Hiller

Published 2016 in Briefings Bioinform.

ABSTRACT

Next-generation sequencers such as Illumina can now produce reads of up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total number of erroneous reads. We show that higher read accuracy increases contig lengths two- to threefold. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/), which implements iterative error correction using modules from SGA.
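
The idea of running k-mer-based correction in several rounds with increasing k can be illustrated with a small toy sketch. This is not the SGA algorithm or the authors' pipeline: the function names, the `min_count` threshold, and the single-substitution rule are assumptions made for illustration, and the final overlap-based round is omitted. Each round recounts k-mers on the output of the previous round, then changes a base only if every k-mer covering it is rare and exactly one alternative base makes all of them frequent.

```python
from collections import Counter

BASES = "ACGT"


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def covering_kmers(seq, pos, k):
    """Return all k-mers of seq (a list of bases) that overlap position pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return ["".join(seq[s:s + k]) for s in range(first, last + 1)]


def correct_read(read, counts, k, min_count):
    """Toy rule (an assumption, not SGA's): fix a base only when all k-mers
    covering it are rare and exactly one substitution makes them all frequent."""
    seq = list(read)
    for pos in range(len(seq)):
        kmers = covering_kmers(seq, pos, k)
        if not kmers or any(counts[m] >= min_count for m in kmers):
            continue  # read shorter than k, or base supported by a frequent k-mer
        original = seq[pos]
        fixes = []
        for base in BASES:
            if base == original:
                continue
            seq[pos] = base
            if all(counts[m] >= min_count for m in covering_kmers(seq, pos, k)):
                fixes.append(base)
        # Accept only an unambiguous fix; otherwise restore the original base.
        seq[pos] = fixes[0] if len(fixes) == 1 else original
    return "".join(seq)


def iterative_correction(reads, k_values, min_count=2):
    """Run one correction round per k, from the smallest k to the largest."""
    for k in sorted(k_values):
        counts = count_kmers(reads, k)  # recount on the current, partly corrected reads
        reads = [correct_read(r, counts, k, min_count) for r in reads]
    return reads


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTACCGATAGGC"
    noisy = genome[:12] + "A" + genome[13:]  # one substitution error (T -> A)
    reads = [genome] * 4 + [noisy]
    corrected = iterative_correction(reads, k_values=[5, 7])
    print(corrected[-1] == genome)
```

In this sketch a small k needs only a few matching neighbours to call a k-mer frequent, while a larger k reaches further into flanking sequence, which is the intuition behind combining both in successive rounds.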

PUBLICATION RECORD

Publication year: 2016
Venue: Briefings Bioinform.
Publication date: 2016-02-10
Fields of study: Biology, Medicine, Computer Science
Identifiers: DOI 10.1093/bib/bbw003; PMID 26868358; PMCID 5221426
Source metadata: Semantic Scholar, PubMed

CITED BY

PGIP: a web server for the rapid taxonomic identification of parasite genomes (2025)
Dynamics of soil microbial communities involved in carbon cycling along three successional forests in southern China (2024)
Advances in long-read single-cell transcriptomics (2024)
The genome of a Far Eastern isolate of Diaporthe caulivora, a soybean fungal pathogen (2023)
k-mer analysis shows hybrid hummingbirds perform variable, transgressive courtship sequences (2022)
Convergent and lineage-specific genomic differences in limb regulatory elements in limbless reptile lineages (2022)
Bi-Level Error Correction for PacBio Long Reads (2020)
Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review (2020)
Hemocyanin of the caenogastropod Pomacea canaliculata exhibits evolutionary differences among gastropod clades (2020)
Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing (2020)
Phenotype Prediction of DNA Sequence Data: A Machine- and Statistical Learning Approach (2020)
Lessons from culturing lichen soredia (2020)
Illuminating an Ecological Blackbox: Using High Throughput Sequencing to Characterize the Plant Virome Across Scales (2020)
Deep Neural Network: An Efficient and Optimized Machine Learning Paradigm for Reducing Genome Sequencing Error (2020)
Efficient Mining Multi-Mers in a Variety of Biological Sequences (2020)
Insights into Red Sea Brine Pool Specialized Metabolism Gene Clusters Encoding Potential Metabolites for Biotechnological Applications and Extremophile Survival (2019)
A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads (2019)
Natrarchaeobius chitinivorans gen. nov., sp. nov., and Natrarchaeobius halalkaliphilus sp. nov., alkaliphilic, chitin-utilizing haloarchaea from hypersaline alkaline lakes (2019)
Gut Microbiota and Predicted Metabolic Pathways in a Sample of Mexican Women Affected by Obesity and Obesity Plus Metabolic Syndrome (2019)
A Selective Review of Multi-Level Omics Data Integration Using Variable Selection (2019)
Genomic Analysis of Pseudomonas sp. Strain SCT, an Iodate-Reducing Bacterium Isolated from Marine Sediment, Reveals a Possible Use for Bioremediation (2019)
Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures (2019)
Genetic diversity and population structure of Glossina morsitans morsitans in the active foci of human African trypanosomiasis in Zambia and Malawi (2019)
Raoultella bacteriophage RP180, a new member of the genus Kagunavirus, subfamily Guernseyvirinae (2019)
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models (2019)
Genomic Analysis of γ-Hexachlorocyclohexane-Degrading Sphingopyxis lindanitolerans WS5A3p Strain in the Context of the Pangenome of Sphingopyxis (2019)
An improved approach to infer protein-protein interaction based on a hierarchical vector space model (2018)
Diverse roles of RAD18 and Y-family DNA polymerases in tumorigenesis (2018)
Using an optimal set of features with a machine learning-based approach to predict effector proteins for Legionella pneumophila (2018)
Interpretation of differential gene expression results of RNA-seq data: review and integration (2018)
High-Throughput Identification of Mammalian Secreted Proteins Using Species-Specific Scheme and Application to Human Proteome (2018)
A benchmark study of k-mer counting methods for high-throughput sequencing (2018)
CLIPick: a sensitive peak caller for expression-based deconvolution of HITS-CLIP signals (2018)
The genome of the tegu lizard Salvator merianae: combining Illumina, PacBio, and optical mapping data to generate a highly contiguous assembly (2018)
Inside Plectosphaerellaceae (2018)
ATHENA: Automated Tuning of Genomic Error Correction Algorithms using Language Models (2018)
Four selenoprotein P genes exist in salmonids: Analysis of their origin and expression following Se supplementation and bacterial infection (2018)
Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery (2018)
BACTpipe: characterization of bacterial isolates based on whole-genome sequence data (2018)
Secure Logistic Regression Based on Homomorphic Encryption: Design and Evaluation (2018)
The cup fungus Pestalopezia brunneopruinosa is Pestalotiopsis gibbosa and belongs to Sordariomycetes (2018)
Metagenomic Approaches for Understanding New Concepts in Microbial Science (2018)
p-Value Histograms: Inference and Diagnostics (2018)
Computational Strategies for Dissecting the High-Dimensional Complexity of Adaptive Immune Repertoires (2017)
Alignment-free inference of hierarchical and reticulate phylogenomic relationships (2017)
Computational Errors and Biases in Short Read Next Generation Sequencing (2017)
MapReduce for accurate error correction of next-generation sequencing data (2017)
Phenotype loss is associated with widespread divergence of the gene regulatory landscape in evolution (2017)
Evaluation of background prediction for variant detection in a clinical context: towards improved NGS monitoring of minimal residual disease in hematological malignancies (2017)
Gerbil: a fast and memory-efficient k-mer counter with GPU-support (2016)
Computational problems of analysis of short next generation sequencing reads (2016)