Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

James F. Denton, Jose Lugo-Martinez, Abraham E. Tucker, Daniel R. Schrider, W. Warren, Matthew W. Hahn

Published 2014 in PLoS Comput. Biol.

ABSTRACT

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

PUBLICATION RECORD

Publication year
2014
Venue
PLoS Comput. Biol.
Publication date
2014-12-01
Fields of study
Biology, Medicine, Computer Science
Identifiers
DOI 10.1371/journal.pcbi.1003998 PMID 25474019 PMCID 4256071
Source metadata
Semantic Scholar, PubMed

REFERENCES

Nucleic Acids Research
2015cited by this paper
Sequencing, Assembling, and Correcting Draft Genomes Using Recombinant Populations
2014cited by this paper
Finding the missing honey bee genes: lessons learned from a genome upgrade
2014cited by this paper
Toward a statistically explicit understanding of de novo sequence assembly
2013cited by this paper
Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.
2013cited by this paper
Genomic analyses identify distinct patterns of selection in domesticated pigs and Tibetan wild boars
2013cited by this paper
REAPR: a universal tool for genome assembly evaluation
2013cited by this paper
Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
2013cited by this paper
L_RNA_scaffolder: scaffolding genomes with transcripts
2013cited by this paper
A physical, genetic and functional sequence assembly of the barley genome
2012cited by this paper
Limitations of the rhesus macaque draft genome assembly and annotation
2012influential reference
RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”
2012cited by this paper
Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation
2012cited by this paper
The Paleozoic Origin of Enzymatic Lignin Decomposition Reconstructed from 31 Fungal Genomes
2012cited by this paper
Efficient de novo assembly of large genomes using compressed data structures.
2012cited by this paper
FlyBase: improvements to the bibliography
2012influential reference
The yak genome and adaptation to life at high altitude
2012cited by this paper
Assessing pooled BAC and whole genome shotgun strategies for assembly of complex genomes
2011cited by this paper
Comparative genomics approach to detecting split-coding regions in a low-coverage genome: lessons from the chimaera Callorhinchus milii (Holocephali, Chondrichthyes)
2011cited by this paper
Genome-wide analysis of retrogene polymorphisms in Drosophila melanogaster.
2011cited by this paper
The Ecoresponsive Genome of Daphnia pulex
2011cited by this paper
Considerations for the inclusion of 2x mammalian genomes in phylogenetic analyses
2011cited by this paper
Ensembl 2012
2011cited by this paper
Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies
2011cited by this paper
The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata
2011cited by this paper
Error and Error Mitigation in Low-Coverage Genome Assemblies
2011cited by this paper
Limitations of next-generation genome sequence assembly
2011influential reference
A vertebrate case study of the quality of assemblies derived from next-generation sequences
2011influential reference
RNA-Seq improves annotation of protein-coding genes in the cucumber genome
2011cited by this paper
Scaffolding a Caenorhabditis nematode genome with RNA-seq.
2010cited by this paper
Origins and functional impact of copy number variation in the human genome
2010cited by this paper
The developmental transcriptome of Drosophila melanogaster
2010cited by this paper
Genome assembly quality: assessment and improvement using the neutral indel model.
2010cited by this paper
The Sequence Alignment/Map format and SAMtools
2009cited by this paper
All Human-Specific Gene Losses Are Present in the Genome as Pseudogenes
2009cited by this paper
Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse
2009cited by this paper
Limitations of Pseudogenes in Identifying Gene Losses
2008cited by this paper
Natural Selection Shapes Genome-Wide Patterns of Copy-Number Polymorphism in Drosophila melanogaster
2008cited by this paper
Assessing the gene space in draft genomes
2008cited by this paper
Genome assembly forensics: finding the elusive mis-assembly
2008cited by this paper
Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads
2008cited by this paper
Annotating genomes with massive-scale RNA sequencing
2008cited by this paper
The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata
2008cited by this paper
A machine-learning approach to combined evidence validation of genome assemblies
2008cited by this paper
Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17
2007cited by this paper
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes
2007cited by this paper
Diet and the evolution of human amylase gene copy number variation
2007cited by this paper
Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures
2007cited by this paper
MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes.
2007cited by this paper
Gene Family Evolution across 12 Drosophila Genomes
2007cited by this paper
xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features
2006cited by this paper
Linking the human cytogenetic map with nucleotide sequence: the CCAP clone set.
2006cited by this paper
yrGATE: a web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes
2006cited by this paper
The Evolution of Mammalian Gene Families
2006cited by this paper
Physical map-assisted whole-genome shotgun sequence assemblies.
2006cited by this paper
Comparative Genomics in Eukaryotes
2005cited by this paper
Initial sequence of the chimpanzee genome and comparison with the human genome
2005cited by this paper
Assembly of polymorphic genomes: algorithms and application to Ciona savignyi.
2005cited by this paper
wFleaBase: the Daphnia genome database
2005cited by this paper
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution
2004cited by this paper
Comparative genome assembly
2004cited by this paper
The diploid genome sequence of Candida albicans.
2004cited by this paper
A Low Number Wins the GeneSweep Pool
2003cited by this paper
Gene prediction with a hidden Markov model and a new intron submodel
2003cited by this paper
Nucleotide Sequence Database Policies
2002cited by this paper
An efficient algorithm for large-scale detection of protein families.
2002cited by this paper
The Genome Sequence of the Malaria Mosquito Anopheles gambiae
2002cited by this paper
Microbial Genes in the Human Genome: Lateral Transfer or Gene Loss?
2001cited by this paper
Initial sequencing and analysis of the human genome
2001cited by this paper
Ab initio gene finding in Drosophila genomic DNA.
2000cited by this paper
A whole-genome assembly of Drosophila.
2000cited by this paper
Comparative genomics of the eukaryotes.
2000cited by this paper
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
1997cited by this paper
Prediction of complete gene structures in human genomic DNA.
1997cited by this paper
BIOINFORMATICS ORIGINAL PAPER
year unknowncited by this paper

CITED BY

Integrative Multi-Omics Characterization and Structural Insights into the Poorly Annotated Integrin ITGA6 X1X2 Isoform in Mammals
2025cites this paper
Odorant-binding proteins (OBPs) in Ceratitis capitata Wiedemann (Diptera: Tephritidae): integrative analysis of a multigene family
2025cites this paper
Patterns of Gene Family Evolution and Selection Across Daphnia
2025cites this paper
Tips for improving genome annotation quality
2025cites this paper
LRMD: Reference-Free Misassembly Detection Based on Multiple Features from Long-Read Alignments
2025cites this paper
Trypanosoma cruzi surface components: Why so many? Why so polymorphic?
2025cites this paper
Evaluating Genome Assemblies for Optimized Completeness and Accuracy of Reference Gene Sequences in Wheat, Rye, and Triticale
2025cites this paper
Genome Evolution of Two Intertidal Sargassum Species (S. fusiforme and S. thunbergii) and Their Response to Abiotic Stressors
2025cites this paper
Characterization of a MERS-related betacoronavirus in Danish brown long-eared bats (Plecotus auritus)
2025cites this paper
Diploid chromosome-level genome assembly and annotation for Lycorma delicatula
2025cites this paper
Comparative Genomics of Zoonotic Pathogens: Genetic Determinants of Host Switching and Cross-Species Transmission
2025cites this paper
Draft whole genome sequence of Paenibacillus sp., a novel facultative anaerobic bacteria isolated from an inflammatory bowel disease patient
2025cites this paper
GapSense: Similarity Estimation-Based Gap Filler with TGS-Reads for Genome Assemblies.
2025cites this paper
Genomic and secretomic analyses of Blastobotrys yeasts reveal key xylanases for biomass decomposition
2025cites this paper
Beyond Methane Oxidation: The Protein Landscape of ANME‐2a Reveals an Integrated System for Diazotrophy and Membrane Fortification
2025cites this paper
Genomic insights into persistent infections, reinfections, and subspecies diversity of Mycobacteroides abscessus: A whole-genome sequencing study of Thai and global isolates.
2025cites this paper
Multi-metric locality sensitive hashing enhances alignment accuracy of bisulfite sequencing reads: BisHash
2025cites this paper
CGC1, a new reference genome for Caenorhabditis elegans
2025cites this paper
Standard methods and good practices in Apis honey bee omics research
2025cites this paper
A chromosome-level genome assembly of the Rhus gall aphid Schlechtendalia chinensis provides insight into the endogenization of Parvovirus-like DNA sequences
2024cites this paper
Degeneration of the Olfactory System in a Murid Rodent that Evolved Diurnalism
2024cites this paper
Higher evolutionary dynamics of gene copy number for Drosophila glue genes located near short repeat sequences
2024cites this paper
toGC: a pipeline to correct gene model for functional excavation of dark GPCRs in Phytophthora sojae
2024cites this paper
Problems with paralogs: the promise and challenges of gene duplicates in evo-devo research.
2024cites this paper
Genome assembly of the southern pine beetle (Dendroctonus frontalis Zimmerman) reveals the origins of gene content reduction in Dendroctonus
2024cites this paper
Application of orthology and network biology to infer gene functions in non-model plants.
2024cites this paper
A two-sequence motif-based method for the inventory of gene families in fragmented and poorly annotated genome sequences
2024cites this paper
Pathogen elicitor peptide (pep), systemin, and their receptors in tomato: sequence analysis sheds light on standing disagreements about biotic stress signaling components
2024cites this paper
Chromosome-scale reference genome of an ancient landrace: unveiling the genetic basis of seed weight in the food legume crop pigeonpea (Cajanus cajan)
2024cites this paper
Genome assembly variation and its implications for gene discovery in nematode species
2024influential citation
Comparative Genomics of the World's Smallest Mammals Reveals Links to Echolocation, Metabolism, and Body Size Plasticity
2024cites this paper
BakRep – a searchable large-scale web repository for bacterial genomes, characterizations and metadata
2024cites this paper
Persistent, Private, and Mobile Genes: A Model for Gene Dynamics in Evolving Pangenomes
2024cites this paper
Exploring the pangenome landscape of Mycobacterium avium complex: insights into phylogeny and lifestyle
2024cites this paper
From Chaos Comes Order: Genetics and Genome Biology of Arbuscular Mycorrhizal Fungi.
2024cites this paper
CGC1, a new reference genome for Caenorhabditis elegans
2024cites this paper
Single-worm long-read sequencing reveals genome diversity in free-living nematodes
2023cites this paper
A comprehensive evolutionary scenario for the origin and neofunctionalization of the Drosophila speciation gene Odysseus (OdsH)
2023cites this paper
Protein length distribution is remarkably uniform across the tree of life
2023cites this paper
The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health
2023cites this paper
Large-scale genome investigations reveal insights into domestication of cultivated mushrooms
2023cites this paper
Challenges in prokaryote pangenomics
2023cites this paper
An improved reference genome and first organelle genomes of Quercus suber
2023cites this paper
Genome-wide analyses of Glutathione S-transferase gene family and expression profiling under deltamethrin exposure in non-biting midge Propsilocerus akamusi.
2023cites this paper
Genome analysis of a halophilic Virgibacillus halodenitrificans ASH15 revealed salt adaptation, plant growth promotion, and isoprenoid biosynthetic machinery
2023cites this paper
Unravelling the Evolutionary Dynamics of High-Risk Klebsiella pneumoniae ST147 Clones: Insights from Comparative Pangenome Analysis
2023cites this paper
GALA: a computational framework for de novo chromosome-by-chromosome assembly with long reads
2023cites this paper
Comparison of de novo and reference genome-based transcriptome assembly pipelines for differential expression analysis of RNA sequencing data
2022cites this paper
The Evolution of Bivalve Shell Matrix Proteins
2022cites this paper
RResolver: efficient short-read repeat resolution within ABySS
2022cites this paper
Revised eutherian gene collections
2022cites this paper
Evolution of Helicobacter spp: variability of virulence factors and their relationship to pathogenicity
2022cites this paper
Comparative Genomics Provides Insights Into Genetic Diversity of Clostridium tyrobutyricum and Potential Implications for Late Blowing Defects in Cheese
2022cites this paper
Exploring bacterial diversity via a curated and searchable snapshot of archived DNA
2022cites this paper
Assessment of selection pressure exerted on genes from complete pangenomes helps to improve the accuracy in the prediction of new genes
2022cites this paper
Genetic Features of Mycobacterium avium subsp. paratuberculosis Strains Circulating in the West of France Deciphered by Whole-Genome Sequencing
2022cites this paper
Horizontal transfer of the rfb cluster in Leptospira is a genetic determinant of serovar identity
2022cites this paper
Sequence-based pangenomic core detection
2022cites this paper
Whole-Genome Sequence Comparisons of Listeria monocytogenes Isolated from Meat and Fish Reveal High Inter- and Intra-Sample Diversity
2022cites this paper
Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences
2021cites this paper
The key to egress? Babesia bovis perforin-like protein 1 (PLP1) with hemolytic capacity is required for blood stage replication and is involved in the exit of the parasite from the host cell.
2021cites this paper
Restriction of an intron size en route to endothermy
2021cites this paper
A first insight into the genome of Prototheca wickerhamii, a major causative agent of human protothecosis
2021cites this paper
Evolutionary Genomic and Bacterial Genome-Wide Association Study of Mycobacterium avium subsp. paratuberculosis and Dairy Cattle Johne's Disease Phenotypes
2021cites this paper
The genome of the zebra mussel, Dreissena polymorpha: a resource for comparative genomics, invasion genetics, and biocontrol
2021cites this paper
Helicobacter pylori virulence factors: relationship between genetic variability and phylogeographic origin
2021cites this paper
Annotation of Hox cluster and Hox cofactor genes in the Asian citrus psyllid, Diaphorina citri, reveals novel features
2021cites this paper
Protein length distribution is remarkably consistent across Life
2021cites this paper
Genome-wide analyses of ATP-Binding Cassette (ABC) transporter gene family and its expression profile related to deltamethrin tolerance in non-biting midge Propsilocerus akamusi.
2021cites this paper
GFICLEE: ultrafast tree-based phylogenetic profile method inferring gene function at the genomic-wide level
2021cites this paper
Use of Artificial Intelligence in Research and Clinical Decision Making for Combating Mycobacterial Diseases
2021cites this paper
Evolution of Toll, Spatzle and MyD88 in insects: the problem of the Diptera bias
2021cites this paper
Genome assembly of a Mesoamerican derived variety of lima bean: a foundational cultivar in the Mid-Atlantic USA
2021cites this paper
Parameter exploration improves the accuracy of long-read genome assembly
2021cites this paper
Chromosome-level genome assembly and manually-curated proteome of model necrotroph Parastagonospora nodorum Sn15 reveals a genome-wide trove of candidate effector homologs, and redundancy of virulence-related functions within an accessory chromosome
2021cites this paper
Empirical evaluation of methods for de novo genome assembly
2021cites this paper
De novo genome assemblies of butterflies
2021influential citation
Large‐scale genome sampling reveals unique immunity and metabolic adaptations in bats
2021cites this paper
Improved Understanding of the Role of Gene and Genome Duplications in Chordate Evolution With New Genome and Transcriptome Sequences
2021cites this paper
Correspondence of aCGH and long-read genome assembly for detection of copy number differences: A proof-of-concept with cichlid genomes
2021cites this paper
Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies
2021cites this paper
Phylogenomic analysis of a highly virulent Escherichia coli ST83 lineage with potential animal-human transmission.
2021cites this paper
Targeting Ascomycota genomes: what and how big?
2021cites this paper
Long reads and Hi-C sequencing illuminate the two-compartment genome of the model arbuscular mycorrhizal symbiont Rhizophagus irregularis
2021cites this paper
Evidence for Selection in the Abundant Accessory Gene Content of a Prokaryote Pangenome
2021cites this paper
Uncovering the Role of Metabolism in Oomycete–Host Interactions Using Genome-Scale Metabolic Models
2021cites this paper
Accurate reconstruction of bacterial pan- and core genomes with PEPPAN
2020cites this paper
What Is in Umbilicaria pustulata? A Metagenomic Approach to Reconstruct the Holo-Genome of a Lichen
2020cites this paper
Producing polished prokaryotic pangenomes with the Panaroo pipeline
2020cites this paper
Trends in Helicobacter pylori resistance to clarithromycin: from phenotypic to genomic approaches
2020cites this paper
Comparative analysis of the insect mobile genetic element repertoire and its influence on genome size dynamics
2020cites this paper
Draft Genome of the Asian Buffalo Leech Hirudinaria manillensis
2020cites this paper
Advances in Bioinformatics and Computational Biology: 12th Brazilian Symposium on Bioinformatics, BSB 2019, Fortaleza, Brazil, October 7–10, 2019, Revised Selected Papers
2020cites this paper
Identification and characterization of a Babesia bigemina thrombospondin-related superfamily member, TRAP-1: a novel antigen containing neutralizing epitopes involved in merozoite invasion
2020cites this paper
Enhanced Symbiotic Characteristics in Bacterial Genomes with the Disruption of rRNA Operon
2020cites this paper
A novel high-accuracy genome assembly method utilizing a high-throughput workflow
2020cites this paper
Hyaluronic acid production in Streptococcus equi species
2020cites this paper
Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes
2020cites this paper