Estimating the information value of polymorphic sites using pooled sequences

Published 2014 in BMC Genomics

ABSTRACT

High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Diagnostic variants are commonly selected using diversity statistics like FST, but these measures are not ideal for the task. Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account the uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to commonly used existing methods for identifying diagnostic polymorphisms. The expected information content gives an easy-to-interpret measure of the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.
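
The abstract does not give the formulas, so the following is only a minimal sketch of what "expected information per site" and a "conservative estimator" could look like, not the paper's actual method. It assumes a biallelic site, two populations with equal prior weight, and allele read counts treated as independent binomial draws (pool size and sequencing error are ignored). The information measure used here is the mutual information between one observed allele and the population label; the conservative variant shrinks it by evaluating it at the closest endpoints of Agresti-Coull intervals around each allele frequency. All function names and the choice of z = 1.96 are illustrative assumptions.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a biallelic site with allele frequency p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information(p1, p2):
    """Mutual information (bits) between one observed allele and the
    population label, assuming two populations with equal prior weight."""
    mixed = binary_entropy((p1 + p2) / 2)
    within = (binary_entropy(p1) + binary_entropy(p2)) / 2
    return mixed - within


def agresti_coull(k, n, z=1.96):
    """Agresti-Coull interval for a binomial proportion: k successes in n trials."""
    n_adj = n + z * z
    p_adj = (k + z * z / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def conservative_information(k1, n1, k2, n2, z=1.96):
    """Information evaluated at the closest interval endpoints (assumed rule).

    k1/n1 and k2/n2 are alternate-allele read counts and total read depth
    in each pool. Overlapping intervals yield zero information.
    """
    lo1, hi1 = agresti_coull(k1, n1, z)
    lo2, hi2 = agresti_coull(k2, n2, z)
    if hi1 < lo2:
        return expected_information(hi1, lo2)
    if hi2 < lo1:
        return expected_information(lo1, hi2)
    return 0.0


if __name__ == "__main__":
    # A fixed difference observed at high depth retains most of its 1 bit.
    print(conservative_information(0, 100, 100, 100))
    # A small difference at low depth is indistinguishable from noise.
    print(conservative_information(5, 10, 6, 10))
```

Under these assumptions a site scores between 0 bits (uninformative) and 1 bit (a single observed allele identifies the population), and low-coverage sites are pulled toward 0 because their intervals are wide.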

PUBLICATION RECORD

Publication year
2014
Venue
BMC Genomics
Publication date
2014-10-01
Fields of study
Biology, Medicine, Computer Science
Identifiers
DOI 10.1186/1471-2164-15-S6-S20 PMID 25571927 PMCID 4239578
Source metadata
Semantic Scholar, PubMed
