The sizable amount of data generated by high throughput cell biology is increasing the demand on traditional computational tools in bioinformatics to handle large input datasets. Large sequence data sets create intractable search spaces that are beyond the scope of many conventional algorithms. One way to address this problem is to transform large sequence data sets to the constituent parts that characterize the features of interest (e.g. transcription factor binding sites, miRNA sites, etc.) of the problem. These features of interest take the form of k-mers in a large subset of problems in computational biology. K-mers also play an implicit role in many other bioinformatics functions from microarray probes to genomic compositional analysis.Given the increasing potential for k-mers in a wide spectrum of applications in bioinformatics we present in this paper a set of fast and efficient generic algorithms for enumerating the occurrence frequen-cies of all substrings of a given length (kmers) in whole genome sequences. Described are three algorithms of increasing complexity designed to deal with different k-mer lengths from short (couple of bases) to very long (tens of thousands of bases). They are memory based algorithms that use advanced heuristics to efficiently process large amounts of data that arise when analyzing very long genome sequences. The algorithms were tested for performance on the human, mouse, 681 bacteria and 50 archaea genome sequences. Results are described for both time and space utilization. We also describe several different experiments that demonstrate the utility of these algorithms. These algorithms can be downloaded from http://www2.kumc.edu/siddrc/bioinformatics/publication. html.
Optimum-time, Optimum-space, Algorithms for k-mer Analysis of Whole Genome Sequences
Published 2019 in jbcg
ABSTRACT
PUBLICATION RECORD
- Publication year
2019
- Venue
jbcg
- Publication date
2019-01-01
- Fields of study
Biology, Computer Science
- Identifiers
- External record
- Source metadata
Semantic Scholar
CITATION MAP
EXTRACTION MAP
CLAIMS
- No claims are published for this paper.
CONCEPTS
- No concepts are published for this paper.
REFERENCES
Showing 1-11 of 11 references · Page 1 of 1
CITED BY
Showing 1-2 of 2 citing papers · Page 1 of 1