Unique function words characterize genomic proteins

Published 2018 in Proceedings of the National Academy of Sciences of the United States of America

ABSTRACT

Significance: The vast, mostly unknown protein universe can be explored by analyzing protein sequences as strings of domains. Broader coverage can be achieved when these domains, the essential building blocks of protein evolution, are detected using sequence profiles. Using clustering to collapse redundant profiles into unique function words (UFWs), we find that over the years 2009–2016 the number of UFWs saturates, while the number of sequences matched by a combination of two or more UFWs grows exponentially.

Between 2009 and 2016 the number of protein sequences from known species increased 10-fold, from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function, but CDART often matches the same region of a protein with two or more profiles. Such synonyms complicate estimates of functional complexity. We perform full-linkage clustering of redundant profiles by finding maximum disjoint cliques: each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly, by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs (sequences with multiple domain architectures, MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared).
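The clustering step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the overlap rule, the `min_overlap` threshold, the toy `hits` data, and all function names are assumptions made for illustration. Profiles are linked when their hits on the same protein overlap substantially, and clusters are then peeled off as disjoint cliques, largest first, so that every pair of profiles within a cluster is linked (full linkage). Finding a maximum clique is NP-hard in general, so the exhaustive search used here is only practical for small graphs.

```python
from collections import defaultdict
from itertools import combinations


def redundancy_graph(hits, min_overlap=0.5):
    """Build an undirected graph linking profiles that hit the same region.

    hits: iterable of (protein_id, profile_id, start, end), inclusive coordinates.
    Two profiles are linked when, on at least one protein, their hits share
    at least min_overlap of the shorter hit (an assumed rule).
    """
    adj = defaultdict(set)
    by_protein = defaultdict(list)
    for protein, profile, start, end in hits:
        adj[profile]  # register the profile even if it never overlaps another
        by_protein[protein].append((profile, start, end))
    for regions in by_protein.values():
        for (p, s1, e1), (q, s2, e2) in combinations(regions, 2):
            shared = min(e1, e2) - max(s1, s2) + 1
            shorter = min(e1 - s1, e2 - s2) + 1
            if p != q and shared >= min_overlap * shorter:
                adj[p].add(q)
                adj[q].add(p)
    return adj


def largest_clique(adj, nodes):
    """Exhaustive Bron-Kerbosch search for a largest clique among nodes."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = clique
            return
        for v in sorted(candidates):  # sorted for deterministic tie-breaking
            expand(clique + [v], candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(nodes), set())
    return best


def disjoint_clique_clusters(adj):
    """Repeatedly remove a largest clique until every profile is assigned."""
    remaining = set(adj)
    clusters = []
    while remaining:
        clique = largest_clique(adj, remaining)
        clusters.append(sorted(clique))
        remaining -= set(clique)
    return clusters


# Toy data (hypothetical): A, B and C cover the same region of prot1;
# E overlaps C on prot2 but is never seen overlapping A or B.
hits = [
    ("prot1", "A", 1, 100), ("prot1", "B", 5, 95), ("prot1", "C", 10, 105),
    ("prot1", "D", 200, 300),
    ("prot2", "C", 1, 90), ("prot2", "E", 20, 100), ("prot2", "D", 150, 250),
]
clusters = disjoint_clique_clusters(redundancy_graph(hits))
print(clusters)  # [['A', 'B', 'C'], ['D'], ['E']]
```

In this toy case E stays in its own cluster even though it is linked to C, because full linkage requires a link to every member of the cluster. Each resulting cluster would then be replaced by one representative profile, the UFW; how that representative is chosen is not specified in the abstract.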

PUBLICATION RECORD

Publication year
2018
Venue
Proceedings of the National Academy of Sciences of the United States of America
Publication date
2018-06-12
Fields of study
Biology, Mathematics, Computer Science, Medicine
Identifiers
DOI: 10.1073/pnas.1801182115; PMID: 29895692; PMCID: PMC6042118
Source metadata
Semantic Scholar, PubMed

CITED BY

The Analysis of Solanum lycopersicum Sap Dark Proteome Reveals Ordered and Disordered Protein Abundance (2025)
Hydrophobic Cluster Analysis at protein and proteome scales (2025)
A methodology for calculating the rarity of diverse proteins based on functional specificity and thermodynamic stability (2025)
Bridging Themes: Short Protein Segments Found in Different Architectures (2020)
Current approaches for integrating solution NMR spectroscopy and small angle scattering to study the structure and dynamics of biomolecular complexes (2020)
Improved RAD51 binders through motif shuffling based on the modularity of BRC repeats (2020)
Grammar of protein domain architectures (2019)
Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes (2019)
Large-scale analyses of human microbiomes reveal thousands of small, novel genes and their predicted functions (2018)
7-Transmembrane Helical (7TMH) Proteins: Pseudo-Symmetry and Conformational Plasticity (2018)
Pseudo-Symmetry and Conformational Plasticity in 7 Transmembrane Helix (7TMH) Proteins: Intragenic Duplication and Assembly of 3TMH or 4TMH Protodomains with Evolutionary Balance of Structural Constraints and Functional Divergence (2018)