EVEREST: automatic identification and classification of protein domains in all protein sequences

Elon Portugaly,Amir Harel,N. Linial,Michal Linial

Published 2006 in BMC Bioinformatics

ABSTRACT

Background: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again.

Results: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains.

Conclusion: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data.
The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
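
The iterative loop described in the abstract (segment library, clustering, family selection, model building, rescanning) can be illustrated with a toy sketch. This is not the EVEREST implementation: every function name, the window length `K`, and the example sequences are hypothetical, and each stage is replaced by a deliberately trivial stand-in (fixed-length windows instead of pairwise alignment, exact-string grouping instead of clustering, a recurrence threshold instead of machine learning, and the segment string itself instead of a statistical model). Only the control flow is meant to mirror the description.

```python
from collections import defaultdict

K = 4  # hypothetical window length for the toy comparison step


def compare(seqs):
    """Toy stand-in for all-vs-all comparison: every length-K window
    of every sequence is a candidate segment (protein id, start, text)."""
    return [(pid, i, s[i:i + K])
            for pid, s in seqs.items()
            for i in range(len(s) - K + 1)]


def cluster(segments):
    """Toy clustering: segments with identical text form one putative family."""
    families = defaultdict(list)
    for pid, start, text in segments:
        families[text].append((pid, start))
    return families


def select(families):
    """Toy selection (the abstract uses machine learning at this stage):
    keep families whose members come from at least two different proteins."""
    return {text: members for text, members in families.items()
            if len({pid for pid, _ in members}) >= 2}


def build_model(text):
    """Toy model: the exact segment string stands in for a statistical model."""
    return text


def scan(models, seqs):
    """Scan all sequences with every model to recreate the segment library."""
    segments = []
    for model in models:
        for pid, s in seqs.items():
            i = s.find(model)
            while i != -1:
                segments.append((pid, i, model))
                i = s.find(model, i + 1)
    return segments


def run(seqs, rounds=2):
    """Iterate: cluster, select, build models, rescan, and cluster again."""
    segments = compare(seqs)
    families = {}
    for _ in range(rounds):
        families = select(cluster(segments))
        models = [build_model(text) for text in families]
        segments = scan(models, seqs)
    return families


# Made-up sequences sharing the region "CDEFG"
example = {"P1": "AAACDEFGAAA", "P2": "TTTCDEFGTT"}
families = run(example)
```

In this toy setting the rescanning step changes nothing, because an exact-string model finds only what was already found. In the process the abstract describes, the statistical models are what allow each round to recreate a different segment library before clustering again.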

PUBLICATION RECORD

Publication year
2006
Venue
BMC Bioinformatics
Publication date
2006-06-02
Fields of study
Biology, Medicine, Computer Science
Identifiers
DOI 10.1186/1471-2105-7-277 PMID 16749920 PMCID 1533870
Source metadata
Semantic Scholar, PubMed

REFERENCES

EVEREST:
2008cited by this paper
The Pfam protein families database
2007cited by this paper
EVEREST: automatic identification and classification of protein domains in all protein sequences
2006cited by this paper
An organizational grid of federated MOSIX clusters
2005cited by this paper
The Universal Protein Resource (UniProt): an expanding universe of protein information
2005cited by this paper
Smooth ε-Insensitive Regression by Loss Symmetrization
2005cited by this paper
Smooth epsilon-Insensitive Regression by Loss Symmetrization
2005cited by this paper
Automatic prediction of protein domains from sequence information using a hybrid learning system
2004cited by this paper
A functional hierarchical organization of the protein sequence space
2004cited by this paper
CHOP: parsing proteins into structural domains
2004cited by this paper
A robust method to detect structural and functional remote homologues
2004cited by this paper
HMMERHEAD-Accelerating HMM Searches On Large Databases
2004cited by this paper
ProtoNet: hierarchical classification of the protein space
2003influential reference
Domains, motifs and clusters in the protein universe.
2003cited by this paper
Protein structure prediction via combinatorial assembly of sub-structural units
2003cited by this paper
The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003
2003cited by this paper
Exhaustive enumeration of protein domain families.
2003cited by this paper
The metric space of proteins-comparative study of clustering algorithms
2002cited by this paper
The Pfam protein families database.
2002cited by this paper
ProDom: Automated Clustering of Homologous Domains
2002cited by this paper
ASTRAL compendium enhancements
2002cited by this paper
Sir Paul Nurse
2001cited by this paper
InterPro-an integrated documentation resource for protein families, domains and functional sites
2000cited by this paper
SMART: a web-based tool for the study of genetically mobile domains
2000cited by this paper
The ENZYME database in 2000
2000cited by this paper
The Protein Data Bank
2000cited by this paper
SCOP: a Structural Classification of Proteins database
1999cited by this paper
SCOP: a structural classification of proteins database
1998cited by this paper
Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment
1998cited by this paper
Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.
1998cited by this paper
DOMO: a new database of aligned protein domains.
1998cited by this paper
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
1997cited by this paper
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
1994cited by this paper
Identification of common molecular subsequences.
1981cited by this paper
Electronic Reprint Biological Crystallography the Protein Data Bank Biological Crystallography the Protein Data Bank
year unknowninfluential reference

CITED BY

DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets
2022cites this paper
Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation
2021cites this paper
DPCfam: a new method for unsupervised protein family classification
2020cites this paper
DNN-Dom: predicting protein domain boundary from sequence alone by deep neural network
2019cites this paper
Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing
2018cites this paper
DeepDom: Predicting protein domain boundary from sequence alone using stacked bidirectional
2018cites this paper
ThreaDomEx: a unified platform for predicting continuous and discontinuous protein domains by multiple-threading and segment assembly
2017cites this paper
A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions
2016cites this paper
Extending Protein Domain Boundary Predictors to Detect Discontinuous Domains
2015cites this paper
Current Advances in the Identification and Characterization of Putative Drug and Vaccine Targets in the Bacterial Genomes.
2015cites this paper
Classification of Protein Structure using SVM
2015cites this paper
Classify a Protein Domain Using Sigmoid Support Vector Machine
2014cites this paper
Classify a Protein Domain Using SVM Sigmoid Kernel
2014cites this paper
Automated Sequence‐Based Approaches for Identifying Domain Families
2013cites this paper
A Pluralistic Account of Homology: Adapting the Models to the Data
2013influential citation
ThreaDom: extracting protein domain boundary information from multiple threading alignments
2013cites this paper
Protein domain prediction using context statistics, the false discovery rate, and comparative genomics, with application to Plasmodium falciparum
2013cites this paper
Functional inference by ProtoNet family tree: the uncharacterized proteome of Daphnia pulex
2013cites this paper
Assessing the relationship between conservation of function and conservation of sequence using photosynthetic proteins
2012cites this paper
Protein Domains as Evolutionary Units
2010cites this paper
First insight into the human liver proteome from PROTEOME(SKY)-LIVER(Hu) 1.0, a publicly available database.
2010cites this paper
Liverbase: a comprehensive view of human liver biology.
2010cites this paper
The bologna annotation resource: a non hierarchical method for the functional and structural annotation of protein sequences relying on a comparative large-scale genome analysis.
2009cites this paper
COMPUTATIONAL METHODS FOR THE ANALYSIS OF PROTEIN STRUCTURE AND FUNCTION
2009cites this paper
Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty?
2009cites this paper
Designing Patterns and Profiles for Faster HMM Search
2009cites this paper
Connect the dots: exposing hidden protein family connections from the entire sequence tree
2008cites this paper
Protein domain prediction.
2008cites this paper
Parallel Large Scale Inference of Protein Domain Families
2008cites this paper
When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features
2007cites this paper
ProCKSI: a decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information
2007cites this paper
Growth of novel protein structural data
2007cites this paper
Assessment of Protein Domain Classifications: SCOP, CATH, Dali and EVEREST
2007cites this paper
EVEREST: automatic identification and classification of protein domains in all protein sequences
2006cites this paper
EVEREST: a collection of evolutionary conserved protein domains
2006influential citation