Sequence embedding for fast construction of guide trees for multiple sequence alignment

G. Blackshields, Fabian Sievers, Weifeng Shi, A. Wilm, D. Higgins

Published 2010 in Algorithms for Molecular Biology

ABSTRACT

Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.

Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.

Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
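
The general idea of embedding can be illustrated with a minimal sketch. It is not the authors' implementation: the k-mer Jaccard distance, the evenly spaced choice of reference sequences, and the names `kmer_distance`, `embed` and `n_refs` are all assumptions made for illustration. Each sequence is represented by its vector of distances to a small set of reference sequences, so only N × n_refs expensive distances are computed rather than all N² pairs, and cheap vector distances stand in for the rest.

```python
import math


def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Stand-in for an expensive sequence distance (assumption):
    1 minus the Jaccard similarity of the two k-mer sets."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def embed(sequences, n_refs):
    """Embed each sequence as its distances to n_refs reference sequences.

    References are picked at even intervals from the input (a hypothetical
    selection rule). Cost is len(sequences) * n_refs distance calls.
    """
    step = max(1, len(sequences) // n_refs)
    refs = sequences[::step][:n_refs]
    return [[kmer_distance(s, r) for r in refs] for s in sequences]


def euclidean(u, v):
    """Cheap distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAATA"]
vectors = embed(seqs, n_refs=2)

# Similar sequences end up close together in the embedded space.
near = euclidean(vectors[0], vectors[1])
far = euclidean(vectors[0], vectors[2])
```

In this toy input, `near` is smaller than `far`, which is the property a clustering step would rely on when building a guide tree from the embedded vectors instead of a full distance matrix.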

PUBLICATION RECORD

Publication year
2010
Venue
Algorithms for Molecular Biology
Publication date
2010-05-14
Fields of study
Medicine, Computer Science
Identifiers
DOI 10.1186/1748-7188-5-21 PMID 20470396 PMCID 2893182
Source metadata
Semantic Scholar, PubMed

REFERENCES

Alignment
2009influential reference
Recent developments in the MAFFT multiple sequence alignment program
2008cited by this paper
Fast embedding methods for clustering tens of thousands of sequences
2008influential reference
Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
2008cited by this paper
The Ribosomal Database Project: improved alignments and new tools for rRNA analysis
2008cited by this paper
Progressive sequence alignment as a prerequisite to correct phylogenetic trees
2007cited by this paper
Clustal W and Clustal X version 2.0
2007cited by this paper
PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences
2007influential reference
PROMALS: towards accurate multiple sequence alignments of distantly related proteins
2007cited by this paper
BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark
2005cited by this paper
Pfam: clans, web tools and services
2005influential reference
ProbCons: Probabilistic consistency-based multiple sequence alignment.
2005cited by this paper
The alignment of sets of sequences and the construction of phyletic trees: An integrated method
2005cited by this paper
PHYLIP (Phylogeny Inference Package)
2004cited by this paper
Rfam: annotating non-coding RNAs in complete genomes
2004cited by this paper
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
2004influential reference
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
2002influential reference
T-Coffee: A novel method for fast and accurate multiple sequence alignment.
2000cited by this paper
Cluster-preserving Embedding of Proteins
1999influential reference
HOMSTRAD: A database of protein structure alignments for homologous families
1998influential reference
Pfam: A comprehensive database of protein domain families based on seed alignments
1997influential reference
Global self-organization of all known protein sequences reveals inherent biological signatures.
1997cited by this paper
The geometry of graphs and some of its algorithmic applications
1994cited by this paper
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
1994cited by this paper
Multiple sequence alignment by a pairwise algorithm
1987cited by this paper
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
1987cited by this paper
Rapid similarity searches of nucleic acid and protein data banks.
1983cited by this paper
Comparison of phylogenetic trees
1981cited by this paper
A general method applicable to the search for similarities in the amino acid sequence of two proteins.
1970cited by this paper
Some distance properties of latent root and vector methods used in multivariate analysis
1966cited by this paper
Binary codes capable of correcting deletions, insertions, and reversals
1965cited by this paper
Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis
1964cited by this paper
Numerical Taxonomy
1962influential reference

CITED BY

Clustering-based progressive alignment with fuzzy logic (CPA-FL).
2026cites this paper
Docking and Antimicrobial Potential of Lactoferrin in Capra hircus: An In-silico Approach
2025cites this paper
Trace alignment algorithm optimization
2025cites this paper
An nf-core framework for the systematic comparison of alternative modeling tools: the multiple sequence alignment case study
2025cites this paper
In silico prediction of cytotoxic T-cell epitopes from Helicobacter pylori virulence factors using an immunoinformatics approach
2025cites this paper
Nitrative stress-induced dysregulation of placental AQUAPORIN-9: A potential key player in preeclampsia pathogenesis.
2025cites this paper
Scalable Guide Tree Construction Using Quantum Annealing for Multiple Sequence Alignment
2025cites this paper
The Computer‐Assisted Sequence Annotation (CASA) workflow for enzyme discovery
2025cites this paper
Accurately clustering biological sequences in linear time by relatedness sorting
2024cites this paper
Genotypic Clustering of H5N1 Avian Influenza Viruses in North America Evaluated by Ordination Analysis
2024cites this paper
Phylogenomic curation of Ovate Family Proteins (OFPs) in the U’s Triangle of Brassica L. indicates stress-induced growth modulation
2024cites this paper
WMSA 2: a multiple DNA/RNA sequence alignment tool implemented with accurate progressive mode and a fast win-win mode combining the center star and progressive strategies
2023cites this paper
Efficient Representation of Biochemical Structures for Supervised and Unsupervised Machine Learning Models Using Multi-Sensoric Embeddings
2023cites this paper
UPP2: fast and accurate alignment of datasets with fragmentary sequences
2023cites this paper
Towards the accurate alignment of over a million protein sequences: Current state of the art.
2023cites this paper
Overview of Current MSA Programs
2022cites this paper
Methodology-Centered Review of Molecular Modeling, Simulation, and Prediction of SARS-CoV-2
2022cites this paper
Package ‘kmer’
2022cites this paper
Comparative mucomic analysis of three functionally distinct Cornu aspersum Secretions
2022influential citation
SALMA: Scalable ALignment using MAFFT-Add
2022cites this paper
CBMDB: A Database for Accessing, Analyzing, and Mining CBM Information
2022cites this paper
Developments in Algorithms for Sequence Alignment: A Review
2022cites this paper
UPP2: Fast and Accurate Alignment Estimation of Datasets with Fragmentary Sequences
2022cites this paper
PyPAn: An automated graphical user interface for protein sequence and structure analyses.
2022cites this paper
Embeddings of genomic region sets capture rich biological associations in lower dimensions
2021cites this paper
The Clustal Omega Multiple Alignment Package.
2021cites this paper
Multiple Sequence Alignment Computation Using the T-Coffee Regressive Algorithm Implementation.
2021cites this paper
Hidden service publishing flow homology comparison using profile‐hidden markov model
2021cites this paper
Recent progress on methods for estimating and updating large phylogenies
2021cites this paper
Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families
2021cites this paper
Exploring Onchocerca volvulus Cysteine Protease Inhibitor for Multi-epitope Subunit Vaccine Against Onchocerciasis: An Immunoinformatics Approach
2021cites this paper
Molecular characterization and biological activity of native Spodoptera littoralis nucleopolyhedrovirus isolate
2021cites this paper
Ancestral sequence reconstruction - An underused approach to understand the evolution of gene function in plants?
2021cites this paper
Assessing SARS-CoV-2 spatial phylogenetic structure: Evidence from RNA and protein sequences
2020cites this paper
Improved production of Humira antibody in the genetically engineered Escherichia coli SHuffle, by co-expression of human PDI-GPx7 fusions
2020cites this paper
Bayesian Phylogenomic Dating
2020cites this paper
TportHMM: Predicting the substrate class of transmembrane transport proteins using profile Hidden Markov Models
2020cites this paper
Host Immune Response Driving SARS-CoV-2 Evolution
2020cites this paper
The Short- and Long-Range RNA-RNA Interactome of SARS-CoV-2
2020cites this paper
Removing Dust By Metacrawler
2020cites this paper
Evaluation of two viral isolates as a potential biocontrol agent against the Egyptian cotton leafworm, Spodoptera littoralis (Boisd.) (Lepidoptera: Noctuidae)
2020cites this paper
Mega-phylogeny sheds light on SARS-CoV-2 spatial phylogenetic structure
2020cites this paper
Prediction of the Three-Dimensional Structure of Phosphate-6-mannose PMI Present in the Cell Membrane of Xanthomonas citri subsp. citri of Interest for the Citrus Canker Control
2020cites this paper
Intact caveolae are required for proper extravillous trophoblast migration and differentiation
2020cites this paper
SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically.
2019cites this paper
Toward insights on determining factors for high activity in antimicrobial peptides via machine learning
2019cites this paper
QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction
2019cites this paper
Large multiple sequence alignments with a root-to-leaf regressive method
2019cites this paper
A Quantum-inspired optimization Heuristic for the Multiple Sequence Alignment Problem in Bio-computing
2019cites this paper
Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems
2019cites this paper
Kalign 3: multiple sequence alignment of large datasets
2019cites this paper
DUST Removal Framework Based on Improved Multiple Sequence Alignment Technique
2019cites this paper
Übersicht aktueller MSA-Programme
2019cites this paper
Studying the Evolution of Histone Variants Using Phylogeny.
2018cites this paper
Clustal Omega for making accurate alignments of many protein sequences
2018cites this paper
Progressive multiple sequence alignment with indel evolution
2018cites this paper
Motif-Aware PRALINE: Improving the alignment of motif regions
2018cites this paper
iDUSTER: Improved Method for Removing DUST Based on Efficient Multiple Sequence Alignment Technique
2018cites this paper
Parallel PoMSA for Aligning Multiple Biological Sequences on Multicore Computers
2018cites this paper
Visual Search and Analysis in Molecular Biology
2018cites this paper
Universal Architectural Concepts Underlying Protein Folding Patterns
2018cites this paper
Fast and accurate large multiple sequence alignments using root-to-leave regressive computation
2018cites this paper
Computational modeling of cohesin ATPase dynamics
2018cites this paper
Bioinformatic characterisation of genes associated with coenzyme A biosynthesis in mycoplasmas and expression and isolation of dephospho-coenzyme A kinase from Mycoplasma sp. Ms02
2018cites this paper
Dust Elimination using Web Spider
2018cites this paper
Bioinformatics
2018cites this paper
Multiple Sequence Alignment.
2017cites this paper
Selecting the "Closest to Optimal" Multiple Sequence Alignment Using Multi-Layer Perceptron
2017cites this paper
MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization
2017cites this paper
Multiobjective characteristic-based framework for very-large multiple sequence alignment
2017cites this paper
TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs
2017influential citation
Identifiability of Phylogenetic Parameters from k-mer Data Under the Coalescent
2017cites this paper
Integrating networks and comparative genomics reveals retroelement proliferation dynamics in hominid genomes
2017cites this paper
Evolution and regulation of Bigelowiella natans light-harvesting antenna system.
2017cites this paper
Superior long-term synaptic memory induced by combining dual pharmacological activation of PKA and ERK with an enhanced training protocol
2017cites this paper
Near Duplicate URL Detection for Removing Dust Unique Key
2017cites this paper
Antibody Therapeutic Prediction and Design: Immunogenicity and Stability
2017cites this paper
Protein multiple sequence alignment benchmarking through secondary structure prediction
2017cites this paper
A Shared Memory Method For Enhancing The HTNGH Algorithm Performance: Proposed Method
2017cites this paper
Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties
2016cites this paper
Inferring Disease Associated Phosphorylation Sites via Random Walk on Multi-Layer Heterogeneous Network
2016cites this paper
FAMSA: Fast and accurate multiple sequence alignment of huge protein families
2016cites this paper
Multiple sequence alignment modeling: methods and applications
2016cites this paper
Dust Removal for Improving Search Pattern Analysis for Effective Web Results
2016cites this paper
The impact of guide trees in large-scale protein multiple sequence alignments
2016influential citation
Hacia una nueva filogenia de Tulasnella mediante la combinación de descriptores moleculares
2016cites this paper
Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking
2016cites this paper
Multiple Sequence Alignment Using External Sources Of Information
2016cites this paper
QuickProbs 2: Towards rapid construction of high-quality alignments of large protein families
2015cites this paper
Homology modeling and docking study of Danio rerio Carbonic Anhydrase VI - Pentraxin protein and bioinformatics analysis of extra-cellular CAs
2015cites this paper
Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments
2015cites this paper
OD-seq: outlier detection in multiple sequence alignments
2015cites this paper
Efficient agglomerative hierarchical clustering for biological sequence analysis
2015cites this paper
Instability in progressive multiple sequence alignment algorithms
2015cites this paper
Statistically Consistent k-mer Methods for Phylogenetic Tree Reconstruction
2015cites this paper
Constitutive overexpression of the TaNF-YB4 gene in transgenic wheat significantly improves grain yield
2015cites this paper
Recovering accuracy methods for scalable consistency library
2015cites this paper
A Survey of Multiple Sequence Alignment Techniques
2015cites this paper
Removing DUST Using Multiple Alignment of Sequences
2015cites this paper
Clustal Omega
2014cites this paper