The principal objectives of the course are to cover the major algorithms used in bioinformatics: sequence alignment, multiple sequence alignment, and phylogeny; classifying patterns in sequences; secondary structure prediction; 3D structure prediction; and analysis of gene expression data. This includes dynamic programming, machine learning, simulated annealing, and clustering algorithms. Algorithmic principles will be emphasized. The course includes a project.
This is a reading course for research students of Dr Butler only.
It covers material on protein sequence analysis required for their research.
All information on content and evaluation is provided in the course outline.
Assignment 1 Due Monday 30 September 2019 at 16:00
Prepare a 15-minute presentation of your work for assignment 1,
covering the design of your Python code, its testing, and its output.
Show examples of your code, and examples of the output.
Submit a hardcopy of the presentation.
Assignment 2 Due Monday 7 October 2019 at 16:00
Submit a hardcopy of your report explaining UPGMA and mbed algorithms for constructing phylogenetic trees.
Project 1 Due Tuesday 15 October 2019 at 16:00
A take-home project to develop Python scripts to cluster the sequences in TCDB (Transporter Classification Database)
based on similarity measures derived from blast pairwise alignment of all pairs of sequences in the TCDB,
and using at least two clustering algorithms.
Do clusters conform to the TC nomenclature?
Are clusters consistent with the GO annotation of the Swissprot entries in the cluster?
Do the different clustering algorithms agree, or disagree, in their clusters?
Prepare a 15-minute presentation showing the results of clustering, and the conformity of clusters to the TC nomenclature, and the GO annotations.
A technical report of up to 10 pages in IEEE two-column format is due Monday 4 November 2019 at 16:00
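One simple way to quantify whether two clusterings of the same sequences agree is the Rand index: the fraction of sequence pairs on which both clusterings make the same decision (together in both, or apart in both). The sketch below is only an illustration; the function name and the label lists are hypothetical, and the project does not prescribe this measure.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels assigned to item i
    by the first and second clustering algorithm respectively.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one item: nothing to disagree about
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical labels for four sequences from two algorithms.
example_score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

A value of 1.0 means the two clusterings are identical up to renaming of clusters; chance-corrected variants exist but are not shown here.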
Assignment 3 Due Friday 8 November 2019 at 16:00
Prepare a report (but not a 15-minute presentation) of your work for assignment 3,
showing the comparison of the performance of Clustal Omega and T-Coffee
on the Balibase benchmark.
Submit a hardcopy of the technical report.
Assignment 4 Due Monday 11 November 2019 at 16:00
Prepare a 15-minute presentation of your work for assignment 4,
showing the coverage of eggNOG annotations of the genomes,
and the comparison of their GO BP aspect.
How many transport proteins do the organisms have?
Submit a hardcopy of the technical report.
Project 2 Due Monday 25 November 2019 at 16:00
Prepare a 15-minute presentation showing the results of your HMM-based classifiers.
A technical report of up to 10 pages in IEEE two-column format is due (changed) Friday 29 November 2019 at 16:00
Learning Outcomes Knowledge:
Basics of Central Dogma of Genomics: DNA, RNA, nucleotide, amino acid; transcription, translation; gene, mRNA, protein sequences.
Proteins: amino acid sequences; primary, secondary, tertiary, quaternary structures.
Cell components: nucleus, mitochondrion, chloroplast (in plants), endoplasmic reticulum (ER), Golgi, vacuoles, membranes.
Cell processes: metabolism, catabolism, transport, regulation, signaling; cell cycle; cell death (apoptosis).
Become familiar with the context of the project: membrane proteins and their types; classification of transport proteins;
microbial communities (microbiomes); host-microbiome interactions; relevance to agriculture.
Become familiar with metagenomics and other meta-omics.
Learning Outcomes Skills:
Know how to access and query resources.
Understand fasta format for protein sequences, and how to manipulate them.
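As a small illustration of the FASTA skill above, here is a minimal sketch of reading and writing records using only the standard library. The function names and the two example records are hypothetical (they are not real database entries); in practice Biopython provides a more complete parser.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one record, wrapping the sequence at `width` residues per line."""
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return f">{header}\n{body}"


# Hypothetical two-record example.
EXAMPLE = """>seq1 hypothetical transporter
MKTLLVAG
AVLSA
>seq2 another made-up record
MSTNPKPQ
"""

records = list(read_fasta(EXAMPLE.splitlines()))
```

The same `read_fasta` function works on an open file handle, since a file is also an iterable of lines.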
The PhD and Masters theses by my students will have background on the biology and the bioinformatics methods. This is a very good place to start your reading. All recent Concordia theses can be found in Spectrum, the Library's Open Access portal.
Christine Houry Kehyayan (2013), Using Synteny in Phylogenomics Algorithms to Cluster Proteins. Department of Computer Science and Software Engineering, Concordia University. link
Faizah Aplop (2016), Computational Approaches for Improving the Reconstruction of Metabolic Pathways. Department of Computer Science and Software Engineering, Concordia University. link
Qing Ye (2019), Classifying Transport Proteins Using Profile Hidden Markov Models and Specificity Determining Sites, Department of Computer Science and Software Engineering, Concordia University. link
Akhil Jobby (2019), Multiple Sequence Alignment of Beta Barrel Transmembrane Proteins, Department of Computer Science and Software Engineering, Concordia University. link
Transporters
The following papers, and Munira Alballa's chapter on her dataset, will help understand transporters and their substrate classes.
Milton H. Saier, Jr. A Functional-Phylogenetic Classification System for Transmembrane Solute Transporters, Microbiology and Molecular Biology Reviews, June 2000, p. 354-411 link
Philipp Paparoditis, Ake Vastermark, Andrew J. Le, John A. Fuerst, Milton H. Saier Jr. Bioinformatic analyses of integral membrane transport proteins encoded within the genome of the planctomycetes species, Rhodopirellula baltica, Biochimica et Biophysica Acta 1838 (2014) 193-215 link
Munira Alballa, Comprehensive Report. link
Munira Alballa, Doctoral Proposal. link
Munira Alballa, Chapter on the Dataset. link
Microbiomes
Rob Knight et al, Best practices for analysing microbiomes, Nature Reviews Microbiology volume 16, pages 410-422 (2018) pubmed link pdf
N Segata et al. Computational meta'omics for microbial community studies. Molecular Systems Biology 9.1 (2013): 666. pubmed
A.L. Gould et al, Microbiome interactions shape host fitness, PNAS December 18, 2018 115 (51) E11951-E11960; first published December 3, 2018 https://doi.org/10.1073/pnas.1809349115 link
P.E. Busby et al, Research priorities for harnessing plant microbiomes in sustainable agriculture, PLoS Biol. 2017 Mar; 15(3): e2001793. Published online 2017 Mar 28. doi: 10.1371/journal.pbio.2001793 link
AM Thomas and N Segata, Multiple levels of the unknown in microbiome research, BMC Biology volume 17, Article number: 48 (2019) link
RL Butt and H Volkoff, Gut Microbiota and Energy Homeostasis in Fish, Front Endocrinol (Lausanne). 2019; 10: 9. Published online 2019 Jan 24. doi: 10.3389/fendo.2019.00009 link
MZNM Zoqratt et al, Microbiome analysis of Pacific white shrimp gut and rearing water from Malaysia and Vietnam: implications for aquaculture research and management, PeerJ. 2018; 6: e5826. Published online 2018 Oct 30. doi: 10.7717/peerj.5826 link
Membrane Proteins
AH Butt, N Rasool, YD Khan, A treatise to computational approaches towards prediction of membrane protein and its subtypes. The Journal of Membrane Biology, 250(1), 2017, 55-76 pubmed
MM Gromiha and YY Ou, Bioinformatics approaches for functional annotation of membrane proteins. Briefings in Bioinformatics. 2014;15(2):155-168. pubmed
KD Tsirigos et al, Topology of membrane proteins: predictions, limitations and variations. Curr Opin Struct Biol. 2018 Jun;50:9-17. doi: 10.1016/j.sbi.2017.10.003. Epub 2017 Nov 5. pubmed pdf
Get familiar with these resources, and learn how to access them from Python programs.
UniProt
This database contains Swiss-Prot, a manually curated set of protein annotations, as well as TrEMBL, which is the subset of UniProt that is electronically annotated.
Access is via browser-based queries; a RESTful API; and a SPARQL end-point.
NCBI
Home of blast family of algorithms, good introductory information on genomics and bioinformatics, and several databases, notably nr, taxon, and RefSeq.
Also home of pubmed for biomedical literature, and Entrez which links literature, genes, sequences, MeSH ontology, and more.
Ensembl
Home of the annotation of the human genome.
Plus many other genomes.
Model Organism Databases
EcoCyc
SGD
TAIR
Biopython
Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
Learning Outcomes Knowledge:
Alignment as a string matching problem.
Alignment as an optimization problem.
Local vs global alignment; Deterministic, exact alignment vs heuristic alignment; Pairwise vs multiple alignment.
Objective functions; scoring (substitution) matrices; gap penalty, gap opening, gap extension.
Algorithms: dynamic programming.
Algorithms: Needleman-Wunsch, Smith-Waterman.
Algorithms: blast, gapped blast, PSI-blast.
Algorithmic approaches for multiple sequence alignment: progressive, iterative (stochastic and non-stochastic), consistency-based.
Algorithms: T-Coffee, ClustalW, ClustalO, MAFFT, MUSCLE, TM-Coffee.
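To make the dynamic programming view of global alignment concrete, the following is a minimal sketch of Needleman-Wunsch. It assumes a simple match/mismatch score and a linear gap penalty; these parameter values are illustrative assumptions, whereas real protein alignment would use a substitution matrix and separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
```

Smith-Waterman differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell, which turns the global alignment into a local one.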
Learning Outcomes Critical thought
Sequence similarity and sequence homology and their relationship.
Understand the grey area of percent identity (the "twilight zone"), where sequence similarity alone does not settle the interpretation of homology.
Learning Outcomes Skills:
Know how to run stand-alone blastall, set parameters, and create tabular output for protein sequences.
Know how to interpret score, e-value, coverage, and percent identity.
Know how to run T-Coffee, ClustalW, ClustalO, MAFFT, MUSCLE, TM-Coffee.
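When working with BLAST tabular output from Python, a small parser helps with interpreting percent identity, e-value, and coverage. The sketch below assumes the default 12-column tabular layout (legacy `-m 8`, BLAST+ `-outfmt 6`); the example line and the query length of 250 are made up for illustration. Query coverage is not one of the default columns, so it is computed here from the query start and end positions and a query length supplied by the caller.

```python
# Default column order of BLAST tabular output.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_hit(line):
    """Convert one tab-separated BLAST hit line into a typed dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    hit = {}
    for name, value in zip(COLUMNS, values):
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


def query_coverage(hit, query_length):
    """Percentage of the query covered by this single alignment."""
    aligned = abs(hit["qend"] - hit["qstart"]) + 1
    return 100.0 * aligned / query_length


# Hypothetical hit line; identifiers and numbers are illustrative only.
example_line = "q1\ts1\t45.50\t200\t100\t3\t11\t210\t5\t204\t1e-50\t180.0"
example_hit = parse_hit(example_line)
example_coverage = query_coverage(example_hit, query_length=250)
```

A typical filter then keeps only hits below an e-value cutoff and above chosen identity and coverage thresholds; the thresholds themselves depend on the task.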
Pairwise Sequence Alignment
NCBI BLAST Documentation
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman.
Basic Local Alignment Search Tool.
Journal of Molecular Biology, 215:403-410, 1990.
link
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui
Zhang, Zheng Zhang, Webb Miller, and David J. Lipman.
Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs.
Nucleic Acids Research, 25(17):3389-3402, 1997.
link
Stephen F. Altschul and Warren Gish.
Local alignment statistics.
Methods in Enzymology, 266:460-480, 1996.
link
Marco Pagni and C. Victor Jongeneel.
Making sense of score statistics for sequence alignments.
Briefings in Bioinformatics, 2(1):51-67, 2001.
link
Multiple Sequence Alignment
Cedric Notredame,
Recent progresses in multiple sequence alignment: a survey.
Pharmacogenomics 3(1) (2002) 131–144.
pubmed
link
Cedric Notredame,
Recent evolutions of multiple sequence alignment algorithms.
PLoS Computational Biology, 2007 Aug 31;3(8):e123
link
Julie D. Thompson, Frédéric Plewniak, and Olivier Poch.
A comprehensive comparison of multiple sequence alignment programs.
Nucleic Acids Research 27.13 (1999): 2682-2690.
link
Timo Lassmann and Erik LL Sonnhammer.
Quality assessment of multiple alignment programs.
FEBS letters 529.1 (2002): 126-130.
link
Julie D. Thompson, Benjamin Linard, Odile Lecompte, and Olivier Poch.
A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives,
PLoS ONE 6(3): e18093. https://doi.org/10.1371/journal.pone.0018093
link
ClustalW and ClustalO
T-Coffee and TM-Coffee
MUSCLE
MAFFT
Similarity versus Homology
Burkhard Rost,
Twilight zone of protein sequence alignments.
Protein Eng. 1999 Feb;12(2):85-94.
pubmed
link
Burkhard Rost,
Enzyme function less conserved than anticipated.
J Mol Biol. 2002 Apr 26;318(2):595-608.
pubmed
link
Weidong Tian, Jeffrey Skolnick,
How well is enzyme function conserved as a function of pairwise sequence identity?
J Mol Biol. 2003 Oct 31;333(4):863-82.
pubmed
link
Learning Outcomes Knowledge:
Types of amino acids: charge (acidic/neutral), hydrophobic/hydrophilic, small/large, polar, aromatic, basic, etc.
Encodings: AAC, PAAC (dipeptide), PseAAC (Chou's encoding), Split encodings (regional encodings).
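The two simplest encodings listed above can be sketched in a few lines. This example assumes the 20 standard amino acids in alphabetical one-letter order and ignores any other symbol; AAC yields a 20-dimensional frequency vector and the dipeptide encoding a 400-dimensional one. PseAAC and the split encodings are not shown, and the function names are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def aac(seq):
    """Amino acid composition: frequency of each standard residue."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Frequency of each of the 400 ordered pairs of adjacent residues."""
    seq = seq.upper()
    index = {a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}
    counts = [0] * len(index)
    total = 0
    for k in range(len(seq) - 1):
        pair = seq[k:k + 2]
        if pair in index:  # skip pairs containing non-standard symbols
            counts[index[pair]] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(index)


aac_vector = aac("AAC")
dipeptide_vector = dipeptide_composition("AAC")
```

Both vectors sum to 1 for any sequence with at least one standard residue (or one standard pair), which makes sequences of different lengths comparable.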
Baldi and Brunak, 2nd edition, Table 6.1.
AH Butt, N Rasool, YD Khan,
A treatise to computational approaches towards prediction of membrane protein and its subtypes.
The Journal of Membrane Biology, 250(1), 2017, 55-76
pubmed
Ronit Hod, Refael Kohen, Yael Mandel-Gutfreund.
Searching for protein signatures using a multilevel alphabet.
Proteins. 2013 Jun;81(6):1058-68. doi: 10.1002/prot.24261. Epub 2013 Feb 27.
pubmed
Learning Outcomes Knowledge:
What is a profile Hidden Markov Model (HMM)?
How is the model represented as a data structure?
Viterbi algorithm
Algorithms for HMM building and scanning.
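The Viterbi algorithm finds the single most probable state path for an observed sequence. The sketch below applies it to a plain two-state HMM rather than a full profile HMM with match, insert, and delete states; the toy model (a membrane state "M" and a loop state "L" emitting hydrophobic "h" or polar "p" symbols) and all of its probabilities are made up for illustration. It works in log space and assumes every probability in the model is strictly positive.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_state_path, log_probability) for the observation sequence."""
    # best[t][s] = log probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (
                best[t - 1][prev]
                + math.log(trans_p[prev][s])
                + math.log(emit_p[s][obs[t]])
            )
            back[t][s] = prev  # remember where the best path came from
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model; all numbers are illustrative assumptions.
STATES = ["M", "L"]
START = {"M": 0.5, "L": 0.5}
TRANS = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
EMIT = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.2, "p": 0.8}}

toy_path, toy_logp = viterbi("hhhppp", STATES, START, TRANS, EMIT)
```

In a profile HMM the same recurrence is used, but the state set and allowed transitions follow the columns of the multiple alignment from which the model was built.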
Learning Outcomes Critical thought
HMMs and the detection of remote sequence homology.
Understand how to interpret output of hmmer classifiers.
Learning Outcomes Skills:
Know how to run hmmer tools.
Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. link pdf
Sean R. Eddy.
Profile hidden Markov models.
Bioinformatics, 14(9):755-763, 1998.
link
Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjölander, and David Haussler.
Hidden Markov models in computational biology: Applications to protein modeling.
Journal of Molecular Biology, 235(5):1501-1531, 1994.
link
hmmer
Learning Outcomes Skills:
Know how to run SpeerServer, GroupSim, Xdet, TCS.
Know how to run SpeerServer with Secator and Sci-Phy to compute protein family subgroups.
Elin Teppa, Angela D. Wilkins, Morten Nielsen, and Cristina Marino Buslje. Disentangling evolutionary signals: conservation, specificity determining positions and coevolution. Implication for catalytic residue prediction. BMC Bioinformatics 13, no. 1 (2012): 235. link
Abhijit Chakraborty and Saikat Chakrabarti. A survey on prediction of specificity-determining sites in proteins. Briefings in Bioinformatics 16, no. 1 (2014): 71-88. link
Nelson Gil and Andras Fiser. The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis. Bioinformatics. 2019 Jan 1;35(1):12-19. doi: 10.1093/bioinformatics/bty523. link
Secator see the PipeAlign paper below.
Sci-Phy see the Brown et al paper below.
Learning Outcomes Knowledge:
The curation process recording knowledge gleaned from the scientific literature and experiments.
Human annotation following a curation protocol versus automated annotation by a computer algorithm/tool/classifier.
Ontology as a way for a community to share an agreed set of terminology, concepts, and relationships.
Importance of ontology for data sharing, integration, and machine representation of knowledge.
Familiar with Gene Ontology, ChEBI, ECO.
How Gene Ontology Annotation (GOA) database connects GO with SwissProt and model organism databases.
Learning Outcomes Skills:
Know how to trace trees of GO relationships for a GO term.
Know how to interpret evidence codes.
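Tracing the relationships for a term amounts to walking a directed acyclic graph from the term towards the root. The sketch below uses a tiny made-up ontology: the identifiers starting with "X:" are hypothetical placeholders, not real GO accessions, and only `is_a` and `part_of` edges are modelled. A real analysis would load the ontology from the Gene Ontology distribution instead.

```python
from collections import deque

# Hypothetical toy ontology: term -> list of (relation, parent term).
TOY_PARENTS = {
    "X:membrane_ion_transport": [
        ("is_a", "X:ion_transport"),
        ("part_of", "X:membrane_process"),
    ],
    "X:ion_transport": [("is_a", "X:transport")],
    "X:transport": [("is_a", "X:localization")],
    "X:localization": [("is_a", "X:process")],
    "X:membrane_process": [("is_a", "X:process")],
    "X:process": [],
}


def ancestors(term, parents, relations=("is_a", "part_of")):
    """All terms reachable from `term` by following the given relation types."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in parents.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def paths_to_root(term, parents):
    """Every path from `term` up to a term that has no parents."""
    edges = parents.get(term, [])
    if not edges:
        return [[term]]
    paths = []
    for _relation, parent in edges:
        for tail in paths_to_root(parent, parents):
            paths.append([term] + tail)
    return paths


all_ancestors = ancestors("X:membrane_ion_transport", TOY_PARENTS)
is_a_ancestors = ancestors("X:membrane_ion_transport", TOY_PARENTS, relations=("is_a",))
root_paths = paths_to_root("X:membrane_ion_transport", TOY_PARENTS)
```

Restricting the traversal to `is_a` edges gives the strict subclass hierarchy, while including `part_of` gives the broader set of terms that an annotation to this term implies.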
Curation
International Society for Biocuration
has a brief discussion of
What is biocuration
elaborated in a recent article:
Biocuration: Distilling data into knowledge.
PLoS Biology Published: April 16, 2018 https://doi.org/10.1371/journal.pbio.2002846
link
Curation and annotation
K Galens, S Daugherty, HH Creasy, S Angiuoli, O White, J Wortman, A Mahurkar, MG Giglio,
The IGS standard operating procedure for automated prokaryotic annotation.
Stand Genomic Sci 4(2) (2011) 244-51
link
BJ Haas, MD Pearson, CA Cuomo, JR Wortman,
Approaches to fungal genome annotation.
Mycology 2(3) (2011) 118-141
link
Marie E Bolger, Borjana Arsova, Björn Usadel,
Plant genome and transcriptome annotations: from misconceptions to simple solutions.
Briefings in Bioinformatics, Volume 19, Issue 3, May 2018, Pages 437-449, https://doi.org/10.1093/bib/bbw135
link
The way the Gene Ontology captures concepts related to transport is presented in the GO Wiki page on Transport and Transporters where the activity of transport of a compound x is discussed. Note that x is specified using ChEBI, an ontology for Chemical Entities of Biological Interest.
SwissProt curation of transmembrane proteins is presented here. Note that transmembrane alpha-helix regions are annotated with the help of TMHMM, Memsat, and Phobius tools. Note that beta-barrel TMS regions are not annotated.
The Human Protein Atlas presents their protocol for annotating secreted and membrane proteins using a consensus from multiple tools: MEMSAT3, MEMSAT3-SVM, Phobius, SCAMPI, SPOCTOPUS, THUMBUP, TMHMM, and GPCRHMM.
Ontologies
The Gene Ontology Consortium.
Gene Ontology: tool for the unification of biology.
Nature Genetics, 25(1):25-29, 2000.
link
Janna Hastings, Paula de Matos, Adriano Dekker, Marcus Ennis, Bhavana Harsha,
Namrata Kale, Venkatesh Muthukrishnan, Gareth Owen, Steve Turner, Mark
Williams, et al.
The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013.
Nucleic Acids Research, 41(D1):D456-D463, 2013.
link
Michelle Giglio, Rebecca Tauber, Suvarna Nadendla, James Munro, Dustin Olley, Shoshannah Ball, Elvira Mitraka, Lynn M Schriml, Pascale Gaudet, Elizabeth T Hobbs, Ivan Erill, Deborah A Siegele, James C Hu, Chris Mungall, Marcus C Chibucos.
ECO, the Evidence and Conclusion Ontology: community standard for evidence information.
Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D1186-D1194, https://doi.org/10.1093/nar/gky1036
link
Rachael P Huntley, Tony Sawford, Prudence Mutowo-Meullenet, Aleksandra Shypitsyna, Carlos Bonilla, Maria J Martin, and Claire O'Donovan.
The GOA database: gene ontology annotation updates for 2015.
Nucleic Acids Research, 43(D1):D1057-D1063, 2015.
link
Learning Outcomes Knowledge:
Common ML tasks for protein sequence analysis: secondary structure prediction, signal prediction, subcellular location.
Common approaches to ML tasks: nearest-neighbour, SVM, HMM, similarity, hybrid.
State-of-the-art for each ML task for protein sequence analysis.
Role of amino acid composition in ML for protein sequence analysis.
Role of regions of protein sequence in ML for protein sequence analysis
Learning Outcomes Critical thought
Constructing a "gold standard" dataset for training and testing.
Evaluation of ML classifiers for protein sequence analysis.
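For a binary classifier (for example, transporter versus non-transporter), evaluation usually starts from the confusion matrix. The sketch below computes sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC) from true and predicted labels; the label lists are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    # Convention used in this sketch: undefined ratios are reported as 0.0.
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Common binary classification metrics as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "precision": safe_div(tp, tp + fp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical labels: 1 = positive class, 0 = negative class.
metrics = evaluate([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

MCC is often preferred over accuracy when the classes are imbalanced, since it uses all four cells of the confusion matrix.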
Learning Outcomes Skills:
Know how to compute amino acid composition vectors for the whole sequence.
Know how to compute amino acid composition vectors based on regions of the sequence.
Know how to run hmmer3 and use Python scikit-learn library.
Know how to train a classifier and evaluate it for an ML task for protein sequence analysis.
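Composition vectors based on regions of the sequence can be sketched by cutting the sequence into consecutive parts and concatenating the composition of each part. Splitting into three equal-length regions is an assumption made here for illustration; region boundaries could instead come from fixed terminal lengths or predicted topology. The resulting vector could then be passed to a scikit-learn classifier, which is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def aac(seq):
    """Amino acid composition of a (possibly empty) sequence."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def split_aac(seq, parts=3):
    """Concatenate the composition of `parts` consecutive regions of seq.

    The regions have (almost) equal length; the result has 20 * parts entries.
    """
    n = len(seq)
    vector = []
    for i in range(parts):
        region = seq[i * n // parts:(i + 1) * n // parts]
        vector.extend(aac(region))
    return vector


# Made-up sequence whose three regions each contain a single residue type.
region_vector = split_aac("AAACCCDDD", parts=3)
```

With `parts=1` this reduces to the whole-sequence composition, so the same function covers both of the composition skills above.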
H Nielsen, KD Tsirigos, S Brunak, G von Heijne,
A Brief History of Protein Sorting Prediction.
Protein J. 2019 Jun;38(3):200-216. doi: 10.1007/s10930-019-09838-3.
pubmed
link
Paul Horton, Keun-Joon Park, Takeshi Obayashi, Naoya Fujita, Hajime Harada, CJ Adams-Collier, and Kenta Nakai.
WoLF PSORT: protein localization predictor.
Nucleic Acids Research, 35(suppl 2):W585-W587, 2007.
link
Learning Outcomes Knowledge:
Protein family as a set of proteins related by evolution (or structure, or function).
The pfam database of protein families.
Evolutionary events of speciation and duplication, leading to orthologs and paralogs respectively.
Phylogenomics as a computational attempt to distinguish orthologs and paralogs.
Relationship between phylogenomics, clustering, and phylogenetic tree construction.
Orthologous groups as a computational definition of protein families.
eggNOG as a successor of COG and KOG.
Algorithm for the construction of eggNOG.
Algorithm for eggNOG-mapper to search eggNOG, given a set of protein sequences.
Algorithms to cluster protein sequences: Markov clustering (MCL), Transitivity clustering (TransClust), Hierarchical orthologous groups (HOG).
Learning Outcomes Critical thought
Distinction between phylogenetic trees for species, genes, and proteins.
Difficulty of distinguishing orthologs and paralogs derived from recent duplication events.
Relationship between determining family subgroups and specificity determining sites.
Learning Outcomes Skills:
Know how to search the pfam database, given a protein sequence.
Know how to run eggNOG-mapper and analyse the results.
Know how to run various algorithms to cluster protein sequences based on sequence similarity.
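As a baseline for clustering by sequence similarity, sequences can be grouped into connected components of a thresholded similarity graph (single-linkage clustering). This is deliberately simpler than MCL, TransClust, or HOG and is not a substitute for them; the sequence identifiers, scores, and threshold below are invented for illustration.

```python
def cluster_by_similarity(ids, scored_pairs, threshold):
    """Single-linkage clusters: link two sequences if their score >= threshold.

    ids          -- all sequence identifiers (singletons are kept)
    scored_pairs -- iterable of (id_a, id_b, similarity_score)
    Returns a sorted list of sorted clusters.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Find the representative of x, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise scores (for example, percent identity from blast).
example_pairs = [("a", "b", 90), ("b", "c", 85), ("c", "d", 20), ("d", "e", 95)]
example_clusters = cluster_by_similarity("abcdef", example_pairs, threshold=50)
```

Single linkage tends to chain loosely related families together through a few strong pairwise hits, which is one motivation for the graph-based methods listed in the knowledge outcomes.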
Christine Houry Kehyayan (2013), Using Synteny in Phylogenomics Algorithms to Cluster Proteins. Department of Computer Science and Software Engineering, Concordia University. link
Protein families
Marco Punta, Penny C. Coggill, Ruth Y. Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, Alex Bateman, and Robert D. Finn. The Pfam protein families database. Nucleic Acids Research, 40(D1):D290-D301, 2012 link
Phylogenomics
Adrian M Altenhoff and Christophe Dessimoz.
Inferring orthology and paralogy.
Methods in Molecular Biology, 855:259-279, 2012.
pubmed
link
Natasha Glover, Christophe Dessimoz, Ingo Ebersberger, Sofia K Forslund, Toni Gabaldón, Jaime Huerta-Cepas, Maria-Jesus Martin, Matthieu Muffato, Mateus Patricio, Cécile Pereira, Alan Sousa da Silva, Yan Wang, Quest for Orthologs Consortium, Erik Sonnhammer, Paul D Thomas.
Advances and Applications in the Quest for Orthologs.
Molecular Biology and Evolution, Volume 36, Issue 10, October 2019, Pages 2157-2164, https://doi.org/10.1093/molbev/msz150
link
Orthologous Groups
Sean Powell, Kristoffer Forslund, Damian Szklarczyk, Kalliopi Trachana,
Alexander Roth, Jaime Huerta-Cepas, Toni Gabaldón, Thomas Rattei, Chris
Creevey, Michael Kuhn, et al.
eggNOG v4.0: nested orthology inference across 3686 organisms.
Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D231-D239, https://doi.org/10.1093/nar/gkt1253
link
Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering, and Peer Bork.
Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper.
Molecular Biology and Evolution 34, no. 8 (2017): 2115-2122.
link
Family Subgroups
F Plewniak, L Bianchetti, Y Brelivet, A Carles, F Chalmel, O Lecompte,
T Mochel, L Moulinier, A Muller, J Muller, V Prigent, R Ripp, JC Thierry, JD Thompson, N Wicker, O Poch,
PipeAlign: A new toolkit for protein family analysis.
Nucleic Acids Res 31(13) (2003) 3829-32
link
DP Brown, N Krishnamurthy, K Sjölander,
Automated protein subfamily identification and classification.
PLoS Comput Biol 3(8) (2007) e160
link
Clustering methods
Markov Clustering (MCL) and orthoMCL
Li Li, Christian J. Stoeckert, Jr., and David S. Roos.
OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes.
Genome Res. 2003 Sep; 13(9): 2178-2189. doi: 10.1101/gr.1224503
link
TransClust
T. Wittkop, Transitivity Clustering: Clustering biological data by unraveling hidden transitive substructures,
PhD Thesis, Universität Bielefeld, Suedwestdeutscher Verlag fuer Hochschulschriften, 2010.
link
Tobias Wittkop, Dorothea Emig, Sita Lange, Sven Rahmann, Mario Albrecht, John H Morris, Sebastian Böcker, Jens Stoye, Jan Baumbach,
Partitioning biological data with transitivity clustering.
Nature Methods volume 7, pages 419-420 (2010)
link
software
HOG Hierarchical Orthologous groups
Clément-Marie Train, Natasha M Glover, Gaston H Gonnet, Adrian M Altenhoff and Christophe Dessimoz,
Orthologous Matrix (OMA) algorithm 2.0:
more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference,
Bioinformatics. 2017 Jul 15; 33(14): i75-i82. doi: 10.1093/bioinformatics/btx229
link
CD-Hit see below.
Learning Outcomes Skills:
Know that such methods exist.
Martin Steinegger, Markus Meier, Milot Mirdita, Harald Vöhringer, Stephan J. Haunsberger, and Johannes Söding. HH-suite3 for fast remote homology detection and deep protein annotation BMC Bioinformatics. 2019; 20: 473. Published online 2019 Sep 14. doi: 10.1186/s12859-019-3019-7 link
Michael Remmert, Andreas Biegert, Andreas Hauser, Johannes Söding. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011 Dec 25;9(2):173-5. doi: 10.1038/nmeth.1818. link
Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012 Dec 1; 28(23): 3150-3152. doi: 10.1093/bioinformatics/bts565 link
Benjamin Buchfink, Chao Xie, Daniel H Huson. Fast and sensitive protein alignment using DIAMOND. Nat Methods (2015) 12:59-60. link
Martin Steinegger and Johannes Söding. Clustering huge protein sequence sets in linear time Nat Commun. 2018; 9: 2542. doi: 10.1038/s41467-018-04964-5 link