COMP 6811 Bioinformatics Algorithms

Fall 2022 Section FF


Lectures


Week 1 Wednesday 2022-09-07
Introduction. Biology, genomics, biotechnology, bioinformatics. Central dogma: transcription of DNA to mRNA, translation of mRNA to protein. Sequence to Structure to Function. Protein families, clustering, orthologous groups. eggNOG.

slides

Read Jeff Gauthier, Antony T Vincent, Steve J Charette, Nicolas Derome, A brief history of bioinformatics, Briefings in Bioinformatics, Volume 20, Issue 6, November 2019, Pages 1981–1996, https://doi.org/10.1093/bib/bby063

People of Note: Gene Myers, Burkhard Rost, Peer Bork, Gaston Gonnet, Des Higgins, Johannes Söding, Daniel Huson, Christophe Dessimoz

Week 2 Wednesday 2022-09-14
Protein sequence resources. Amino acid encodings. Pairwise sequence alignment: substitution matrices, gap penalties. Altschul statistics. Score, e-value, percent identity, coverage. CD-Hit.
Assignment 1 string algorithms due.

Deoxyribonucleic acid (DNA)
Nucleotide a, c, t, g and u; IUPAC codes
Amino acid
FASTA format
UniProt protein entry: enzyme Beta-galactosidase and Blast search against model organisms at NCBI
Smith-Waterman algorithm
Blast
Blast practicalities (note the tabular output)

Other proteins
UniProt protein entry: Haemoglobin unit Blast against nr
UniProt protein entry: GLUT1 glucose transporter Blast against refseq_protein

Other material
Report on pairwise alignment algorithms
Blast ABC - Detailed slides
More lectures from Teresa Przytycka at NCBI
For Blast practicalities, see also Bioinformatics explained: BLAST

Week 3 Wednesday 2022-09-21
Guest lectures on ontologies and deep learning language models for proteins.

Week 4 Wednesday 2022-09-28
Multiple sequence alignment. PSSM, PSI-Blast. Profile Hidden Markov Models (HMM). hmmer.

Multiple Sequence Alignment

Overview article
Cedric Notredame, Recent progress in multiple sequence alignment: a survey, Pharmacogenomics 3 (1) (2002) 131-144

Also
Maria Chatzou, Cedrik Magis, Jia-Ming Chang, Carsten Kemena, Giovanni Bussotti, Ionas Erb, Cedric Notredame, Multiple sequence alignment modeling: methods and applications, Briefings in Bioinformatics, Volume 17, Issue 6, November 2016, Pages 1009–1023, https://doi.org/10.1093/bib/bbv099

Clustal Family wikipedia
Clustal (1988)
ClustalW (1994)
ClustalO [Clustal Omega] (2011)

Other major algorithms
MAFFT [multiple alignment using fast Fourier transform] (2002)
MUSCLE [MUltiple Sequence Comparison by Log-Expectation] (2004)
T-Coffee [Tree-based Consistency Objective Function for Alignment Evaluation] (2000)

Benchmarking
BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs (1999, 2001, 2005)
BAliBASE 4
Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One. 2011 Mar 31;6(3):e18093. doi: 10.1371/journal.pone.0018093. PMID: 21483869; PMCID: PMC3069049.

MSA for eggNOG
Muller J, Creevey CJ, Thompson JD, Arendt D, Bork P. AQUA: automated quality improvement for multiple sequence alignments. Bioinformatics. 2010 Jan 15;26(2):263-5. doi: 10.1093/bioinformatics/btp651. Epub 2009 Nov 19. PMID: 19926669.
combines MAFFT and MUSCLE
using RASCAL to refine MSAs
NORMD to evaluate quality of MSA


Assignment 2 k-mer algorithms due.

Week 5 Wednesday 2022-10-05
HH-Suite alignment/search using pHMMs.

video on PSSM, HMM, COG
video on HMM Viterbi algorithm
video on profile HMM for sequence alignment
Profile hidden Markov models Sean Eddy, Bioinformatics, 1998.
video HMMER: Fast and sensitive sequence similarity searches
Protein homology detection by HMM–HMM comparison Johannes Soeding, Bioinformatics, 2005.

HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment Remmert, Biegert, Hauser, Soeding, Nature Methods, 2012.
HH-suite3 for fast remote homology detection and deep protein annotation Steinegger, ..., Soeding, BMC Bioinformatics, 2019.

Make-Up Day for Quebec Election Day Wednesday 2022-10-12
No lectures for COMP 6811.

Week 6 Wednesday 2022-10-19
DIAMOND
Benjamin Buchfink, Chao Xie, and Daniel Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59-60 (2015). https://doi.org/10.1038/nmeth.3176
DIAMOND_MEGAN talk by Daniel Huson, 2021.
Functional Analysis of Human Microbiome (video) by Curtis Huttenhower, 2013. See paper, Segata et al, Mol Syst Biol 2013.

A framework for microbiome science in public health, Wilkinson et al, Nature Methods 2021.
The Integrative Human Microbiome Project Nature, 2019.

Rob Knight et al, Best practices for analysing microbiomes, Nature Reviews Microbiology, volume 16, pages 410-422 (2018) pubmed link pdf

Mini-Project Report clustering proteins due.

Week 7 Wednesday 2022-10-26
Mid-Term Examination.

Week 8 Wednesday 2022-11-02
Protein families, orthologous groups, phylogenomics.

Week 9 Wednesday 2022-11-09
eggNOG construction.
Assignment 3 TBD due.

Week 10 Wednesday 2022-11-16
eggNOG search using eggNOG-mapper.

Week 11 Wednesday 2022-11-23
TBD
Assignment 4 TBD due.

Week 12 Wednesday 2022-11-30
Project presentations.

Week 13 Wednesday 2022-12-07
Project presentations.
Project Report due.


Learning Outcomes

Learning Outcomes Knowledge:

Cell Biology:
Basics of Central Dogma of Genomics: DNA, RNA, nucleotide, amino acid; transcription, translation; gene, mRNA, protein sequences.
Proteins: amino acid sequences; primary, secondary, tertiary, quaternary structures.
Cell components: nucleus, mitochondrion, chloroplast (in plants), endoplasmic reticulum (ER), Golgi, vacuoles, membranes.
Cell processes: metabolism, catabolism, transport, regulation, signaling; cell cycle; cell death (apoptosis).
Become familiar with microbial communities (microbiomes); host-microbiome interactions; relevance.
Become familiar with metagenomics and other meta-omics.
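
The transcription and translation steps of the central dogma listed above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a coding-strand DNA input that is already in frame, and no handling of ambiguous IUPAC codes; the function names are illustrative and not taken from the course material.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA
```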

Sequence Alignment:
Alignment as a string matching problem.
Alignment as an optimization problem.
Local vs global alignment; Deterministic, exact alignment vs heuristic alignment; Pairwise vs multiple alignment.
Objective functions; scoring (substitution) matrices; gap penalty, gap opening, gap extension.
Algorithms: dynamic programming.
Algorithms: Needleman-Wunsch, Smith-Waterman.
Algorithms: blast, gapped blast, PSI-blast.
Algorithmic approaches for multiple sequence alignment: progressive, iterative (stochastic and non-stochastic), consistency-based.
Algorithms: T-Coffee, ClustalW, ClustalO, MAFFT, MUSCLE, TM-Coffee.
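
As a rough illustration of alignment as an optimization problem solved by dynamic programming, the sketch below computes the optimal global (Needleman-Wunsch) or local (Smith-Waterman) score for two sequences. It is a simplification: it assumes a flat match/mismatch score in place of a substitution matrix and a linear gap penalty in place of separate gap opening and extension costs, and the parameter values are arbitrary.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False gives the Needleman-Wunsch (global) score,
    local=True gives the Smith-Waterman (local) score.
    Only the previous row of the DP table is kept in memory.
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("ACGT", "AGT"))                        # 4
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # 6
```

Recovering the alignment itself, rather than only its score, would require keeping the full table (or back-pointers) and tracing back from the final cell (global) or the best-scoring cell (local).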

Profile Hidden Markov Models:
What is a profile Hidden Markov Model (HMM)?
How is the model represented as a data structure?
Viterbi algorithm
Algorithms for HMM building and scanning.
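
The Viterbi algorithm can be sketched on a small, general discrete HMM. The example below is not a profile HMM (it has no match, insert, and delete states); it assumes a toy two-state model with made-up probabilities that labels a DNA string as GC-rich or AT-rich, and it represents the model as plain dictionaries. It also assumes a non-empty observation sequence and non-zero probabilities, so that logarithms are defined.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # V[t][s]: log-probability of the best path that ends in state s at time t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            V[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]


# Toy model (all numbers are illustrative assumptions).
STATES = ("GC", "AT")
START = {"GC": 0.5, "AT": 0.5}
TRANS = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
EMIT = {
    "GC": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "AT": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}

path, log_p = viterbi("GGCATTA", STATES, START, TRANS, EMIT)
print(path)  # ['GC', 'GC', 'GC', 'AT', 'AT', 'AT', 'AT']
```

Working in log space avoids numerical underflow on long sequences, which is one reason tools for profile HMMs report log-odds scores rather than raw probabilities.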

Curation, Annotation, and Ontologies:
The curation process: recording knowledge gleaned from the scientific literature and experiments.
Human annotation following a curation protocol versus automated annotation by a computer algorithm/tool/classifier.
Ontology as a way for a community to share an agreed set of terminology, concepts, and relationships.
Importance of ontology for data sharing, integration, and machine representation of knowledge.
Become familiar with Gene Ontology, ChEBI, ECO.
How Gene Ontology Annotation (GOA) database connects GO with SwissProt and model organism databases.

Phylogenomics and Orthologous Groups:
Protein family as a set of proteins related by evolution (or structure, or function).
The Pfam database of protein families.
Evolutionary events of speciation and duplication, leading to orthologs and paralogs respectively.
Phylogenomics as a computational attempt to distinguish orthologs and paralogs.
Relationship between phylogenomics, clustering, and phylogenetic tree construction.
Orthologous groups as a computational definition of protein families.
eggNOG as a successor of COG and KOG.
Algorithm for the construction of eggNOG.
Algorithm for eggNOG-mapper to search eggNOG, given a set of protein sequences.
Algorithms to cluster protein sequences: Markov clustering (MCL), Transitivity clustering (TransClust), Hierarchical orthologous groups (HOG).

Learning Outcomes Critical Thought:

Sequence Alignment
Sequence similarity and sequence homology and their relationship.
Understand the grey area of percent identity in which sequence similarity cannot be reliably interpreted as homology.

Profile Hidden Markov Models
HMMs and the detection of remote sequence homology.
Understand how to interpret output of hmmer classifiers.

Phylogenomics and Orthologous Groups
Distinction between phylogenetic trees for species, genes, and proteins.
Difficulty of distinguishing orthologs and paralogs derived from recent duplication events.
Relationship between determining family subgroups and specificity determining sites.

Learning Outcomes Skills:

Basic Skills:
Know how to access and query resources.
Understand fasta format for protein sequences, and how to manipulate them using appropriate algorithms for protein sequence analysis.
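
As a minimal sketch of handling FASTA format, the reader below yields one (header, sequence) pair per record. It assumes plain multi-line FASTA with ">" header lines; the identifiers and sequences in the example are hypothetical, and in practice a library parser may be preferable.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, for illustration only.
EXAMPLE = """>seq1 hypothetical protein A
MKTAYIAK
QRQISFVK
>seq2 hypothetical protein B
MSHHWGYG
"""

for header, sequence in read_fasta(EXAMPLE.splitlines()):
    print(header.split()[0], len(sequence))
```

The same function works on an open file handle, since it only needs an iterable of lines.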

Sequence Alignment Skills:
Know how to run stand-alone blastall, set parameters, and create tabular output for protein sequences.
Know how to interpret score, e-value, coverage, and percent identity.
Know how to run T-Coffee, ClustalW, ClustalO, MAFFT, MUSCLE, TM-Coffee.
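
To make score, e-value, coverage, and percent identity concrete, the sketch below computes each from first principles. It assumes the Karlin-Altschul form E = K·m·n·e^(−λS) without the effective-length corrections that Blast applies, so its numbers will not match Blast output exactly. The λ and K values in the example are illustrative assumptions; real values depend on the substitution matrix and gap costs and are reported by Blast itself.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def percent_identity(aligned_query, aligned_subject):
    """Identical columns as a percentage of the alignment length."""
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


def query_coverage(aligned_query, query_len):
    """Percentage of query residues that take part in the alignment."""
    aligned_residues = sum(1 for c in aligned_query if c != "-")
    return 100.0 * aligned_residues / query_len


# Illustrative parameters (assumed, not taken from a real Blast run).
LAMBDA, K = 0.267, 0.041
print(bit_score(100, LAMBDA, K))
print(e_value(100, 300, 1_000_000, LAMBDA, K))
print(percent_identity("ACD-F", "ACEGF"))  # 60.0
print(query_coverage("ACD-F", 8))          # 50.0
```

The e-value grows linearly with the size of the search space (m·n) and falls exponentially with the score, which is why the same alignment can be significant against a small database and insignificant against nr.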

Profile Hidden Markov Models Skills:
Know how to run hmmer tools.

Curation, Annotation, and Ontologies Skills:
Know how to trace trees of GO relationships for a GO term.
Know how to interpret evidence codes.

Phylogenomics and Orthologous Groups Skills:
Know how to search the Pfam database, given a protein sequence.
Know how to run eggNOG-mapper and analyse the results.
Know how to run various algorithms to cluster protein sequences based on sequence similarity.


Last modified on 6 September 2022 by Greg Butler