COMP 6811 Bioinformatics Algorithms

Fall 2022 Section FF


Calendar Description

The principal objectives of the course are to cover the major algorithms used in bioinformatics; sequence alignment, multiple sequence alignment, phylogeny; classifying patterns in sequences; secondary structure prediction; 3D structure prediction; analysis of gene expression data. This includes dynamic programming, machine learning, simulated annealing, and clustering algorithms. Algorithmic principles will be emphasized. A project is required.


Instructor Dr Greg Butler, Office ER-1049, Email: gregb@encs.concordia.ca

Lectures Wednesdays 1445-1715 MB 6.425

Course Outline course outline


Lectures

Lectures will present a general introduction into biology, genomics, biotechnology, and bioinformatics.

The introduction will include a discussion of the major resources for protein sequences, protein families, and their curation using ontologies.

The lectures on algorithms will assume a general background in algorithms and data structures, especially specification of algorithm requirements by pre-conditions and postconditions, simple complexity analysis for resource usage (time, memory, disk); and familiarity with hashing, string algorithms, dynamic programming, and greedy algorithm.

Bioinformatics algorithms will focus on algorithms for protein sequence analysis with the aim of understanding the algorithms required for the eggNOG system construction and search with eggNOG-mapper. This includes sequence alignment; profile Hidden Markov models; k-mer techniques for sequence analysis; clustering and phylogenetic tree construction; multiple sequence analysis; and phylogenomics.

There is no set textbook. Pointers to supporting material on the web will be provided. You will be required to read seminal papers on each of the algorithms.

Lecture Schedule and readings.

Resources (TO COME)


Assignments and Projects

There will be four written assignments.

There will be two projects: a small project due in the first half of the course; and a larger project with presentation and report due at the end of the course.

Both assignments and projects are individual work.

The large project will require a 15-minute presentation showing your results and a technical report of up to 10 pages in IEEE two-column format in Latex.

Assignments

Projects


Examinations

There will be a midterm examination.

There will not be a final examination.


Learning Outcomes

Learning Outcomes Knowledge:

Cell Biology:
Basics of Central Dogma of Genomics: dna, rna, nucleotide, amino acid; transcription, translation; gene, mRNA, protein sequences.
Proteins: amino acid sequences; primary, secondary, tertiary, quaternary structures.
Cell components: nucleus, mitochondrion, chloroplast (in plants), endoplasmic reticulum (ER), Golgi, vacuoles, membranes.
Cell processes: metabolism, catabolism, transport, regulation, signaling; cell cycle; cell death (apotosis).
Become familiar with microbial communities (microbiomes); host-microbiome interactions; relevance.
Become familiar with metagenomics and other meta-omics.

Sequence Alignment:
Alignment as a string matching problem.
Alignment as an optimization problem.
Local vs global alignment; Deterministic, exact alignment vs heursitic alignment; Pairwise vs multiple alignment.
Objective functions; scoring (substitution) matrices; gap penalty, gap opening, gap extension.
Algorithms: dynamic programming.
Algorithms: Needleman-Wunsch, Smith-Waterman.
Algorithms: blast, gapped blast, PSI-blast.
Algorithmic approaches for multiple sequence alignment: progressive, iterative (stochastic and non-stochastic), consistency-based.
Algorithms: T-Coffee, ClustalW, ClustalO, MAFFT, MUSCLE, TM-Coffee.

Profile Hidden Markov Models:
What is a profile Hidden Markov Model (HMM)?
How is the model represented as a data structure?
Viterbi algorithm
Algorithms for HMM building and scanning.

Curation, Annotation, and Ontologies:
The curation process recording knowledge gleaned from the scientific literature and experiments.
Human annotation following a curation protocol versus automated annotation by a computer algorithm/tool/classifier.
Ontology as a way for a community to share an agreed set of terminology, concepts, and relationships.
Importance of ontology for data sharing, integration, and machine representation of knowledge.
Familiar with Gene Ontology, ChEBI, ECO.
How Gene Ontology Annotation (GOA) database connects GO with SwissProt and model organism databases.

Phylogenomics and Orthologous Groups:
Protein family as a set of proteins related by evolution (or structure, or function).
The pfam database of protein families.
Evolutionary events of speciation and duplication, leading to orthologs and paralogs respectively.
Phylogenomics as a computational attempt to distinguish orthologs and paralogs.
Relationship between phylogenomics, clustering, and phylogenetic tree construction.
Orthologous groups as a computational definition of protein families.
eggNOG as a successor of COG and KOG.
Algorithm for the construction of eggNOG.
Algorithm for eggNOG-mapper to search eggNOG, given a set of protein sequences..
Algorithms to cluster protein sequences: Markov clustering (MCL), Transitivity clustering (TransClust), Heirarchical orthologous groups (HOG).

Learning Outcomes Critical Thought

Sequence Alignment
Sequence similarity and sequence homology and their relationship.
Understand grey area of percent identity of similarity as to interpretation of homology.

Profile Hidden Markov Models
HMMs and the detection of remote sequence homology.
Understand how to interpret output of hmmer classifiers.

Phylogenomics and Orthologous Groups
Distinction between phylogenetic trees for species, genes, and proteins.
Difficulty of distnguishing orthologs and paralogs derived from recent duplication events.
Relationship between determining family subgroups and specificity determining sites.

Learning Outcomes Skills:

Basic Skills:
Know how to access and query resources.
Understand fasta format for protein sequences, and how to manipulate them using appropriate algorithms for protein sequence analysis.

Sequence Alignment Skills:
Know how to run stand-alone blastall, set parameters, and create tabular output for protein sequences.
Know how to interpret score, e-value, coverage, and percent identity.
Know how to run T-Coffee, ClustalW, ClustalO, MAFFT, MUSCLE, TM-Coffee.

Profile Hidden Markov Models Skills:
Know how to run hmmer tools.

Curation, Annotation, and Ontologies Skills:
Know how to trace trees of GO relationships for a GO term.
Know how to interpret evidence codes.

Phylogenomics and Orthologous Groups Skills:
Know how to search the pfam database, give a protein sequence.
Know how to run eggNOG-mapper and analyse the results.
Know how to run various algorithms to cluster protein sequences based on sequence similarity.


Last modified on 6 September 2022 by Greg Butler