Greg Butler: Bioinformatics Algorithms Course

COMP 691 R Bioinformatics Algorithms

Winter 2004 Semester: January 5 to April 5, 2004

Lectures: Wednesdays 20:30 to 23:00 in H-439

The principal objectives of the course are to cover the major algorithms used in bioinformatics: sequence alignment, multiple sequence alignment, phylogeny; classifying patterns in sequences; secondary structure prediction; 3D structure prediction; analysis of gene expression data. This includes dynamic programming, machine learning, simulated annealing, and clustering algorithms. Algorithmic principles will be emphasized.

This is not a theoretical course on algorithmic complexity.

Bioinformatics is a relatively new discipline dealing with the computational needs of genomics. Biology has become a data-intensive activity. This data must be analysed, mined, and applied. There are a broad range of questions asked by biologists, and these require a broad range of algorithmic techniques.

Focus for This Semester

This semester there is an emphasis on multiple sequence alignment algorithms and their applications. You will need to read and understand these survey papers, and to implement several algorithms.

L. Duret and S. Abdeddaim, Multiple alignments for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics: Sequence, Structure and Databanks, editted by D. Higgins and W. Taylor, Oxford University Press, 2000.
C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics 3(1) (2002) 131-144.

It is this material that will be the focus of the final examination.

General Information Sources

Dan Gusfield. Algorithms on Strings, Trees, and Sequences. CUP, 1997.

Andreas Baxevanis and B.F. Francis Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley and Sons, 1998.

Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison. Biological Sequence Analysis. CUP, 1998.

Pierre Baldi and Soren Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, 1998.

Teresa K. Attwood and David J. Parry-Smith. Introduction to Bioinformatics. Prentice-Hall, 1999.

General introductory material can be found in Lectures for Summer 2003 (internal only) that covers Sequence algorithms; Gene expression analysis; Sequence patterns and classifiers; Molecular structure modeling and prediction.

Evaluation

Students are required to complete three C++ programming assignments (60%) and a final examination (40%).

The programming assignments are done in teams of three. Each team member is to implement one major algorithm for multiple sequence alignment in C++ and following good object-oriented design practices. The team as a whole is to document their designs and implementations.

The final examination will be a formal three-hour examination on the contents of the course.

Assignments

Assignment 1: due week 5
email .pdf file to gregb@cs by 18:00 on February 6, 2004
Write a document where you clearly explain the objects and steps involved in each of the algorithms. Clearly identify and describe the major data structures.

Assignment 2: due week 9
email .pdf file to gregb@cs by 18:00 on March 12, 2004
Write an object-oriented design for each of the algorithms. Try to identify common classes that can be used across several algorithms.

Assignment 3: due week 12
email .pdf file to gregb@cs by 18:00 on April 2, 2004
Document your implementations. Present performance results for how they compare with each other, and with the original C implementations.

Presentations in week 8 of your preliminary designs, and week 13 of your implementations and performance results.

General Background for Multiple Sequence Alignment

See the general introductions (from a biologist's perspective) of sequence alignment and multiple sequence alignment.

L. Duret and S. Abdeddaim, Multiple alignments for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics: Sequence, Structure and Databanks, editted by D. Higgins and W. Taylor, Oxford University Press, 2000. This book is on Library Reserve at Webster Library.
C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics 3(1) (2002) 131-144.
J.D. Thompson, F. Plewniak, O. Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Research, 27, 13 (1999) 2682-2690.
Quality assessment of multiple alignment programs, Timo Lassmann and Erik L.L. Sonnhammer, FEBS Letters 529:126-130 (2002).
The web site at ExPasy has links to a number of alignment programs. Notable ones are ClustalW, DIALIGN and T-Coffee.
Recommended viewers and editors of multiple sequence alignments are JalView, CINEMA and SeaView. But the one I suggest that you work with is called Indonesia.
You can find examples of multiple sequence alignments in the BaliBase database of alignments. The sets of sequences can be used as test input for your program.

The Three Algorithms

Three algorithms that have available source code in C, and good descriptions in articles and documentation are ClustalW, MULTI-LAGAN, and POA. These are the algorithms for which you have to create good object-oriented designs and C++ implementations using the Standard Template Library and the BOOST Graph Library. There is a book on the BOOST Graph Library, which was originally the Generic Graph Component Library (GGCL).

Note added January 21, 2004: MULTI-LAGAN is designed to work with nucleotide sequences (ie DNA) rather than amino acid sequences (ie proteins), so it is recommended that you work with the T-Coffee algorithm instead.

Note added January 21, 2004: If your team has four members then I suggest that the fourth member do the PRRP algorithm (download the latest .tar.gz file). (prrp article)

ClustalW Information

Thompson J.D., Higgins D.G., Gibson T.J.; "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice."; Nucleic Acids Res. 22:4673-4680(1994). (ascii text).

Documentation on the Algorithm.

Source Code. Get version 1.83 since it is the most recent.

Last modified on January 14, 2004 by gregb@cs.concordia.ca