The principal objectives of the course are to cover the major algorithms used in bioinformatics: sequence alignment, multiple sequence alignment, phylogeny; classifying patterns in sequences; secondary structure prediction; 3D structure prediction; analysis of gene expression data. This includes dynamic programming, machine learning, simulated annealing, and clustering algorithms. Algorithmic principles will be emphasized.
This is not a theoretical course on algorithmic complexity.
Bioinformatics is a relatively new discipline dealing with the computational needs of genomics. Biology has become a data-intensive activity. This data must be analysed, mined, and applied. There are a broad range of questions asked by biologists, and these require a broad range of algorithmic techniques.
This semester there is an emphasis on multiple sequence alignment algorithms and their applications. You will need to read and understand these survey papers, and to implement several algorithms.
It is this material that will be the focus of the final examination.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences. CUP, 1997.
Andreas Baxevanis and B.F. Francis Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley and Sons, 1998.
Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison. Biological Sequence Analysis. CUP, 1998.
Pierre Baldi and Soren Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, 1998.
Teresa K. Attwood and David J. Parry-Smith. Introduction to Bioinformatics. Prentice-Hall, 1999.
General introductory material can be found in Lectures for Summer 2003 (internal only) that covers Sequence algorithms; Gene expression analysis; Sequence patterns and classifiers; Molecular structure modeling and prediction.
Students are required to complete three C++ programming assignments (60%) and a final examination (40%).
The programming assignments are done in teams of three. Each team member is to implement one major algorithm for multiple sequence alignment in C++ and following good object-oriented design practices. The team as a whole is to document their designs and implementations.
The final examination will be a formal three-hour examination on the contents of the course.
Assignment 1: due week 5
email .pdf file to gregb@cs by 18:00 on February 6, 2004
Write a document where you clearly explain the objects and steps involved in each of the algorithms. Clearly identify and describe the major data structures.
Assignment 2: due week 9
email .pdf file to gregb@cs by 18:00 on March 12, 2004
Write an object-oriented design for each of the algorithms. Try to identify common classes that can be used across several algorithms.
Assignment 3: due week 12
email .pdf file to gregb@cs by 18:00 on April 2, 2004
Document your implementations. Present performance results for how they compare with each other, and with the original C implementations.
Presentations in week 8 of your preliminary designs, and week 13 of your implementations and performance results.
See the general introductions (from a biologist's perspective) of sequence alignment and multiple sequence alignment.
Three algorithms that have available source code in C, and good descriptions in articles and documentation are ClustalW, MULTI-LAGAN, and POA. These are the algorithms for which you have to create good object-oriented designs and C++ implementations using the Standard Template Library and the BOOST Graph Library. There is a book on the BOOST Graph Library, which was originally the Generic Graph Component Library (GGCL).
Note added January 21, 2004: MULTI-LAGAN is designed to work with nucleotide sequences (ie DNA) rather than amino acid sequences (ie proteins), so it is recommended that you work with the T-Coffee algorithm instead.
Note added January 21, 2004: If your team has four members then I suggest that the fourth member do the PRRP algorithm (download the latest .tar.gz file). (prrp article)
Thompson J.D., Higgins D.G., Gibson T.J.; "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice."; Nucleic Acids Res. 22:4673-4680(1994). (ascii text).
Documentation on the Algorithm.
Source Code. Get version 1.83 since it is the most recent.