Greg Butler: Bioinformatics Algorithms Course

COMP 691 R Bioinformatics Algorithms

Winter 2005 Semester: January 5 to April 6, 2005

Lectures: Wednesdays 17:45 to 20:15 in H-523

Final Examination

The final examination is currently scheduled for April 20, 2005 from 19:00 to 22:00 in Room H539-1.

The principal objectives of the course are to cover the major algorithms used in bioinformatics: sequence alignment, multiple sequence alignment, phylogeny; classifying patterns in sequences; secondary structure prediction; 3D structure prediction; analysis of gene expression data. This includes dynamic programming, machine learning, simulated annealing, and clustering algorithms. Algorithmic principles will be emphasized.

This is not a theoretical course on algorithmic complexity.

Bioinformatics is a relatively new discipline dealing with the computational needs of genomics. Biology has become a data-intensive activity. This data must be analysed, mined, and applied. Biologists ask a broad range of questions, and these require a broad range of algorithmic techniques.

Focus for This Semester

This semester the emphasis is on sequence alignment, multiple sequence alignment, phylogeny, and phylogenomics. You will need to read and understand survey papers and research articles, and to install and run software packages.

Course outline

General Information Sources

D. Higgins and W. Taylor, Bioinformatics : sequence, structure, and databanks : a practical approach. Oxford University Press, 2000. QH 324.2 B56 2000 Webster Library Reserve

Articles and web sites to come.

Evaluation

Students are required to complete three assignments (60%) and a final examination (40%). You will have to install and run several software packages for the assignments. Some programming may be involved.

The final examination will be a formal three-hour examination on the contents of the course. You must pass the examination in order to pass the course.

Assignments

Assignment 1: due week 5

Slight changes 26 January 2005

You are required to write a program that compares two assemblies formed from related sets of sequences (reads). For example, the assemblies might take the same input sequences and use different programs such as phrap and CAP3 to do the assembly; the assemblies might take the same input sequences and trim them differently before assembly; or the first assembly might treat a subset of the sequences used for the second assembly.

The input for your program will consist of two pieces of information: (1) a .ace file for the first assembly; and (2) a .ace file for the second assembly.

Your program must generate a text file which contains a report on the similarities and differences between the two assemblies. If the two assemblies are identical then the report should simply state that fact. If the assemblies differ, then the report should include the number of contigs/singletons that differ and, for each one that differs, how it differs. The differences of interest are: (a) does a contig in assembly A consist of precisely the same reads as a contig in assembly B but have a different consensus sequence; (a.i) is the only difference in consensus sequence at the trimming of the ends; (a.ii) which reads have different offsets against the consensus; (a.iii) does a read change direction; (b) does a contig in assembly A consist of a subset of the reads of the corresponding contig in assembly B; (c) is a contig in assembly A a merging of several complete contigs/singletons in assembly B; (d) is a contig in assembly A a merging of some complete contigs/singletons in assembly B together with some subsets of contigs in assembly B.
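As a rough illustration of cases (a) to (d), the following Python sketch classifies one contig of assembly A by the read names it shares with the contigs of assembly B. It is only one possible starting point, not a required design: the function name classify_contig, the label strings, and the choice to represent each assembly as a mapping from contig name to a set of read names are all assumptions made for this example. Singletons are treated as contigs with one read. The sub-cases (a.i) to (a.iii) need the consensus sequences, offsets, and directions, which this sketch does not examine.

```python
def classify_contig(reads_a, assembly_b):
    """Classify one contig of assembly A against assembly B.

    reads_a    -- set of read names in the contig from assembly A
    assembly_b -- dict mapping each contig name in B to its set of read names
    Returns (label, sorted names of the B contigs that share reads with A).
    """
    # B contigs that share at least one read with the A contig.
    overlapping = {name: reads for name, reads in assembly_b.items()
                   if reads & reads_a}
    if not overlapping:
        return "no shared reads", []
    names = sorted(overlapping)

    if len(names) == 1:
        reads_b = overlapping[names[0]]
        if reads_a == reads_b:
            return "same reads", names   # case (a): compare consensus next
        if reads_a < reads_b:
            return "subset", names       # case (b)
        return "other", names

    # Reads of A that occur somewhere in B.
    covered = set().union(*overlapping.values()) & reads_a
    if covered != reads_a:
        return "other", names            # A contains reads absent from B

    # B contigs whose reads all ended up in the A contig.
    complete = [n for n in names if overlapping[n] <= reads_a]
    if len(complete) == len(names):
        return "merge of complete contigs", names               # case (c)
    if complete:
        return "merge of complete and partial contigs", names   # case (d)
    return "other", names
```

Running this for every contig of A, and then again with the roles of A and B exchanged, would give the per-contig entries of the report; contigs labelled "same reads" would then go on to the consensus comparison of (a.i) to (a.iii).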

The .ace format is described in the section entitled "ACE FILE FORMAT" in the documentation for consed. Following the section is a sample text file in .ace format for a single contig consisting of 8 reads.
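To give a feel for the format, here is a minimal Python sketch that pulls out only the information needed to compare read membership, offsets, and direction. It assumes, following the consed documentation, that each contig starts with a line of the form "CO name bases reads segments U|C" and that each read placement is given by a line "AF readname U|C padded_start". The names parse_ace and Placement are invented for this example, and consensus bases, qualities, read sequences, and tags are deliberately skipped. Check the assumptions against the documentation and the sample files before relying on it.

```python
from collections import namedtuple

# One read placement from an AF line: read name, whether the read is
# complemented, and its padded start position against the consensus.
Placement = namedtuple("Placement", "read complemented offset")


def parse_ace(lines):
    """Return {contig_name: [Placement, ...]} from the lines of a .ace file.

    Only CO and AF lines are inspected in this sketch; every other
    line is ignored.
    """
    contigs = {}
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CO" and len(fields) >= 6:
            # Assumed layout: CO <name> <bases> <reads> <segments> <U|C>
            current = fields[1]
            contigs[current] = []
        elif fields[0] == "AF" and len(fields) == 4 and current is not None:
            # Assumed layout: AF <read name> <U|C> <padded start position>
            contigs[current].append(
                Placement(fields[1], fields[2] == "C", int(fields[3])))
    return contigs
```

A call such as parse_ace(open("first.ace")) (the file name is a placeholder) yields one list of placements per contig; the set of read names in each list is what a read-membership comparison between two assemblies would work from.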

Your program can be written in the language of your choice, but I suggest Java.

The internal website has sample .ace files for comparison.

Assignment 2: due week 9 (March 9). Email a PDF file.

Analyse the 8 protein sequences given on the internal web page. For each of them individually, determine (a) whether it has a closely related entry in UniProt (Swiss-Prot + TrEMBL); (b) whether there is a structure in the Protein Data Bank (PDB) for a closely related entry; and (c) what information is provided by InterProScan, SignalP, and PSORT II for the sequence. Then (d) form a multiple sequence alignment of all 8 sequences using each of ClustalW, DIALIGN, and T-Coffee, and compare the three alignments from these programs.
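One simple way to put a number on the comparison in (d), offered here only as a suggestion, is to count how many aligned residue pairs two alignments have in common. The Python sketch below assumes each alignment has already been read into a dictionary mapping sequence name to its gapped row, with "-" as the gap character; the function names aligned_pairs and pair_agreement are invented for this example. Program-specific conventions, such as other gap characters or lower-case letters marking unaligned residues, would need to be normalized first.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column.

    msa -- dict mapping sequence name to its gapped row ('-' for gaps);
           all rows must have the same length.
    Each pair is ((name1, i1), (name2, i2)), where i1 and i2 are 0-based
    residue indices in the ungapped sequences.
    """
    names = sorted(msa)
    width = len(msa[names[0]])
    if any(len(msa[name]) != width for name in names):
        raise ValueError("alignment rows differ in length")
    position = {name: 0 for name in names}   # next residue index per sequence
    pairs = set()
    for col in range(width):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(msa_x, msa_y):
    """Fraction of residue pairs aligned in msa_x that msa_y also aligns."""
    pairs_x = aligned_pairs(msa_x)
    if not pairs_x:
        return 0.0
    return len(pairs_x & aligned_pairs(msa_y)) / len(pairs_x)
```

For example, with rows "AC-GT" and "ACAGT" in one alignment and "ACG-T" and "ACAGT" in another, three of the four aligned pairs in the first alignment are kept by the second, so pair_agreement returns 0.75. The measure is not symmetric, so it is worth reporting in both directions for each pair of programs.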

What is the best annotation that you can derive from the above information for each of the sequences in terms of GO terms (for the 3 categories), EC number, and KEGG pathway?

Run the 8 sequences as a group through PipeAlign and analyse the results. How does the multiple sequence alignment from PipeAlign of the 8 sequences differ from those in (d)? Does the result from PipeAlign clarify the annotation of the 8 sequences when compared with (a)-(d) above?

What does Panther tell you about these 8 sequences?

Submit a 15-page report on your findings.

Assignment 3: due week 12
New deadline April 6, 2005.

Repeat the analysis of Table 1 and Table 2 in the following paper, using the latest information available: Larrondo LF, Salas L, Melo F, Vicuna R, Cullen D. A novel extracellular multicopper oxidase from Phanerochaete chrysosporium with ferroxidase activity. Appl Environ Microbiol. 2003 Oct;69(10):6257-63.

For Table 1, check the PDB for all known structures of MCOs (multi-copper oxidases) including ascorbate oxidase, laccases, and ferroxidases. Create a Table 1 for them, showing the location and sequences for Loops I, II, III, and IV.

Create a new Table 1a by running the sequences in your new Table 1 through InterPro and, instead of showing the above Loop regions, show the regions for the following domains and sites: (i) SSF49503 Cupredoxin; (ii) PF00394, PF07731, PF07732 Multicopper oxidase 1, 2, 3; and (iii) PS00079 and PS00080 Multicopper Oxidase Binding Site 1 and 2. Note that there may be two PS00079 site locations (one near position 120 and one in the region of positions 470 to 550), so indicate each separately in your table.

Are there any other structural features indicated in the PDB entries that are worth noting? If so, create a Table 1b and show the related regions.

For Table 2, create a dataset of sequences comprising all of (a) the 8 sequences of Assignment 2; (b) all the sequences in Table 1 of the paper; (c) all the sequences in Table 2 of the paper; and (d) all member sequences in subfamilies SF10, SF11, SF12, and SF13 of the Panther family "PTHR11709 Multi-copper Oxidase Related". Create a multiple sequence alignment using PipeAlign on the dataset with the multicopper oxidase from the paper as the "query sequence". Create an updated Table 2 for the multiple sequence alignment from PipeAlign: include in the Table only those sequences retained by PipeAlign (but include any sequences that PipeAlign brought in from UniProt and retained in the alignment); organize the sequences into subfamilies according to PipeAlign; group those subfamilies into the laccase, ascorbate oxidase, and ferroxidase families. Show the regions of the alignment for Loops I, II, III, and IV.

Create a Table 2a as you did for Table 1a, using the positions of the InterPro domains and sites for the sequences in your Table 1 (i.e., do not look up all the Table 2 sequences using InterPro) but listing the MSA for the sequences in your new Table 2. How do the positions of those domains and sites agree with the multiple sequence alignment from PipeAlign?

Submit a report of 10-15 pages on your findings.

Lectures (internal only)

Lecture Schedule

2005-01-05 Lecture 1: Course Introduction

2005-01-12 Lecture 2: Introduction to Bioinformatics

2005-01-19 Lecture 3: Sequence Analysis

2005-01-26 Lecture 4: Sequence Analysis

2005-02-02 Lecture 5: Sequence Annotation

2005-02-09 Lecture 6: Sequence Annotation

2005-02-16 Lecture 7: Phylogenomics - Introduction

2005-02-23 Mid-Semester Break

2005-03-02 Lecture 8: Phylogenomics - Multiple Sequence Alignment

2005-03-09 Lecture 9: Phylogenomics - High-quality MSA

2005-03-16 Lecture 10: Phylogenomics - Subfamilies

2005-03-23 Lecture 11: Phylogenomics - Paralogs and Orthologs

2005-03-30 Lecture 12: Phylogenomics - Visualization

2005-04-06 No Lecture

Last modified on March 22, 2005 by gregb@cs.concordia.ca