COMP 6811 Bioinformatics Algorithms

Fall 2022 Section FF


Assignments

Assignments are individual work.

You must submit to EAS (the electronic assignment submission system). You must submit the assignment in the format requested. No other format will be accepted (except zip, but only use zip if your internet connection makes it necessary to compress the file).

You must submit to the requested location in EAS, such as "theory_assignment 1", and you must submit to COMP6811 (and not some other course).

Late Penalty There will be a penalty of 10 percent for each day late.


Assignment 1

Deadline 10:59 am Friday 16 September 2022
Submission A pdf file to "theory_assignment 1" to EAS under COMP6811.

Part A Write a clear description of the Knuth-Morris-Pratt algorithm (KMP) to search a pattern w in a string S where your pattern and strings are over the DNA alphabet a, c, t, g.
You must include specifications in terms of pre-conditions on the inputs, and post-conditions on the output.

Part B Describe how you would use KMP (or modify KMP) for the problem of searching a collection of strings for the occurrences of a single pattern w.

Your report must be written in Latex using a suitable package for algorithms.


Assignment 2

Deadline 10:59 am Friday 30 September 2022
Submission A zip file to "theory_assignment 2" to EAS under COMP6811 containing directory with your source code, test input, and test output.

The task is to search strings over the DNA alphabet a, c, t, g for all k-mers of a given size k. You should return the number of different k-mers found; and a list of the k-mers found, with the frequency of their occurrence, sorted by order of largest to smallest frequency.

Part A Choose a suitable language, such as C, C++, Python, or Java, to implement a program for the task. Document your source code to describe the data structures used to represent the string, k-mer, and the collection of (k-mer, frequency) pairs. Document the major steps in your program, and indicate their complexity in terms of use of time and memory. Test your code with small strings (up to size 100) and small k (up to 5).

Part B Run your code on real examples, such as one chromosome from yeast, and one from mouse, for values of k from 3 to 8. Choose the longest chromosome.

Part C The genome of prokaryotes is circular, so you ned to make some modifications to run your code on the genome of E. coli a common bacteria. Then run your code on the genome for values of k from 3 to 15.


Assignment 3

Deadline 10:59 am Wednesday 16 November 2022
Submission A zip file to "theory_assignment 3" to EAS under COMP6811 containing your scripts, test input, and test output.

The task is to learn how to use Diamond with microbiome data.

Download and install Diamond.
Select a subset of about 1000 sequences from your selected microbiome data for your project.
Download and install the nr database (or a portion of it, depneding on your resources).
Use Diamond to search nr for your selected sequences. It should create a .daa file.


Assignment 4

Deadline 10:59 am Friday 25 November 2022
Submission A pdf file to "theory_assignment 4" to EAS under COMP6811 for a report of this task with appropriate screenshots from MEGAN.

The task is to apply MEGAN to the sample of microbiome data from Assignment 3.


Last modified on 6 September 2022 by Greg Butler