Comparing Protein Sequences
Proteins are typically described by the sequence of amino acids they are composed of. One of the
goals of bioinformatics is to develop software tools for the analysis of protein sequences. An
important task is to compare sequences of similar proteins to identify mutations and help
establish the relationships between different organisms as they evolve.
An important step in the comparison of two protein sequences is the identification of a Maximal
Unique Match (MUM), defined as a subsequence satisfying the following conditions: a) it is
found in both proteins; b) it is unique in both proteins; c) it is maximal, in the sense
that it is not a subsequence of a larger sequence satisfying a) and b). Finding a MUM helps in the
process of sequence alignment. To avoid generating many small MUM sequences, it is
conventional to search only for MUM sequences having at least a given size, typically 20.
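The three MUM conditions can be checked directly for a candidate string. The following is only a minimal sketch and not the code of the provided mum.cpp: it assumes that a "subsequence" means a contiguous run of amino acids, that sequences are held in std::string, and the names countOccurrences and isMum are hypothetical. Because a unique match occurs at exactly one place in each protein, condition c) reduces to checking that the match cannot be extended by one matching character on either side.

```cpp
#include <cstddef>
#include <string>

// Count occurrences of `pat` in `text`, including overlapping ones.
std::size_t countOccurrences(const std::string& text, const std::string& pat) {
    if (pat.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = text.find(pat); pos != std::string::npos;
         pos = text.find(pat, pos + 1)) {
        ++count;
    }
    return count;
}

// Hypothetical helper: true when `s` satisfies conditions a), b) and c)
// for proteins `a` and `b` and has at least `minLen` characters.
bool isMum(const std::string& a, const std::string& b,
           const std::string& s, std::size_t minLen = 20) {
    if (s.size() < minLen) return false;

    // Conditions a) and b): found exactly once in each protein.
    if (countOccurrences(a, s) != 1 || countOccurrences(b, s) != 1) return false;

    // Condition c): the unique occurrences must not be extendable.
    const std::size_t ia = a.find(s);
    const std::size_t ib = b.find(s);
    if (ia > 0 && ib > 0 && a[ia - 1] == b[ib - 1]) return false;  // extends left

    const std::size_t ea = ia + s.size();
    const std::size_t eb = ib + s.size();
    if (ea < a.size() && eb < b.size() && a[ea] == b[eb]) return false;  // extends right

    return true;
}
```

For instance, with a = "XXABCDEYY" and b = "ZZABCDEWW", the call isMum(a, b, "ABCDE", 5) holds, while "BCD" fails condition c) because it can be extended to "ABCD" in both strings. This check is quadratic at best and is meant to illustrate the definition, not to search long sequences efficiently.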
Another useful step is to identify the first mutation (or mismatch between the two sequences)
that follows a given position in the sequence.
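As an illustration, the search for the next mismatch can be sketched as a simple scan. This is a sketch under stated assumptions, not the function used by nextMutation.cpp: it assumes the two sequences are compared index by index (no insertions or deletions), that "follows" means strictly after the given position, and that only the part common to both sequences is examined. The name firstMismatchAfter is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Return the smallest index i > pos at which a[i] != b[i], looking only at
// the range covered by both strings. Returns std::string::npos if the two
// sequences agree everywhere after `pos`.
std::size_t firstMismatchAfter(const std::string& a, const std::string& b,
                               std::size_t pos) {
    const std::size_t n = std::min(a.size(), b.size());
    if (pos >= n) return std::string::npos;  // nothing left to compare
    for (std::size_t i = pos + 1; i < n; ++i) {
        if (a[i] != b[i]) return i;
    }
    return std::string::npos;
}
```

Calling the function repeatedly, each time passing the index it last returned, visits every mismatch in order until std::string::npos is returned.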
In this assignment, you will use two provided programs, mum.cpp and nextMutation.cpp.
The first compares two protein sequences and searches for a MUM sequence of size
at least 20. The second compares two protein sequences and searches for the first mutation
following a given position. Both programs use a class Sequence that represents a protein’s
amino acid sequence, and use functions that perform the search for a MUM and for a mutation.
You will implement the Sequence class so that the programs mum.cpp and
nextMutation.cpp reproduce the example output files provided. The files Makefile,
Sequence.h, mum.cpp and nextMutation.cpp are provided and must not be modified.
You must implement the file Sequence.cpp.
Representation of protein sequence data
Protein sequences are provided in the form of text files obtained from the National Center for
Biotechnology Information (NCBI). The files conform to the FASTA format in which each
amino acid is represented by a capital letter in the range [A-Z]. The first line of a FASTA file
contains information identifying the sequence and starts with the character ‘>’, as for example
>QWE88920.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
for the spike protein appearing on the surface of a certain variant of the SARS-CoV-2 virus. The
first line is followed by the amino acid sequence itself, as for example