COMS30003 Exercise 2
Protein Sequence Analysis
This exercise will be marked in groups of two or three (all group members must submit identical answers in SAFE). It is worth 70% of the unit. A word of warning: this coursework may look easy, and it is not that hard, but be aware that some parts may take a long time to run! So if you do not leave yourselves plenty of time then you may find that you are unable to complete some parts before the deadline. Also you may find that, although some parts are simple in principle, difficulties in understanding/running software tools or their inputs/outputs mean that it may take longer than anticipated. In summary, aim to finish this well before the deadlinem, you have been warned …
The starting point for this coursework is a single protein domain. You must all choose a sequence independently. DO NOT deliberately choose the same sequence as somebody else.
Please submit your answers as a single plain text file called exercise2.txt. Clearly number the answers following the numbering scheme in the specification below, and do not include additional commentary or notes other than that which is asked for. If you need to include images, alignments, or any other files then simply give the files intuitive filenames and include them in your submission. In the text file you must refer to the files using the correct filename under the correctly numbered answer.
Each question answer should begin with a single line of exactly the form below. This will be used to automatically parse your answers for marking (get this wrong and some of your questions risk going unmarked):
The questions are of unequal difficulty. The ones labelled ‘optional’ are not necessary to complete the other questions, and may be much harder than the other questions, but are not worth much more marks. The penultimate section (questions 17-21) requires making a pipeline from some of the other questions, it is recommended that you start this while you are working on the early questions and work on your code in parallel with answering questions 1-16. The last section on online servers is less dependent on the other sections, so make sure you don’t miss it out even if you are struggling with the earlier sections. It is not adviseable to do these questions until you have done the others first. Approximate marks available are listed for each question.
To select the sequence
You must chose a sequence identifier following this procedure:
By browsing SUPERFAMILY find a sequence that has more than one domain, e.g. like this example.
Then select a SCOP domain superfamily from that sequence that satisfies the two conditions below, e.g. Cupredoxins from the example.
You can follow links from SUPERFAMILY or go to the SCOP website directly. Choose an individual entry from this SCOP superfamily, making a note of the PDB identifier and the chain and region if necessary. From the Cupredoxins example you could choose 1PLC which has one chain and region so no need to specify.
Condition 1: It must belong to a superfamily which has more than one family.
Condition 2: It should not be any of the following: TIM barrel, Rossmann fold, Beta propeller, Immunoglobulin, Globin, P-loop, Zinc finger.
Obtain pairwise sequence comparison software
You must download and install some pairwise sequence comparison software. Later on you will be running it on some large datasets so I advise you to do this on the department linux machines, but you are free to use any computer bearing in mind that you might have to copy some large datasets off the department linux filesystem if you do use some other computer.
You should download BLAST since this is the most commonly used algorithm. You can get it from the NCBI ftp site. Alternatively you can use the copy I have downloaded here:
After you have obtained BLAST you may download other methods as well for comparison, e.g. Smith-Waterman, FASTA, SSearch or others, but not until you have completed the tasks below using BLAST. Documentation is in the ‘doc’ directory of the installation.
Obtain sequence databases
You should go to the ASTRAL website and download the 1.75 version (to match SUPERFAMILY v. 1.75) “ASTRAL SCOP genetic domain sequence subsets, based on PDB SEQRES records, with less than 95% identity to each other”. Uniprot is very large (try “grep -c ‘^>’ Uniref100.fasta” to see how big) so you are not expected to download it, I have provided it for you here:
Search the databases
Please enter the information about the sequence you have chosen, including the identifier.
By examining the ASTRAL file which is in FASTA format, find the sequence which corresponds to the domain which you have chosen. Create a new FASTA format file of your own which contains only this sequence. Paste the contents of this file as you answer to question 1.
Using BLAST you should now search your sequence against the whole ASTRAL file which you downloaded. You will need to index the database first using the ‘makeblastdb’ program. The –help switch gives the options; remember, throughout this exercise there is no harm in getting help on the correct commands and arguments from the unit forum, Google, Researchgate and other online information resources. Look at your results and how the scores correspond to the SCOP classification. You may find the classification shorthand very useful, e.g. from the 1plc example, b.6.1.1 (the first letter corresponds to the SCOP class, then the numbers go fold, superfamily, family).
N.B. see question 17.
For question 2 please describe what scores you got for other domains within the SCOP hierarchy, and then explain whether this is what you should expect or not. Use less than 100 words.
Now you should search your chosen sequence against UniProt. I have already indexed UniProt for you. You will notice that it takes much longer than searching the ASTRAL sequences! Take a look at your results again; you should store your results in a file and include this in your submission. Put the search on the ASTRAL results in a file also and include it in your submission.
List the filenames of your two results files here. You probably got more hits searching UniProt than the ASTRAL sequences. How many sequences do you think are homologous? How did you decide which hits are significant and which ones are not? One line answer please.
Are any of the sequences matched in UniProt the same as in ASTRAL? Comment on why. One line answer please.
Make two new FASTA format files: one containing all of the sequences with significant hits in ASTRAL, and another with all the sequences with significant hits in UniProt. If there are a very large number of sequences then you will have to find a way to extract them via their identifiers from the UniProt FASTA file.
N.B. You may well find the PERL script listed on the unit page from the lecture on PERL useful here. You may also want to write your own script in PERL or Python or modify this one. If you have any questions or problems with PERL/Python/Ruby please do not hesitate to ask on the forum as you need to succeed in creating these FASTA files to be able to tackle some of the later questions.
Question 5 (optional)
If you want to try some other software for pair-wise sequence comparison, then rerun the searches on both databases and compare the results to BLAST. Is there any difference? Is this what you should expect? Less than 50 words.
Comment on the lengths of the sequences in each of the two FASTA files you have created. If the some of the sequences in the UniProt results are longer, why is this? If not, why might you have expected some to be longer? Less than 50 words.
Look at the names and annotation of the sequences in UniProt that have been found homologous. Do some of them look similar to the SCOP domain description of your chosen sequence? Do some of them look different? Explain why this might be. Use less than 100 words.
Obtain alignment software
Download any multiple sequence alignment tool. MUSCLE is one which should be easy to use. Create multiple sequence alignments from both of your FASTA format files. If the sequence alignment takes too long, then try different software or reduce the number of sequences. Include these alignment files in your submission.
N.B. see question 19.
List the filenames of your alignments here. Describe very briefly how you produced the alignments: what software did you use, any options you used or anything special you needed to do. What output format did you choose? One line or two should be enough detail.
When you ran your pair-wise sequence comparison it will have produced pair-wise alignments. If not make sure you rerun it to produce alignments. Compare the alignment of the initial sequence you started with in the multiple sequence alignment to the corresponding pair-wise alignment.
Do you see any differences? If so which one do you think is better, and why? Less than 50 words.
Examine the multiple alignments in two ways: graphically and numerically. You can download software for visual inspection of multiple sequence alignments but it may be easier to find an online service that allows you to do it on the web, e.g. Jalview. Here is a web service I wrote for generating simple statistics on multiple sequence alignments (scroll to the right for conserved residues). You may equally use any other tool for examining the columns of the alignment numerically. N.B. This service is quite flaky and sometimes doesn’t work. Please report any bugs to me.
You should also go to the SUPERFAMILY website and find the superfamily of your starting sequence. You can get this by searching for it by name or by PDB ID. Click on ‘models’ in the search result and then click on one of the model IDs to get an image of a hidden Markov model representing the domain. Compare the columns in your alignmet viewer to the numerical analysis of the columns and the corresponding columns in the picture of the hidden Markov model.
Question 10 (optional)
Make an image of one of your multiple sequence alignments and include it in your submission listing the filename here. You may need to only display a part of your alignment to get a good image. Also copy or crop one of the hidden Markov model images which correspond to the same region of the alignment and include this in your submission; list the filename here. Pick some of the statistics or numerical breakdown of the columns in your alignment and enter it in the answer to this question.
Question 11 (optional)
Comment briefly on what you see in commmon for certain columns between the multiple alignment, the statistics you wrote in question 10 and the columns of the hidden Markov model, referring to the specific column. Numbering or labelling of columns may help here. Less than 50 words.
Obtain HMM software
Download the HMMER3 package. You may also use the version installed at:
Documentation is available on the HMMER3 website.
The next step is to build a hidden Markov model from each alignment. You will need to use the ‘hmmbuild’ program. Now score your model against ASTRAL and UniProt using the ‘hmmsearch’ program. Submit the HMM model files and list the names here. How do your results compare to the BLAST searches? How do the results compare between the two different models? Why do you think you see these differences (if any)?Use less than 100 words.
Question 13 (optional)
You should try the searches both with and without the heuristics switched on (use the -max option) on the ASTRAL sequence set. The non-heuristic search of UniProt may take an extremely long time, so you should not run it on snowy but you can try it in the computer lab if you like (run using screen and a high ‘nice’ value). What are the effects on the results and on run time (if any) with and without heuristics?Use less than 50 words.
Question 14 (optional)
Produce multiple sequence alignments of your two FASTA files (as you did before question 8) but instead of using sequence alignment software, use one of your HMM models and the ‘hmmalign’ program. Submit the files and list them here. How does the speed compare to the multiple alignment software? One line.
Iterative model building
So far the profiles (HMMs) that you have built are from a single iteration. You could now in theory create a new FASTA file from the results of your HMM search on UniProt in question 12, then use the procedure in question 14 to produce an alignment and then build a new model as in question 12 and re-search UniProt … etc. getting more and more results. Luckily this process has already been automated. The HMMER3 program is ‘jackhmmer’ and the BLAST program is ‘psiblast’ (also referred to as PSI-BLAST). Documentation on PSI-BLAST can be seen here.
Run PSI-BLAST with 3 iterations on UniProt using your initial starting sequence. Submit and list your results file here. How many significant hits does this find? How does this compare to straighforward BLAST and your two HMM model searches? Is this what you should have expected to find? Less than 50 words.
Question 16 (optional)
Convert your PSI-BLAST profile (checkpoint file) to a hidden Markov model. You may need to rerun the PSI-BLAST search setting an option for outputting the checkpoint file. You can use a component (see the ‘convert_to_prc’ program described under downloads ner the bottom) of the PRC package that somebody in my research group wrote but it converts to HMMER2 format. Alternatively try the script in /home/media/gough/teaching/COMS30003/scripts/. The HMMER3 package has a program for converting from the old HMMER2 format to HMMER3. Alternatively get the PSI-BLAST alignment frmo the last iteration into a format that you can run ‘hmmbuild’ on from teh HMMER package to build your own model from the same alignment. Now use HMMER to search the HMM against UniProt and compare the results to the PSI-BLAST results. What is the difference in the number of significant hits, and why do you think this is? Less than 50 words.
Many bioinformatics tasks need to be repeated many times, and thus automated. This section requires you to write code for a pipeline that automates some of what you ahve done above. It is recommended that you develop this at the same time as you answer the previous relevant questions. You can use any language but it is strongly recommended that you use a scripting langauge such as Python, PERL, etc. N.B. you are asked to submit your program for each question, but you may submit one final program if you wish for all questions, indicating clearly in your answer which is the relevant part for that question.
Write a program which takes as in input the name of a file. This file will contain a single FASTA format sequence. Your program will search this input sequence against the ASTRAL file (also in FASTA format) using BLAST and output the identifiers of the significant hits; to do this you will need to find a way to parse BLAST output. Include this program in your submission and indicate which one it is and how to run it. One line.
Now extend your code so that instead of outputting the list of IDs, it uses this list to retrieve the sequences from the ASTRAL file and output a new file of FASTA sequences containing only those in your list, i.e. the FASTA sequences of those significant BLAST hits to your original input sequence. Run this for the following input file:
Include the program and the output file in your submission. One line.
Now extend your code so that (using the method of your choosing) it creates a multiple sequence alignment from your FASTA sequences. Include the program and the output alignment for the d1plc_.fa input file in your submission. One line.
Now extend your code so that (using the method of your choosing) it creates a phylogenetic tree in Newick format from your alignment. Include the program and paste the Newick tree in your submission. A few lines.
Visualise and plot your tree using any method of your choosing.
Now extend your code so that the input file can support multiple sequences in FASTA format in the input file and output multiple Newick trees into an output file (one Newick tree for each FASTA sequence seperately), by running your pipeline. Include the program and the output file containing multiple Newick trees for the multiple.fa input file in your submission. One line.
Please write a description of your pipeline and the justification for any choices made. The quality of your pipeline and code may affect your marks for 17-20 as well as this question. Up to 300 words.
This last section is to show that you can do many common bioinformatics tasks online without always having to download and run the software locally.
Find an online server where you can do the UniProt search from question 2 and repeat the search using the server. Write here the link to the page where you entered your sequence for the search. One line.
Find other online server(s) which do a similar search to question 17, but using a different algorithm to BLAST. You may not be able to find a server that searches UniProt, but any other equivalent or similar large database will do. List the URL where you submit the sequence, the database which you searched and the algorithm used in each case. A few lines.
Find an online server for multiple sequence alignment and repeat question 8 using different software than you used for question 8. List here the URL, software used, and submit the alignments and list their names here.
Using any online service, create a multiple sequence alignment of homologues to your sequence. Now take that multiple sequence alignment and generate some statistics or visualise the level of conservation in the columns of the alignment. There is a web page here: http://supfam.org/SUPERFAMILY/alignment_stats.html that may be able to help but it is slow and may not be entirely reliable, especially for large alignments and it can be very sensitive to the exact format of the alignment; you may want to find an alternative online resource. Write the URL of any statistics page, but also submit a copy of it (in case it is deleted from the server) and list the filename.
Write a short explanation of how the statistics at one or two sites relates to the corresponding position(s) in the structure (specifying the amino acid number in the PDB file) and say something about the relevance to the specific site in the structure. 50-150 words.
Find an online server for iterative profile or HMM building and use your initial sequence to generate a model or profile. For this answer show the URL where you submit your sequence, and submit the model produced listing the filename here. Extra credit for choosing one that is not PSI-BLAST or HMMER.