Wednesday, May 13, 2009

20) How to use BLASTClust to identify representative sequences

This is useful when you have too many sequences for a quick analysis and want to work with a smaller set that represents all of them. For example, suppose you have a dataset of 1000 sequences, each 600 amino acids long, and you would like to run a phylogenetic analysis on them using the neighbor-joining algorithm. Because the input dataset is large, it may take a few days to get your results. In this case, you could analyse only the sequences that represent all 1000. Follow the steps below to get your representative sequences:

1. Go to http://toolkit.tuebingen.mpg.de/blastclust



2. Load your input sequence file and set the length to be covered to 100%, and the percent identity to 90% or to any other desired threshold.



3. You may either wait for the results to appear or wait for an email from the server about the results.



4. Here is an example of the results





5. Save the resulting output (the representative sequence FASTA file) by clicking the "save" button on the "Results" bar, and give the file a ".fasta" extension.



6. Check the unique representative FASTA file by opening it in a text editor.
The file may look odd in Notepad; use WordPad instead, or get a good text editor
such as EditPlus or Notepad++. You are done!
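
If you prefer to check the saved file with a script rather than by eye, here is a minimal Python sketch. It is not part of the BLASTClust server; the function names (read_fasta, summarize) and the example data are made up for illustration, and it assumes the saved file is plain FASTA text with ">" header lines.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text from a file-like object into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()  # also removes Windows or Unix line endings
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(records):
    """Return a few basic sanity checks for a list of FASTA records."""
    lengths = [len(seq) for _, seq in records]
    headers = [h for h, _ in records]
    return {
        "count": len(records),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "duplicate_headers": len(headers) - len(set(headers)),
    }


# Hypothetical example data standing in for the saved representative file.
example = StringIO(">seq1\nMKT\nAYI\n>seq2\nMKTAYIAK\n")
print(summarize(read_fasta(example)))

# To run it on your own file (the file name here is an assumption):
# with open("representatives.fasta") as fh:
#     print(summarize(read_fasta(fh)))
```

The count tells you how many representatives you ended up with, which you can compare against the number of sequences in your original input to see how much the dataset was reduced.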




Content by: Asif M. Khan & Sye Bee
Posted by: Sye Bee
Edited by: Asif M. Khan
