Monday, December 14, 2009

37) LAPIS: Converting BLAST hit sequences into FASTA format

LAPIS is open-source software written in Java. One of its uses is to simplify data for analysis by extracting only the relevant information from a text file.

For example, it can be used to extract only the subject sequence and the identifier (NCBI GI number) from a BLAST result. Below is an illustration of how this can be done using LAPIS:

Before

>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747

Score = 570 bits (1470), Expect = e-163
Identities = 284/284 (100%), Positives = 284/284 (100%)
Frame = -2

Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143


After

>gi|126385999
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT

Methodology:
1. Select the lines containing the FASTA description together with the lines containing the subject sequences: “line containing > or line containing sbjct” -> Tools -> Extract
2. To get rid of the numbers in the lines containing sbjct: “digits in line containing sbjct” -> Tools -> Omit
3. To get rid of sbjct: “sbjct:” -> Tools -> Omit
4. To get rid of dashes (gaps): “-” -> Tools -> Omit
5. To get rid of the extra spaces in the lines containing sequences: “spaces not in line containing >” -> Tools -> Omit
6. In case you want to clean up the description line to only have the GI: “from second | in line containing > to start of linebreak” -> Tools -> Omit
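
For readers who prefer a script, the six steps above can be approximated in a few lines of Python. This is only a rough sketch and not part of LAPIS: the function name blast_to_fasta is made up for illustration, and it assumes plain-text BLAST output in which description lines start with ">" and subject lines start with "Sbjct".

```python
import re


def blast_to_fasta(blast_text):
    """Sketch: keep only '>gi|NUMBER' headers and cleaned subject sequences."""
    out = []
    for line in blast_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            # Step 6: keep everything before the second "|"
            out.append("|".join(stripped.split("|")[:2]))
        elif stripped.lower().startswith("sbjct"):
            # Step 3: drop the "Sbjct:" label
            seq = re.sub(r"(?i)^sbjct:?", "", stripped)
            # Steps 2, 4 and 5: drop digits, gap dashes and spaces
            seq = re.sub(r"[\d\s-]", "", seq)
            out.append(seq)
        # Step 1: every other line (Length, Score, Query, ...) is skipped
    return "\n".join(out)


sample = """>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747

Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143
"""

print(blast_to_fasta(sample))
```

On the sample hit this prints the header >gi|126385999 followed by the subject sequence. If an alignment spans several Sbjct lines, each one becomes a separate sequence line under the same header, which is still valid FASTA.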


Screen shots: The following screen shots show the input and the output at each step.











For extracting specific information, the user needs to find a pattern and type it in the pattern box as shown below:













The pattern above extracts the two lines needed for further analysis.







Next, the user should remove the positions (digits) from the subject line.













The screen shot below shows the highlighted digits to be omitted.







The screen shot below shows the information after omitting the numbers from the subject line.







Next, the label “Sbjct:” should also be omitted.



















The screen shot below shows the pattern to remove any extra spaces in the subject line. Any gaps (-) should also be removed.




















The screen shot below shows the pattern to remove extra information from the header line.













The screen shot below shows the required output.
