Saturday, March 21, 2099

All Posts

Tuesday, November 5, 2019

41) Running AVANA on a Mac / Linux machine

Although the AVANA tool is build using Java, the installer available on SourceForge (https://sourceforge.net/projects/avana/) is custom-built for Windows machines. In order to run it on a Mac or Linux machine, one needs to follow the steps below:

1) Download the tool executable from https://sourceforge.net/projects/avana/

2) Install the file by double clicking on the downloaded file. While installing, if you don't have certain java run time environment file, you will be promoted to download. Complete the download and installation of the Java files. Then try installing again. During the installation process, you would be asked where to install Avana. It would be preferred to not install this in the Applications folder, instead save it in an easy to find directory under your user profile, such "/Users/asif/Downloads/AVANA", where "AVANA" is a folder you created under the "Downloads".

3) Go to the folder where you installed AVANA. For this tutorial, let's say it is installed in "/Users/asif/Downloads/AVANA". Go to that folder and you should see a bunch of files, in particular a file entitled "runAvana.bat". We would need to either convert this Windows batch file (*.bat) to a Unix batch file (*.sh). To do this, either rename the windows batch file to "runAvana.sh" or create a new file with the name "runAvana.sh". You can choose to either keep or delete the "*.bat" file, it would not affect the execution of AVANA on a Mac / Linux machine.

4) Open the "runAvana.sh" file and delete the existing content inside. Then paste the following new content to the file:

#!/bin/bash

export AVANA_HOME=/Users/asif/Downloads/AVANA
export CLASSPATH=$AVANA_HOME/avana.jar:$AVANA_HOME/avana-gui.jar:$AVANA_HOME/lib/biojava-1.4.jar:$AVANA_HOME/lib/bytecode-0.92.jar:$AVANA_HOME/lib/commons-beanutils-1.7.0.jar:$AVANA_HOME/lib/commons-logging-1.1.jar:$AVANA_HOME/lib/commons-math-1.1.jar:$AVANA_HOME/lib/jcommon-1.0.0.jar:$AVANA_HOME/lib/jfreechart-1.0.1.jar

java -Xms256m -Xmx800m it.southdown.avana.ui.AvanaGUI


5) In your case, replace the path (/Users/asif/Downloads/AVANA) in "AVANA_HOME" line to where you have installed the tool.

6) Open the Terminal. Navigate to the folder (/Users/asif/Downloads/AVANA) on the Terminal.

7) Make the "runAvana.sh" file executable through the following command:

chmod a+x runAvana.sh

8) Execute Avana:

sudo ./runAvana.sh

9) Sudo is to give you admin access in case you need authorisation. Enter the administrator password of your computer when prompted.

10) Done! Enjoy measuring entropy!


Friday, August 27, 2010

40) How to include my own database ID along with the NCBI ID on the BLAST's result page?

1. Edit the sequence header as shown to include your own database id (eg.STAT_00044 from my database) in the fasta file

>gi|34392270|emb|CAE46396.1| STAT_00044 Dd-STATb protein [Dictyostelium discoideum]

2. formatdb

formatdb -i input.fasta -p T -o T

3. In the /usr/local/wwwblast-2.2.17 folder of BioSLAX, backup blast.html and blast.cgi

4. Change the content of blast.cgi as shown (pipe the output of blast.REAL and do sed to get the hyperlink of your own database Id included in the blast result)

#!/bin/csh -f

#
# $Id: blast.cgi,v 1.1 2002/08/06 19:03:51 dondosha Exp $
#

echo "Content-type: text/html"
echo ""

#setenv DEBUG_COMMAND_LINE TRUE
setenv BLASTDB db





# I used my own database STATdb_Id as an example below.

5. Output example

Monday, December 14, 2009

37) LAPIS: Converting blast hit sequences into fasta format

LAPIS is an Open Source software written in JAVA. One of its uses is to make data presentation simpler for analysis by extracting suitable information from a text file.

For example, it can be used to extract only the subject and the identifier (NCBI GI number) information from a BLAST result. Below is an illustration of how this can be done using LAPIS:

Before

>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747

Score = 570 bits (1470), Expect = e-163
Identities = 284/284 (100%), Positives = 284/284 (100%)
Frame = -2

Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143


After

>gi|126385999
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT

Methodology:
1. Select the line containing the fasta description together with the line containing the subject sequences: “line containing > or line containing sbjct” ->Tools ->Extract
2. To get rid of the numbers in the line containing sbjct: “digits in line containing sbjct” ->Tools -> Omit
3. To get rid of sbjct: “sbjct:” -> Extract -> Omit
4. To get rid of dashes: type “-" -> Extract -> Omit
5. To get rid of the extra spaces in the lines containing sequences: “spaces not in line containing >” -> Tools -> Omit
6. In case you want to clean up the description line to only have the GI
From second | in line containing > to start of linebreak


Screen shots: The following screen shots shows the input and the output at each step











For extracting specific information, the user needs to find a pattern and type it in the pattern box as shown below:













The pattern above is used to extract the two necessary lines for the further analysis.







Next the user should remove the positions (digits) in the subject line.













The screen shot below shows the highlighted digits to be omitted.







The screen shot below shows the information after omitting the numbers from the subject line.







Next the word “subj:” should be omitted also.



















The screen shot below shows the pattern to remove any extra spaces in the subject line. In case of any gaps (-), they should also be removed.




















The screen shot below shows the pattern to remove extra information from the header line.













The screen shot below shows the required output.

Sunday, September 27, 2009

36) OpenBabel GUI - file format converter



It is a chemical toolbox designed to speak the many languages of chemical data.

OpenBabel GUI for windows can be downloaded from the following website,


OpenBabel is also available for LINUX and MACINTOSH,

Basic operation:

OpenBabel converts chemical objects (currently molecules or reactions) from one file format to another. This interface is an alternative to a command line and has the same capabilities.

- Select the type of the type of the input file from the dropdown list.

- Click the "..." button and select the file. Its contents are displayed.

- Choose the output format and file in a similar way. You can merely display the output without saving it by not selecting an output file or by checking "Output below only..".

- Click the "Convert" button.

The message window below the button gives the number of molecules converted, and the contents of the output file are displayed.

By default, all the molecules in an input file are converted if the output format allows multiple molecules.



NOTE: This would be useful for the people who are handling different docking softwares, because the file formats for each docking software will differ, based on the type of algorithm used and the development of the software.

Saturday, September 26, 2009

35) Autodock 4.0 (Installation and working)


Autodock 4.0 can be found from the following website,

download the software according to your computer appropriate operating system,

Installation:
I.windows installation:
1.Double click on to the MGLTools setup file.
2.Click "Next"
3.Accept the license agreement
4. Select your favorite destination folder to install the files and click 'Next"
5. wait for a while for the files to get installed
6. click on to the "Non commercial-Install all these packages"
7. click on to each package license
8. click "continue" button below
9. If you wish to launch shortcuts on your desktop , you can just click "finish"
10. And there your are ---- you can start working.

II. Linux installation:
NOTE: Autodock need to install from /root/, it can be installed from the /home/ directory itself.

1. Download MGLTools-1.5.2-Linux-x86-Install grant execute permission and run the installer.
To grant execute permission, type the following command at a command prompt: chmod +x MGLTools-1.5.2-Linux-x86-Install .
You can also use file manager to grant execute permission (right-click on the downloaded file and select Properties).

2) In the install wizard, click Next, select the destination and click Next again.
The default destination is /usr/local/MGLTools-1.5.2 for the root user and $HOME/MGLTools-1.5.2 for all others.
Please wait a few moments until the "License Agreements for add-on packages" window appears.

3) Complete the "License Agreements for add-on packages" form.
See Filling add-on packages license form for details.

4) Finally, click Finish to complete installation,

The README file contains detailed information on how to setup your environmental variables for command line usage.


Meanwhile download the following as well,
click on the above link,
you will pop up with an registration form, kindly fill with necessary details and download the files which will be compatible to your operating system that your using.

This will contain 2 executable files,
1. autogrid
2.autodock.

NOTE: Make sure you install everything in the same destination folder including the 2 executable files

How to work on Autodock 4.0?

Here we go..!!

Follow the below steps,

*double click on to this icon

Your workspace will look like this,

,
Kindly use the pdb and ligand files which comes along with the installation of Autodock 4.0, will be present in (MGLTools/MGL ToolsPckgs/Pmv/doc). Use hsg1.pdb (protein) and ind.pdb (ligand)
Protein Preparation:   
File -  Read Molecule - Open the file.pdb. 
Select - Select from string - Type HOH* (to remove water molecules)
Edit -  Add  Polar only 
File - write PDB  save ur file in pdbfile.pdb extension just without changing the path.  just click and remove HETATM  Click sort nodes before saving .

Ligand Preparation:   
Ligand -   input  open ur ligand.pdb - just note down the summary of the ligand.  
Torsion tree -  choose root  
again go to torsion tree - detect root  
again go to torsion tree - show root expansion  
again go to torsion tree - choose torsions  just click ok  
again go to torison tree - set number of torsions  give the number of torisons that you get in the summary of the (TORSDOF) when u open the ligand. 
Output  Save the ligand file in the same directory in ligandfile.pdbqt   

Grid preparation:   
Select  select from string  just type your active site residues and then click ok   
Flexible residues - input macromolecule - choose macromolecule - click on your pdb from the panel displayed.  
Flexible residues - choose torsions from the currently selected residues - active site residues alone will be highlighted. 
Flexible residues - redisplay the macromolecule 
Flexible residues - save the files in both    flexible as (pdbfile_flex.pdbqt)   rigid as (pdbfile_rigid.pdbqt) .
Grid - macromolecule  - open- just open ur pdb_rigid.pdbqt file, ignore the warnings, if you have any metal ions in your  proteinn ,  just open ur grid file and u can find (0.00) for ur metal ions, just find the valency for them from net  and type maunally ther and again save it.you use this now. (if you dont hav metal ions), just ignore the warnings.  
set map types - choose ligand - click on ur ligand from the panel displayed  output - save in .gpf  edit gpf - dontt do anything just click ok for the time being 
Run -  run autogrid  just browse and locate ur autogrid.exe file , just browse and locate file .gpf (grid parameter file) and note that everything should be in the same directory, were you saved your pdb in protien preparation, dont change the path for your wish, just save were it is saving  just put your autodock.exe and autogrid.exe in the same directory were ur files has been saved  

Docking:  
Docking -  Macromolecule - set rigid file name - just again open ur pdb_rigid.pdbqt file  Ligand - choose ur ligand  search parameters- just delete one zero from the 2500000 (just make it to 4 zeros), otherwise it wil take very long time.  
Docking parameters- default  other option-autodock 4 parameters- default  output - save in ligandfile.dpf  
Edit DPF - just leave and click ok 
Run - run autock  just browse and locate ur autodock.exe file  just browse and locate file .dpf (dock parameter file) and then click ok.

Analysing the results:
open the widget "Analyse" and click open, locate the ligand.dlg file,
Next click on to Macromolecule found below Dockings, just click open, it will open by default, if you cannot able to view your protein properly, simply press "N" button from the keyboard.


To view your results click on to conformations -play,

It will open a widget shown in the picture.


click on to the ampersand (&) symbol.

Analyse your results..!!

NOTE: kindly save all your files in the same folder where you have been installed your MGLTools and executable files, if you want to run the programme from your favourite folder of your choice, make sure that you place just the executable files alone in the place were you have saved your all  files that you have been used during the course of working on Autodock, because without these executable files the programme wont run properly.       

For more details regarding the algorithm,f orcefields and interpretation of the data kindly go through the Autodock manual which is available at below mentioned web site, 
http://autodock.scripps.edu/faqs-help/tutorial/using-autodock-4-with-autodocktools

34) How to set up web blast using BioSlax

At this stage you should have the bioslax installed, if not please refer to

A guide to using BioSlax and/or Installing BioSlax as a full linux system to your hard drive


1. First, get your sequence dataset ready in FASTA format

2. Create a directory (for example: /root/Desktop/db) and save the FASTA file of interest to this newly created directory



3. Now convert the FASTA formatted sequences into a blastable format, to do this, use the program “formatdb” which available for download from NCBI at http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

*For BioSlax user, you do not need to download formatdb for BioSlax because it comes preinstalled

4. At console terminal, use this command to invoke Formatdb:
formatdb –i -p
Examples:
#formatdb –i ecoli.nt –p F
#formatdb -i swissprot -p T -o T

–p => type of sequence file;
T(rue) => the fasta file contains protein sequences
F(alse) => the fasta file contains nucleotide sequences
-o T => create an index file so that it is faster to run BLAST (optional)



If the command is successful, a bunch of new files are created.



5. Now, go to K-menu => BioSLAX => Desktop Apps => BLAST DB setup



6. Fill up or browse the path of the directory containing the formatdb formatted sequence files (the directory you created at part 1) => click on “Set BLAST DB Now!” for Web Blast to do the necessary setup configurations



* Please note that the example “/root/Desktop/db” directory is fine if it is on a hard disk or a “configsave” has been done. If it's on a machine booted by LiveCD & no “configsave” is done, the files will be lost when the system is shutdown. Therefore, we recommend that user to create the database outside the virtual environment, i.e in a USB hdd or thumb drive.

7. We are done, below are the results you will see:



BLAST result page



The location of your database after setting up BLAST Db: /usr/local/wwwblast-2.2.17/db






33) How to set-up the bioslax laptop servers

At this step you should have the bioslax installed in your laptop, if not please refer to
A guide to using BioSlax and/or Installing BioSlax as a full linux system to your hard drive


1. First of all, you need to change the password:

  1. Run the Konsole terminal
  2. bioslax#: passwd
  3. follow the instructions on the screen (please don’t set simple password)



2. To change host name: (skip this step if it has already been done)

  1. Run the Konsole terminal
  2. root@slax: pico /etc/HOSTNAME
  3. Change “bioslax.maxonline.com.sg” to your host+domain name (e.g.group01.bic.nus.edu.sg).
  4. Reboot

3. If the previous step 2 did not work, do the following, otherwise skip to step 4:

  1. Run the Konsole terminal
  2. bioslax#: cd /etc/rc.d
  3. bioslax#: pico rc.M
  4. You will be in the file rc.M. Type ctrl-W to search for "HOSTNAME". Type ctrl-W again, press enter and it will search the next word containing "HOSTNAME". You should be looking for the line "Set hostname by using DHCP response".
  5. Comment the whole paragraph from "set hostname by using DHCP response" to "done" by typing # at the start of each line.
  6. Save the changes by typing ctrl-X.
  7. Run the Konsole terminal again.
  8. bioslax#: netconfig
  9. There will be a blue window opened where you can enter the hostname, e.g."group01".
  10. Enter the domain name, e.g."bic.nus.edu.sg".
  11. Select "DHCP" as the network connection.
  12. At the "Set DHCP Hostname" window, leave it blank and press OK.
  13. Confirm the changes.
  14. Reboot

4. Set password for MYSQL

  1. Log in to Webmin (e.g. type http://group01.bic.nus.edu.sg:10000 for remote access) in Mozilla Firefox.
  2. Go to Servers
  3. Select MySQL Database Server
  4. Go to User Permissions
  5. Search for the row containing user “Anonymous”, host “localhost” and click on the "Anonymous" link to change the settings for that row:
    Under “Permissions” Click “Select Table Data” (it will become highlighted) Click “Save”
  6. Search for the row containing user “Anonymous”, host “bioslax” and click on the "Anonymous" link to change the settings for that row:
    Under “Hosts” click the radio button next to the text-box and in the textbox type in “groupXX”
    Under “Permissions” Click “Select Table Data” (it will become highlighted)
    Click “Save”
  7. Search for the row containing user “root”, host “localhost” and click on the "Anonymous" link to change the settings for that row:
    Under “Password” click the radio button for “Set to” and fill in the root password in the text box
    Click “Save”
  8. Search for the row containing user “root”, host “bioslax” and click on the "Anonymous" link to change the settings for that row:
    Under “Password” click the radio button for “Set to” and fill in the root password in the text box
    Under “Hosts” click the radio button next to the text-box and in the textbox type in “groupXX”
    Click “Save”


5. Three different ways to access the server remotely

  1. WinSCP (Freeware; http://winscp.net/).

    1. Hostname: group02.bic.nus.edu.sg
    2. Port number: 22
    3. Username: root (or any other username that you may have created on the server)
    4. Password: (type your password)
    5. Protocol: SFTP (allow SCP fallback)
    6. Click “Login”. If prompted for security alert, click “Yes”.

  2. Putty (Freeware; http://www.putty.nl/).

    1. Hostname: group02.bic.nus.edu.sg
    2. Port number: 22
    3. Connection type: SSH
    4. Click “Open”. If prompted for security alert, click “Yes”.
    5. Login as: root (or any other user id)
    6. Password: (type your password)

  3. Web browser (URL:http://group02.bic.nus.edu.sg)

    1. To access files from the web, you should save your web pages in the “/var/www/htdocs” folder
    2. To access pre-installed programs in Bioslax, such as blast and clustalw:

      1. http://group01.bic.nus.edu.sg/blast/
      2. http://group01.bic.nus.edu.sg/clustalw/
      3. http://group01.bic.nus.edu.sg/webphylip/
      4. http://group01.bic.nus.edu.sg/t_coffee/
      5. http://group01.bic.nus.edu.sg/wemboss/
      6. http://group01.bic.nus.edu.sg/sms2/

Wednesday, September 2, 2009

32) Duplicates Finder - a java standalone application to find sequence duplicates in your dataset

Protein (Nucleotide) Sequences downloaded from NCBI Taxonomy Browser or other similar databases for an organism, in our case viruses, contain sequences which are duplicates of each other. Sometimes, these sequence duplicates turn out to be duplicate GenPept records, as found during our analyses, which need to be removed from the dataset to remove bias.

One could use the ‘Remove duplicates’ function in Excel to do this; it returns the list of unique sequences. However, Excel does not do a good job when the dataset is large (large number or long sequences).

Here, we provide a simple desktop application to extract the sets of duplicates from a dataset and also provide the list of unique sequences from the input dataset.

How to use the program:

  1. Download the zip file DuplicateFinder.zip here. Extract contents to a desired location.

  2. Click on the DuplicatesFinder.jar, the program starts and you are good to go. Since this is a java program, it is OS independent.
    (NOTE: You need to have java runtime environment (JRE) installed in your system to run the program. To check if your system has java installed, refer to the end of this post.)

Input Files accepted by the program:
  1. Comma separated values (*.csv)
    • The data should contain two colums – Identifier number, Sequence.
      Identifier is a unique number to identify the sequence.
      Sequence is the set of sequences to be analyzed.

  2. FASTA (*.fasta) or Tab delimited (*.txt)
    • The sequences should be in fasta format as shown below
      >716349
      uquiertyqyjej
      quiejhieq
      >837468
      oiweirjtyiwj

Output Files created by the program:
The output files are created in the same directory and folder as the input file.
Two output files are created:

  1. Duplicates sequences File: A file containing the sets of duplicates (Duplicates.csv). The columns in the file are:
    • Duplicate Set – the index of groups of duplicates;

    • Count – the index of a duplicate sequence in the set of duplicates it belongs to. So each series of numbers from 1 to n represents a set of sequence duplicates with a total of n duplicate sequences in that set;

    • Sequence – the duplicate sequences in ascending order (for easy visual checking).

    • Identifier – the number from the identifier column in the .csv file or the identifier number after '>' in the .fas and .txt files.

    • Remarks – remarks if any.

  2. Unique sequences file: A file containing the unique sequences from the dataset in fasta format (Unique.fas). The file includes one representative from each set of duplicates.


NOTE: In both output files, the sequences are arranged in ascending order of the sequence.

A Working Example
  • Input file with 3 unique sequences and 2 groups of duplicates
    (Group 1 – 1A and 1B; Group 2 – 2A, 2B and 2C)


  • Start the Duplicate Finder program. Click ‘Open’. A file chooser with file filtering set to show only *.csv, *.fas or *.txt files pops up. Select your input file (here, Sample.fas). Click ‘Open’.


  • The path of your input file is set. Click ‘Generate results’. Wait for the confirmation window saying “Done” to indicate that your results have been generated. Click ‘OK’.


  • Go to the folder containing your input file. You will find two new files created with names Duplicates.csv and Unique.fas. In our case, our input file was Sample. So the results are SampleDuplicates.csv and SampleUnique.fas.


  • Output file 1: SampleDuplicates.csv
    Duplicate column: index for groups of duplicates.
    In our example, there were two groups of duplicates in the input file.
    Count column: the index of a duplicate sequence in the group of duplicates it belongs to.
    In our example, group 1 had two duplicate sequences – Duplicate1A and Duplicate1B – both of which are indicated in the Sequence column, followed by their identifier in the next column. Group 2 had three duplicate sequences – Duplicate2A, Duplicate2B and Duplicate2C – which are the three sequences indicated.
    Sequence column: the duplicate sequences in ascending order (for easy visual checking).
    Identifier column: the number from the identifier column in the .csv file or the identifier number after '>' in the .fas and .txt files


  • Output file 2: SampleUnique.fas
    This file contains the set of unique sequences from the input dataset in fasta format. The set of unique sequences includes one representative from each set of duplicates, usually the first one in the set of duplicates.
    In our example, we have 3 unique sequences – Unique1, Unique2 and Unique3 – which are present in the output file. In addition, the output also contains one representative from each set of duplicates – Duplicate1A from the first set and Duplicate2A from the second set.


To check if your system has JRE:
  1. Go to command prompt and type java –version.


  2. If your system has java, you will see the version details similar to one shown:


a) If your system does not have java installed, you get the error:
‘java’ is not recognized as an internal or external command, operable program or batch file.
b) You can then download the JRE for your OS from Sun Microsystems: http://www.java.com/en/download/manual.jsp)

=UPDATE=

This program is only able to remove exact EQUAL LENGTH duplicate sequences; it does not treat subset duplicates as duplicates. For example, let's say we have the following sequences:

>1
AAAAAAAAA
>2
AAAAAAAA
>3
AAAAAAAA

The output of unique sequences from the program are:

>1
AAAAAAAAA
>2
AAAAAAAA

You can see that though >2 is a subset duplicate of >1, the program still considers it as unique because it only removes exact same length duplicate sequences, such as >2 and >3, where one was removed and the other kept.

What happens if you have gaps in your input sequences? Let's say we have the following input sequences:

>1
AAAAAAAAA
>2
AAAAAAAA-
>3
AAAAAAAA-
>4
AAAAAAAAAB
>5
AAAAAAAAAB
>6
AAAAAAAAAC

The output that you get from the program is:

>2
AAAAAAAA-
>1
AAAAAAAAA
>4
AAAAAAAAAB
>6
AAAAAAAAAC

Just like in the previous example, only exact equal length duplicates are removed and the gap is treated like an amino acid. So, if you do not want gaps to affect the removal of duplicates (especially if they are in the middle of the sequences), you should remove them using the simple replace function in most text editors (replace "-" with nothing).

So what if you are interested in removing subset duplicates? What program can you use?
You can use Jalview and a tutorial is available here.

Content by: Rashmi & Asif M. Khan
Posted by: Rashmi
Edited by: Asif M. Khan

Thursday, August 20, 2009

31) Multiple sequence titles in the same BLAST hit

We were curious about what it means when BLAST returns a BLAST hit indicated by a “>” but followed by multiple entries with different accession numbers.

Are these sequences the same? And are they the same only at the aligned region or in the rest of the sequence as well?

I did a check on how these sequences are related. Based on several examples (two listed below), for now, it could be concluded that these entries that fall under the same BLAST hit are full sequence duplicates. However, in terms of annotations they could be from different isolates.

Case 1: 2 accession numbers under the same BLAST hit

Alignment between full sequences of Q6R325.1 and AAS01728.1
  • 100% identity, i.e. full sequences are exactly the same
  • Same author
  • Swissprot entry (Q6R325.1) has cross-reference to Genbank entry (AAS01728.1)
Case 2: Multiple accession numbers under the same BLAST hit

Multiple alignment of full sequences of all sequence titles
  • Perfect alignment
  • Same author
  • Only the isolate annotation is different

Wednesday, August 19, 2009

30) Using VLOOKUP (or HLOOKUP) in Microsoft Excel

VLOOKUP is a very useful formula in Excel to perform tasks such as:
  • Data comparions between sheets
  • Checking for discrepancies
  • Linking data between sheets
and the list can go on!

The formula looks like this:
VLOOKUP(lookup_value,table_array,col_index_num,range_lookup)

--
lookup_value: The value to search in the first column of the table array

table_array: Two or more columns of data. The values in the first column of table_array are the values searched by lookup_value. These values can be text, numbers, or logical values. Uppercase and lowercase text is equivalent.

col_index_num: The column number in table_array from which the matching value must be returned. A col_index_num of 1 returns the value in the first column in table_array; a col_index_num of 2 returns the value in the second column in table_array, and so on. If col_index_num is:
  • Less than 1, VLOOKUP returns the #VALUE! error value.
  • Greater than the number of columns in table_array, VLOOKUP returns the #REF! error value.

range_lookup: A logical value that specifies whether you want VLOOKUP to find an exact match or an approximate match:
  • If FALSE, VLOOKUP will only find an exact match. In this case, the values in the first column of table_array do not need to be sorted. If there are two or more values in the first column of table_array that match the lookup_value, the first value found is used. If an exact match is not found, the error value #N/A is returned.
  • If TRUE or omitted, an exact or approximate match is returned. If an exact match is not found, the next largest value that is less than lookup_value is returned.
--

The values in the first column of table_array must be placed in ascending sort order; otherwise, VLOOKUP may not give the correct value. You can put the values in ascending order by choosing the Sort command from the Data menu and selecting Ascending. For more information, see Default sort orders.

A good reference for more details of the formula can be found here.
These details are illustrated in the following example.

Step 1:
a) Once you have your table_array (i.e. your “reference dataset” with the first column containing), and the list which you would like to lookup for, enter the VLOOKUP formula in the cell as follows:

b) Note that in the formula, I have put in a “$” sign for the data source table indexes. This is to lock the table_array as A2:B5 and not allow it to slide as we drag the formula down to the rest of the cells.
This is what it’ll look like if I had not put the “$” sign there:

Notice that the data source table_array indexes have shifted (green box) and hence the value “1” could not be found. This will result in a #N/A error, as follows:

To avoid this, it is safer to use the “$” sign in your formula as shown in the Step 1(a).

Step 2: Hover on the edge of the cell, till your cursor becomes a “+”.
Click and drag the formula down to fill the rest of the cells.


Step 3: And this is what you’ll get:

For data comparions from different sheets:
If your source/reference data is stored in another sheet, simply add the sheet name at the beginning of the [table_array] field, like this:

What if I have a horizontal set of data as my table_array?
The VLOOKUP formula described above is used when the table_array is vertical and the lookup values are in the first column.
Use HLOOKUP when the table_array is horizontal and the lookup values are in the top row. The other details, such as using the “$” sign, remain the same.

What does #N/A mean?
The error #N/A indicates that the source for a particular data was not found in the table_array. This could mean either (i) your table_array had shifted, as explained above in Step 1(b) or (ii) the data of interest is not present in the reference table.

Counting the #N/A's
Sometimes we might be interested in the number of data entries that do not have a match in the table_array (the reference table).

To count the number of #N/A errors in a dataset, use this formula:

=IF(ISNA(A1),1,0)

This prints 1 if cell "A1" contains #N/A, else prints 0. Sum up the column to obtain a total.