SnpEff is a toolkit for annotating genetic variants and predicting their effects. GATK, a variant-calling toolkit, supports SnpEff, and the two are often used together (see the GATK tutorial for more information).
A few points are important when using SnpEff on the HPCC. This tutorial covers the basics of getting started.
SnpEff Configuration File
The SnpEff configuration file specifies key run parameters, most importantly the location of the databases to be used for your analysis. A configuration file template is provided in the following common directory:
To use SnpEff, first copy this file to your working directory and make any changes your analysis requires. Note that the configuration parameter "data_dir" (the database location) defaults to the following path in your home directory space:
You may leave this as-is, provided that you actually place your databases in this path. Otherwise, you will need to update it to point to the correct location.
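The copy-and-edit step above might look roughly like the following sketch. Every path in it is a placeholder: the template is a one-line stand-in created on the spot (on the HPCC you would copy the provided template instead), and "db_dir" is a hypothetical database location. The stand-in line uses the "data_dir" key named in this tutorial; check the key and default value in your own copy of the file before editing it.

```shell
# Sketch only: the template below is a stand-in for the site-provided file,
# and every path here is a placeholder.
workdir=$(mktemp -d)                    # stands in for your working directory
template="$workdir/template.config"     # hypothetical template location
printf 'data_dir = ~/snpEff/data/\n' > "$template"

# Step 1: copy the template into the working directory.
config="$workdir/snpEff.config"
cp "$template" "$config"

# Step 2: point data_dir at the directory that really holds the databases.
db_dir="$workdir/databases"             # assumed database location
mkdir -p "$db_dir"
sed "s|^data_dir[[:space:]]*=.*|data_dir = $db_dir|" "$config" > "$config.new"
mv "$config.new" "$config"

# Step 3: confirm the setting before running SnpEff.
grep '^data_dir' "$config"
```

Writing the edited file to a temporary name and then renaming it avoids relying on in-place editing options, which differ between sed implementations.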
For more information on the other configuration parameters, please refer to Section 3, "Configuration", of the SnpEff documentation.
As of this writing, the developers state that more than 2,500 pre-built databases are available for use with SnpEff. For most users, this means that building your own database should be unnecessary. If your genome is not supported, however, please refer to the SnpEff documentation (Section 17, "Building a Database").
The current list of pre-built databases available for SnpEff can be obtained by using the following:
For your convenience, a copy of this list has been saved in the following common directory path:
Searching that file with grep is probably the easiest way to find out whether your genome is supported.
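As a rough illustration of that search, the sketch below runs grep against a small stand-in list. The rows, the column layout, and the example.org URLs are invented for the example and may not match the real "supported_dbs" file; on the HPCC you would point grep at the real file and search for your own organism or genome name.

```shell
# Stand-in for the "supported_dbs" list; the rows and column layout below are
# invented for illustration and may not match the real file.
list=$(mktemp)
cat > "$list" <<'EOF'
genome_a    Example_organism_one    http://example.org/db/genome_a.zip
genome_b    Example_organism_two    http://example.org/db/genome_b.zip
genome_c    Example_organism_one    http://example.org/db/genome_c.zip
EOF

# Case-insensitive search for an organism or genome name.
grep -i 'organism_one' "$list"

# Count the matching lines; a count of 0 means no entry matched the term.
matches=$(grep -ic 'organism_one' "$list")
echo "matching entries: $matches"
```

The -i option makes the search case-insensitive, which helps when you are unsure how a genome name is capitalized in the list.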
The "supported_dbs" file also contains download URLs for the pre-built databases, or you can browse for them on the SnpEff SourceForge website.
However, the recommended method of obtaining the most recent pre-built databases is to use the SnpEff command itself. For example:
The command above places a copy of the human genome database in the directory ~/snpEff/data.
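A download step along these lines can be previewed before anything is fetched. The sketch below only assembles and prints a command. The jar path, the configuration file name, and the genome name "GRCh37.75" are assumptions for illustration; use a genome name that appears in the list of supported databases for your SnpEff version.

```shell
# Assumed values: adjust the jar path, config path, and genome name.
snpeff_jar="snpEff.jar"      # placeholder path to the SnpEff jar
config="snpEff.config"       # your edited copy of the configuration file
genome="GRCh37.75"           # example name; pick one from the supported list

# Assemble and print the command so it can be checked before downloading.
download_cmd="java -jar $snpeff_jar download -c $config -v $genome"
echo "$download_cmd"

# To perform the download, run the command directly, for example:
# java -jar "$snpeff_jar" download -c "$config" -v "$genome"
```

Passing the configuration file with -c means the database should land in whatever directory that file names as the database location, rather than in the default path.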
Note that grep'ing on the "supported_dbs" file yields the following:
Finally, update the configuration file to reflect where you actually placed your database files (for example, if you placed them on scratch to maximize I/O performance).
Once you have modified your configuration file and downloaded the databases you need, you are ready to run SnpEff. When you run the SnpEff command, be sure to specify the location of the configuration file you wish to use for the run.
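A run that names its configuration file explicitly might be put together as in this sketch. All file names, the genome name, and the 4 GB Java heap size are placeholders chosen for illustration rather than recommended settings. The script only prints the command so that it can be reviewed first.

```shell
# Assumed values: replace these with your own paths and genome name.
snpeff_jar="snpEff.jar"        # placeholder path to the SnpEff jar
config="snpEff.config"         # configuration file edited for this run
genome="GRCh37.75"             # database name matching what you downloaded
input_vcf="input.vcf"          # hypothetical input file of called variants
output_vcf="annotated.vcf"     # hypothetical output file

# Assemble the annotation command; -c names the configuration file to use.
run_cmd="java -Xmx4g -jar $snpeff_jar eff -c $config -v $genome $input_vcf"
echo "$run_cmd > $output_vcf"

# To run the annotation, execute the command directly, for example:
# java -Xmx4g -jar "$snpeff_jar" eff -c "$config" -v "$genome" "$input_vcf" > "$output_vcf"
```

SnpEff writes the annotated variants to standard output, which is why the sketch redirects the result to a file.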
For more information on using SnpEff, please refer to the following documentation: