
This tutorial has been modified slightly from one originally provided as part of the ANGUS bioinformatics course. Modifications have been made to accommodate following the tutorial on the HPCC instead of Amazon EC2.

Checking read quality with FastQC

When you get your sequences back from a sequencing facility, it’s important to check that they are high quality (garbage in, garbage out). In this tutorial, we’ll use FastQC, a program that checks whether the sequence reads in a .fastq file exhibit any unusual properties (which might indicate either low sequence quality or interesting biological features in your sample).

Getting the data

The data used in this tutorial have been selected and downloaded for your convenience. They are located in the HPCC directory:

 

/mnt/research/common-data/Bio/AngusData

 

Copy the following files to your working directory. First, a file of "good" sequences in FASTQ format:

 

cp /mnt/research/common-data/Bio/AngusData/good_sequence_short.fastq .

 

Then a "bad" one:

 

cp /mnt/research/common-data/Bio/AngusData/bad_sequence_short.fastq .
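Before running any quality checks, it can be worth confirming that both files were copied and look like FASTQ. The following is a minimal sketch, not part of the original tutorial: the helper name count_reads is hypothetical, and it assumes every record occupies exactly four lines (header, sequence, "+", qualities) with no line-wrapped sequences.

```shell
# Count reads in a FASTQ file, assuming exactly four lines per record
# (header, sequence, "+", qualities) and no line-wrapped sequences.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

for f in good_sequence_short.fastq bad_sequence_short.fastq; do
    # Skip files that have not been copied into this directory yet.
    [ -f "$f" ] || continue
    echo "$f: $(count_reads "$f") reads"
    head -n 4 "$f"   # show the first record
done
```

If a file is reported with zero reads or is skipped entirely, re-check the cp commands above before continuing.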

 

Running FastQC

To run FastQC on the HPCC in interactive mode, you will need to establish an X-connection over SSH. On workstations running macOS or Linux, open a terminal and connect to the HPCC with X11 forwarding enabled (the -X option of ssh).

Windows users will need PuTTY and either Xming or Cygwin-X to establish an X-connection over SSH. You can follow these instructions for Xming, or stop by the HPCC and pick up a preloaded thumb drive with the software you need.

Once you are connected to Gateway with an X-session, you will need to log in to one of the dev-nodes before running FastQC:

 

ssh dev-amd09

 

Now, simply load the module file for FastQC (remember to do this on a dev-node):

 

module load FastQC

 

There are two ways in which FastQC can be run: in "command line" mode, or as a GUI (graphical user interface).  This tutorial addresses the command line version of FastQC.  Let's start by analyzing our "good" file:

 

fastqc ./good_sequence_short.fastq

 

This will generate a self-contained directory called "good_sequence_short_fastqc", which contains an HTML-formatted report that can be loaded into a browser. If we change into that directory and print the contents of the file "summary.txt", we can see which tests passed and which failed:

 

cd good_sequence_short_fastqc
cat summary.txt
PASS    Basic Statistics    good_sequence_short.fastq
PASS    Per base sequence quality   good_sequence_short.fastq
PASS    Per sequence quality scores good_sequence_short.fastq
WARN    Per base sequence content   good_sequence_short.fastq
PASS    Per base GC content good_sequence_short.fastq
PASS    Per sequence GC content good_sequence_short.fastq
PASS    Per base N content  good_sequence_short.fastq
PASS    Sequence Length Distribution    good_sequence_short.fastq
PASS    Sequence Duplication Levels good_sequence_short.fastq
PASS    Overrepresented sequences   good_sequence_short.fastq
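For longer summaries it can be easier to show only the modules that did not pass. A minimal sketch follows, assuming summary.txt is tab-separated with three columns (status, module name, input file name); the function name flagged_modules is hypothetical and not part of FastQC.

```shell
# Print the modules in a FastQC summary.txt whose status is not PASS.
# Assumes tab-separated columns: status, module name, input file name.
flagged_modules() {
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Run from inside the report directory, where summary.txt lives.
if [ -f summary.txt ]; then
    flagged_modules summary.txt
fi
```

For the "good" summary shown above, this should leave only the single WARN entry for per base sequence content, provided the file is tab-separated as assumed.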

 

If we were to open the file "fastqc_report.html" in a browser, we would see:


The image above presents only a small portion of the output you receive from FastQC and is provided for demonstration purposes only. Scroll down through your FastQC results to see other useful charts and tables, or click on the links in the left-hand pane.


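Now we can repeat this procedure using our file of "bad" sequences. Because the previous step changed into the report directory, first return to your working directory:

cd ..
fastqc ./bad_sequence_short.fastq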

 

Changing into the new report directory and printing the summary gives:

 

cd bad_sequence_short_fastqc
cat summary.txt
PASS    Basic Statistics    bad_sequence_short.fastq
FAIL    Per base sequence quality   bad_sequence_short.fastq
PASS    Per sequence quality scores bad_sequence_short.fastq
WARN    Per base sequence content   bad_sequence_short.fastq
WARN    Per base GC content bad_sequence_short.fastq
WARN    Per sequence GC content bad_sequence_short.fastq
PASS    Per base N content  bad_sequence_short.fastq
PASS    Sequence Length Distribution    bad_sequence_short.fastq
WARN    Sequence Duplication Levels bad_sequence_short.fastq
WARN    Overrepresented sequences   bad_sequence_short.fastq
FAIL    Kmer Content    bad_sequence_short.fastq
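To see at a glance where the two runs differ, the two summaries can be lined up by module name. This is a sketch under the same assumption that summary.txt is tab-separated; compare_summaries is a hypothetical helper, and the relative paths assume you are still inside the bad_sequence_short_fastqc directory with the "good" report directory alongside it.

```shell
# Print "module<TAB>status in first file<TAB>status in second file"
# for every module listed in the second summary. A "-" means the
# module does not appear in the first summary.
compare_summaries() {
    awk -F'\t' '
        FILENAME == ARGV[1] { first[$2] = $1; next }
        { print $2 "\t" (($2 in first) ? first[$2] : "-") "\t" $1 }
    ' "$1" "$2"
}

good=../good_sequence_short_fastqc/summary.txt
bad=summary.txt
if [ -f "$good" ] && [ -f "$bad" ]; then
    compare_summaries "$good" "$bad"
fi
```

Rows whose second and third columns differ are the modules to look at first in the HTML report.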

 

Running FastQC in GUI Mode

If you want to run FastQC in GUI mode, log on to the HPCC using an X-windows session, load the module file, and start FastQC as follows:

 

module load FastQC
fastqc &
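If the GUI does not appear, a common cause is that the SSH session has no X display attached. The check below is a small sketch, not an HPCC-specific tool: it relies only on the DISPLAY environment variable, which is set when X11 forwarding is active, and the function name has_x_display is hypothetical.

```shell
# Return success if an X display appears to be available in this shell.
has_x_display() {
    [ -n "${DISPLAY:-}" ]
}

if has_x_display; then
    echo "X display found: $DISPLAY"
else
    echo "No X display; reconnect with X11 forwarding enabled (for example, ssh -X)." >&2
fi
```

Note that a set DISPLAY variable only suggests forwarding is configured; it does not guarantee that the X server on your workstation is reachable.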

 

A video has been prepared by the FastQC developers that illustrates how to use this application in GUI mode.