Short read quality and trimming


Reminder: if you’re on Windows, you should install mobaxterm.

OK, you should now be logged into your Amazon computer! How exciting!

Prepping the computer

sudo chmod a+rwxt /mnt
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc bowtie2

Data source

We’re going to be using a subset of chicken data from Ayers et al., 2013.

You can find the full data set on the Short Read Archive under the accession number on the paper:

We’ll be working with RNA from two pooled samples (male and female chick blastoderms) – here are the direct sample links:

Note that each sample has two replicates, and each replicate has two files.

Don’t download them, but if you were downloading these yourself, you would want the “Fastq files (ftp)”, both File 1 and File 2. (They each take a few hours to download!)

1. Copying in some data to work with.

We’ve loaded subsets of the data onto an Amazon location for you, to make everything faster for today’s work. Let’s grab the first two files:

cd /mnt
mkdir data
cd data

curl -O -L
curl -O -L

Now if you type:

ls -l

you should see something like:

-rw-rw-r-- 1 ubuntu ubuntu 7920534 Mar  4 14:49 SRR534005_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 8229042 Mar  4 14:49 SRR534005_2.fastq.gz

These are 100,000 read subsets of the original two SRA files.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let’s fix that before we go on any further:

chmod u-w

1. Copying in some data to work with.

First, make a working directory:

mkdir /mnt/work
cd /mnt/work

Now, make a “virtual copy” of the data in your working directory, but under better names:

ln -fs /mnt/data/SRR534005_1.fastq.gz female_repl1_R1.fq.gz
ln -fs /mnt/data/SRR534005_2.fastq.gz female_repl1_R2.fq.gz

These are FASTQ files – let’s take a look at them:

less female_repl1_R1.fq.gz

(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)


  • why are some files named SRR*?
  • why are some files named female*?
  • why are there R1 and R2 in the name?


2. FastQC

We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ on our computer - that’s what the ‘apt-get install’ did, above.

Now, run FastQC on both of the female files:

fastqc female_repl1_R1.fq.gz
fastqc female_repl1_R2.fq.gz

Now type ‘ls’:


and you will see


We are not going to show you how to look at these files right now - you need to copy them to your local computer. We’ll show you that tomorrow. But! we can show you what they look like, because I’ve copied them somewhere public for you: female_repl1_R1.fq_fastqc/fastqc_report.html and female_repl1_R2.fq_fastqc/fastqc_report.html.


  • What should you pay attention to in the FastQC report?
  • Which is “better”, R1 or R2?


3. Trimmomatic

Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via apt-get.

The first thing we’ll need are the adapters to trim off:

curl -O -L

Now, to run Trimmomatic:

TrimmomaticPE female_repl1_R1.fq.gz female_repl1_R2.fq.gz\
     female_repl1_R1.qc.fq.gz s1_se female_repl1_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \

You should see output that looks like this:

Quality encoding detected as phred33
Input Read Pairs: 100000 Both Surviving: 96615 (96.62%) Forward Only Surviving: 3282 (3.28%) Reverse Only Surviving: 95 (0.10%) Dropped: 8 (0.01%)
TrimmomaticPE: Completed successfully


  • How do you figure out what the parameters mean?
  • How do you figure out what parameters to use?
  • What adapters do you use?
  • What version of Trimmomatic are we using here? (And FastQC?)
  • Are parameters different for RNAseq and genomic?
  • What’s with these annoyingly long and complicated filenames?
  • What do we do with the single-ended files (s1_se and s2_se?)

For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.


4. FastQC again

Run FastQC again:

fastqc female_repl1_R1.qc.fq.gz
fastqc female_repl1_R2.qc.fq.gz

And now view my copies of these files: female_repl1_R1.qc.fq_fastqc/fastqc_report.html and female_repl1_R2.qc.fq_fastqc/fastqc_report.html.

Let’s take a look at the output files:

less female_repl1_R1.qc.fq.gz

(again, use spacebar to scroll, ‘q’ to exit less).


  • Why are some of the reads shorter than others?
  • is the quality trimmed data “better” than before?
  • Does it matter that you still have adapters!?

5. Subset and trim the rest of the sequences

Now let’s download all the rest of the samples:

cd /mnt/data
curl -O -L
curl -O -L
curl -O -L
curl -O -L
curl -O -L
curl -O -L
chmod u-w *.gz

Go back to the work directory, and copy them in:

cd /mnt/work
ln -fs /mnt/data/SRR534006_1.fastq.gz female_repl2_R1.fq.gz
ln -fs /mnt/data/SRR534006_2.fastq.gz female_repl2_R2.fq.gz

ln -fs /mnt/data/SRR536786_1.fastq.gz male_repl1_R1.fq.gz
ln -fs /mnt/data/SRR536786_2.fastq.gz male_repl1_R2.fq.gz

ln -fs /mnt/data/SRR536787_1.fastq.gz male_repl2_R1.fq.gz
ln -fs /mnt/data/SRR536787_2.fastq.gz male_repl2_R2.fq.gz

TrimmomaticPE female_repl2_R1.fq.gz female_repl2_R2.fq.gz\
     female_repl2_R1.qc.fq.gz s1_se female_repl2_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \

TrimmomaticPE male_repl1_R1.fq.gz male_repl1_R2.fq.gz\
     male_repl1_R1.qc.fq.gz s1_se male_repl1_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \

TrimmomaticPE male_repl2_R1.fq.gz male_repl2_R2.fq.gz\
     male_repl2_R1.qc.fq.gz s1_se male_repl2_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \

Next: Building a new reference transcriptome

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.