Note
Reminder: if you’re on Windows, you should install mobaxterm.
OK, you should now be logged into your Amazon computer! How exciting!
sudo chmod a+rwxt /mnt
sudo apt-get update
sudo apt-get install -y trimmomatic fastqc bowtie2
We’re going to be using a subset of chicken data from Ayers et al., 2013.
You can find the full data set on the Short Read Archive under the accession number on the paper: http://www.ebi.ac.uk/ena/data/view/SRA055442
We’ll be working with RNA from two pooled samples (male and female chick blastoderms) – here are the direct sample links:
http://www.ebi.ac.uk/ena/data/view/SAMN01096082
http://www.ebi.ac.uk/ena/data/view/SAMN01096083
http://www.ebi.ac.uk/ena/data/view/SAMN01096084
http://www.ebi.ac.uk/ena/data/view/SAMN01096085
Note that each sample has two replicates, and each replicate has two files.
Don’t download them, but if you were downloading these yourself, you would want the “Fastq files (ftp)”, both File 1 and File 2. (They each take a few hours to download!)
We’ve loaded subsets of the data onto an Amazon location for you, to make everything faster for today’s work. Let’s grab the first two files:
cd /mnt
mkdir data
cd data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR534005_1.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR534005_2.fastq.gz
Now if you type:
ls -l
you should see something like:
-rw-rw-r-- 1 ubuntu ubuntu 7920534 Mar 4 14:49 SRR534005_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 8229042 Mar 4 14:49 SRR534005_2.fastq.gz
These are 100,000 read subsets of the original two SRA files.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let’s fix that before we go on any further:
chmod u-w
First, make a working directory:
mkdir /mnt/work
cd /mnt/work
Now, make a “virtual copy” of the data in your working directory, but under better names:
ln -fs /mnt/data/SRR534005_1.fastq.gz female_repl1_R1.fq.gz
ln -fs /mnt/data/SRR534005_2.fastq.gz female_repl1_R2.fq.gz
These are FASTQ files – let’s take a look at them:
less female_repl1_R1.fq.gz
(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)
Question:
Links:
We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ on our computer - that’s what the ‘apt-get install’ did, above.
Now, run FastQC on both of the female files:
fastqc female_repl1_R1.fq.gz
fastqc female_repl1_R2.fq.gz
Now type ‘ls’:
ls
and you will see
female_repl1_R1_fastqc.html
female_repl1_R1_fastqc.zip
female_repl1_R2_fastqc.html
female_repl1_R2_fastqc.zip
We are not going to show you how to look at these files right now - you need to copy them to your local computer. We’ll show you that tomorrow. But! we can show you what they look like, because I’ve copied them somewhere public for you: female_repl1_R1.fq_fastqc/fastqc_report.html and female_repl1_R2.fq_fastqc/fastqc_report.html.
Questions:
Links:
Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via apt-get.
The first thing we’ll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE female_repl1_R1.fq.gz female_repl1_R2.fq.gz\
female_repl1_R1.qc.fq.gz s1_se female_repl1_R2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
You should see output that looks like this:
...
Quality encoding detected as phred33
Input Read Pairs: 100000 Both Surviving: 96615 (96.62%) Forward Only Surviving: 3282 (3.28%) Reverse Only Surviving: 95 (0.10%) Dropped: 8 (0.01%)
TrimmomaticPE: Completed successfully
...
Questions:
For a discussion of optimal RNAseq trimming strategies, see MacManes, 2014.
Links:
Run FastQC again:
fastqc female_repl1_R1.qc.fq.gz
fastqc female_repl1_R2.qc.fq.gz
And now view my copies of these files: female_repl1_R1.qc.fq_fastqc/fastqc_report.html and female_repl1_R2.qc.fq_fastqc/fastqc_report.html.
Let’s take a look at the output files:
less female_repl1_R1.qc.fq.gz
(again, use spacebar to scroll, ‘q’ to exit less).
Questions:
Now let’s download all the rest of the samples:
cd /mnt/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR534006_1.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR534006_2.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR536786_1.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR536786_2.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR536787_1.fastq.gz
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/SRR536787_2.fastq.gz
chmod u-w *.gz
Go back to the work directory, and copy them in:
cd /mnt/work
ln -fs /mnt/data/SRR534006_1.fastq.gz female_repl2_R1.fq.gz
ln -fs /mnt/data/SRR534006_2.fastq.gz female_repl2_R2.fq.gz
ln -fs /mnt/data/SRR536786_1.fastq.gz male_repl1_R1.fq.gz
ln -fs /mnt/data/SRR536786_2.fastq.gz male_repl1_R2.fq.gz
ln -fs /mnt/data/SRR536787_1.fastq.gz male_repl2_R1.fq.gz
ln -fs /mnt/data/SRR536787_2.fastq.gz male_repl2_R2.fq.gz
TrimmomaticPE female_repl2_R1.fq.gz female_repl2_R2.fq.gz\
female_repl2_R1.qc.fq.gz s1_se female_repl2_R2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
TrimmomaticPE male_repl1_R1.fq.gz male_repl1_R2.fq.gz\
male_repl1_R1.qc.fq.gz s1_se male_repl1_R2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
TrimmomaticPE male_repl2_R1.fq.gz male_repl2_R2.fq.gz\
male_repl2_R1.qc.fq.gz s1_se male_repl2_R2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25