Building a new reference transcriptome

The distinguishing feature of (what I call) semi-model organisms is that while they may have a decent genome reference, their transcriptome annotation is poor. There can be several reasons for this, but generally it boils down to lack of resources and/or attention – it takes a lot of effort to build a high-quality transcriptome!

For this purpose, we’ve already installed the chicken reference genome set on the HPC (as part of the data you loaded at the beginning). Specifically, we’ve loaded the chicken data from the Illumina iGenomes project into the RNAseq-semimodel location.

See the TopHat and Cufflinks paper:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/

Install TopHat and Cufflinks

Download and install the TopHat and Cufflinks software:

cd ~/
curl -O http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz
tar xzf tophat-2.0.13.Linux_x86_64.tar.gz

curl -O http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
tar xzf cufflinks-2.2.1.Linux_x86_64.tar.gz

echo 'export PATH=$PATH:$HOME/tophat-2.0.13.Linux_x86_64:$HOME/cufflinks-2.2.1.Linux_x86_64' >> ~/.bashrc
export PATH=$PATH:$HOME/tophat-2.0.13.Linux_x86_64:$HOME/cufflinks-2.2.1.Linux_x86_64
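
To confirm the install worked, it can help to check that the programs are actually on your PATH. Here is a minimal sketch; check_tools is a hypothetical helper of ours, not part of TopHat or Cufflinks:

```shell
# Hypothetical helper: report which of the named commands are on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools tophat cufflinks cuffmerge cuffcompare gffread should report each program as found. TopHat 2 also needs bowtie2 and samtools on your PATH, so those are worth adding to the list.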

Grab the genome

We will need the chicken genome! We’ll grab the UCSC galGal3 genome from the Illumina iGenomes project:

mkdir -p /mnt/genome
cd /mnt/genome
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/Gallus_gallus_UCSC_galGal3.tar.gz

tar xzvf Gallus_gallus_UCSC_galGal3.tar.gz
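
Once the archive is unpacked, you may want to check that the files used later on are really there. The sketch below is a hypothetical helper; the relative paths are taken from the commands further down this page, except genome.1.bt2, which is our assumption about how the Bowtie2 index with the prefix "genome" is named on disk:

```shell
# Hypothetical helper: check that the genome files used in this tutorial
# exist (and are non-empty) under the given genome root directory.
check_genome_files() {
    root=$1
    status=0
    for rel in \
        Sequence/Bowtie2Index/genome.1.bt2 \
        Sequence/WholeGenomeFasta/genome.fa \
        Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf
    do
        if [ -s "$root/$rel" ]; then
            echo "ok:      $rel"
        else
            echo "MISSING: $rel"
            status=1
        fi
    done
    return "$status"
}
```

Here you would call it as check_genome_files /mnt/genome/Gallus_gallus/UCSC/galGal3.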

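Map all the reads to the genome with TopHat

Run TopHat on all four samples at once, giving it the left (R1) and right (R2) read files as two comma-separated lists: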

cd /mnt/work
tophat -p 4 \
    -o tophat_all \
    /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
    female_repl1_R1.qc.fq.gz,male_repl1_R1.qc.fq.gz,female_repl2_R1.qc.fq.gz,male_repl2_R1.qc.fq.gz \
    female_repl1_R2.qc.fq.gz,male_repl1_R2.qc.fq.gz,female_repl2_R2.qc.fq.gz,male_repl2_R2.qc.fq.gz
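
Typing those comma-separated lists by hand gets error-prone as the number of samples grows. One way to build them, sketched here with a hypothetical join_by_comma helper, is to let the shell expand the file names and then join them:

```shell
# Hypothetical helper: join all arguments into one comma-separated string.
join_by_comma() {
    ( IFS=,; echo "$*" )
}

# Usage sketch (not run here; file names follow the pattern used above):
#   left=$(join_by_comma *_R1.qc.fq.gz)
#   right=$(join_by_comma *_R2.qc.fq.gz)
#   tophat -p 4 -o tophat_all \
#       /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
#       "$left" "$right"
```

The shell sorts the expanded names, so the samples come out in a different order than in the command above. That is fine as long as the two lists are in the same order as each other, so that each R1 file lines up with its R2 partner.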

Questions:

What are all these parameters?!
How do we pick the transcriptome/genome?
Why is it so slow?
What is different about mapping RNAseq reads vs mapping genomic reads?

Evaluating the mapping

Check out the details:

less tophat_all/align_summary.txt
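
If you only want the headline numbers, you can pull them out of the summary. This sketch assumes the file contains lines that mention a "rate" (for example an overall read mapping rate and a concordant pair alignment rate), as TopHat 2 summaries usually do; show_mapping_rates is a hypothetical helper:

```shell
# Hypothetical helper: print only the lines of a TopHat summary file
# that mention a rate. The line wording is an assumption about the format.
show_mapping_rates() {
    grep -i 'rate' "$1"
}
```

Here you would run show_mapping_rates tophat_all/align_summary.txt.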

Build a new transcriptome (“ab initio”) from the combined reads using Cufflinks

Now that we’ve mapped the reads, let’s put them all together into exons and gene models:

cufflinks -o cuff_all tophat_all/accepted_hits.bam
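
To get a first feel for what Cufflinks produced, you can count the feature types in the resulting GTF file. The sketch below relies only on the standard GTF layout, where column 3 holds the feature type (Cufflinks writes "transcript" and "exon" records); count_features is a hypothetical helper:

```shell
# Hypothetical helper: count how often each feature type (GTF column 3)
# occurs, skipping comment lines and lines with fewer than 9 columns.
count_features() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (f in n) print f "\t" n[f] }' "$1" | sort
}
```

Here you would run count_features cuff_all/transcripts.gtf.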

Questions:

What exactly is Cufflinks doing?

Merge the new transcriptome with the existing reference transcriptome

We already have some decent gene models; let’s merge our new and the old ones:

ls -1 cuff_all/transcripts.gtf > cuff_list.txt

cuffmerge -g /mnt/genome/Gallus_gallus/UCSC/galGal3/Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf \
    -o cuffmerge_all \
    -s /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
    cuff_list.txt

Do some cleanup:

curl -O http://2015-mar-semimodel.readthedocs.org/en/latest/_static/remove-nostrand.py
python remove-nostrand.py cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf
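
We haven’t reproduced remove-nostrand.py here. As a rough illustration of this kind of cleanup, the sketch below assumes the script does nothing more than drop records whose strand column (GTF column 7) is "."; that is our guess from the name, not something confirmed on this page, and drop_unstranded is a hypothetical helper:

```shell
# Hypothetical helper: print only the GTF records whose strand (column 7)
# is not ".". This is an assumed stand-in, not the actual script.
drop_unstranded() {
    awk -F '\t' '$7 != "."' "$1"
}
```

For example, drop_unstranded cuffmerge_all/merged.gtf | wc -l shows how many records would survive such a filter.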

Questions:

Why do you want to merge?
Why would you have a list of more than one thing in cuff_list.txt?
Come to think of it, why aren’t you (re)mapping all your reads every time?
What’s with the ‘remove nostrand’ script?

Extracting your new transcriptome sequences

To get a look at the actual DNA sequences, do:

gffread -w cuffmerge_all.fa \
        -g /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
        cuffmerge_all/nostrand.gtf
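
A quick sanity check is to compare the number of distinct transcripts in the GTF file with the number of sequences in the FASTA file. The two helpers below are hypothetical and rely only on the file formats (transcript_id attributes in GTF, ">" header lines in FASTA). We would expect the two counts to be close; a big gap is worth investigating:

```shell
# Hypothetical helper: count distinct transcript_id values in a GTF file.
count_gtf_transcripts() {
    grep -o 'transcript_id "[^"]*"' "$1" | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: count sequence records (header lines) in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

Here you would run count_gtf_transcripts cuffmerge_all/nostrand.gtf and count_fasta_records cuffmerge_all.fa.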

Questions:

What’s the difference between a GTF file and a FASTA (.fa) file?

Checking out your new transcriptome

Take a look at the top of your FASTA file:

head -30 cuffmerge_all.fa
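
Beyond eyeballing the first few records, a table of sequence lengths is often handy. This is a minimal sketch with a hypothetical fasta_lengths helper; it takes the first word of each header line as the sequence name and adds up the lengths of the lines that follow:

```shell
# Hypothetical helper: print "name<TAB>length" for each FASTA record.
# The name is the first whitespace-delimited word after ">".
fasta_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}
```

For example, fasta_lengths cuffmerge_all.fa | sort -k2,2nr | head lists the longest transcripts first.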

Head on over to the chicken genome browser and try BLATing the sequence!

You can also get statistics for all of the different gene list files (.gtf) by doing:

cuffcompare cuffmerge_all/nostrand.gtf \
      /mnt/genome/Gallus_gallus/UCSC/galGal3/Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf \
      cuffmerge_all/merged.gtf -o compare

and then looking at compare.stats:

less compare.stats

Next: Mapping reads to the transcriptome with TopHat

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
