Building a new reference transcriptome

The distinguishing feature of (what I call) semi-model organisms is that while they may have a decent genome reference, their transcriptome annotation is poor. There can be several reasons for this, but generally it boils down to lack of resources and/or attention – it takes a lot of effort to build a high quality transcriptome!

For this purpose, we’ve already installed the chicken reference genome set on the HPC (as part of the data you loaded at the beginning). In this case we’ve loaded in the Illumina iGenomes project into the RNAseq-semimodel location.

See the TopHat and Cufflinks paper:

Grab the genome

We will need the chicken genome! We’ll grab the UCSC galGal3 genome from the Illumina iGenomes project:

mkdir /mnt/genome
cd /mnt/genome
curl -O -L

tar xzvf Gallus_gallus_UCSC_galGal3.tar.gz

Map all the reads to the genome with TopHat


cd /mnt/work
tophat -p 4 \
    -o tophat_all \
    /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
 female_repl1_R1.qc.fq.gz,male_repl1_R1.qc.fq.gz,female_repl2_R1.qc.fq.gz,male_repl2_R1.qc.fq.gz \


  • What are all these parameters?!
  • How do we pick the transcriptome/genome?
  • Why is it so slow?
  • What is different about mapping RNAseq reads vs mapping genomic reads?


Evaluating the mapping

Check out the details:

less tophat_all/align_summary.txt

Merge the new transcriptome with the existing reference transcriptome

We already have some decent gene models; let’s merge our new and the old ones:

ls -1 cuff_all/transcripts.gtf > cuff_list.txt

cuffmerge -g /mnt/genome/Gallus_gallus/UCSC/galGal3/Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf \
    -o cuffmerge_all \
    -s /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \

Do some cleanup:

curl -O
python cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf


  • why do you want to merge?
  • why would you have a list of more than one thing in list.txt?
  • come to think of it, why aren’t you (re)mapping all your reads every time?
  • what’s with the ‘remove nostrand’ script?

Extracting your new transcriptome sequences

To get a look at the actual DNA sequences, do:

gffread -w cuffmerge_all.fa \
        -g /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \


  • What’s the difference between a GTF file and the FA file?

Checking out your new transcriptome

Take a look at the top of your FASTA file:

head -30 cuffmerge_all.fa

Head on over to the chicken genome browser and try BLATing the sequence!

You can also get statistics for all of the different gene list files (.gtf) by doing:

cuffcompare cuffmerge_all/nostrand.gtf \
      /mnt/genome/Gallus_gallus/UCSC/galGal3/Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf \
      cuffmerge_all/merged.gtf -o compare

and then looking at compare.stats:

less compare.stats

Next: Mapping reads to the transcriptome with TopHat

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.