The distinguishing feature of (what I call) semi-model organisms is that while they may have a decent genome reference, their transcriptome annotation is poor. There can be several reasons for this, but generally it boils down to lack of resources and/or attention: it takes a lot of effort to build a high-quality transcriptome!
For this tutorial, we’ve already installed the chicken reference genome set on the HPC (as part of the data you loaded at the beginning). Specifically, the Illumina iGenomes bundle has been placed in the RNAseq-semimodel location.
See the TopHat and Cufflinks paper:
Download and install the TopHat and Cufflinks software:
cd ~/
curl -O http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz
tar xzf tophat-2.0.13.Linux_x86_64.tar.gz
curl -O http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
tar xzf cufflinks-2.2.1.Linux_x86_64.tar.gz
echo 'export PATH=$PATH:$HOME/tophat-2.0.13.Linux_x86_64:$HOME/cufflinks-2.2.1.Linux_x86_64' >> ~/.bashrc
export PATH=$PATH:$HOME/tophat-2.0.13.Linux_x86_64:$HOME/cufflinks-2.2.1.Linux_x86_64
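Before moving on, it can be worth confirming that the shell actually finds the newly installed programs. The following is a minimal sketch; `check_tools` is a helper name made up for this example and is not part of TopHat or Cufflinks.

```shell
# Report whether each named program can be found on the current PATH.
# Returns non-zero if any of them is missing.
# (check_tools is a hypothetical helper written for this tutorial.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the programs this tutorial calls later on:
#   check_tools tophat cufflinks cuffmerge cuffcompare gffread
```

TopHat 2 also relies on Bowtie 2, which this section does not install, so you may want to add `bowtie2` to that list.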
We will need the chicken genome! We’ll grab the UCSC galGal3 genome from the Illumina iGenomes project:
mkdir -p /mnt/genome
cd /mnt/genome
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/Gallus_gallus_UCSC_galGal3.tar.gz
tar xzvf Gallus_gallus_UCSC_galGal3.tar.gz
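A truncated download or a partial unpack tends to show up only later as a confusing TopHat error, so a quick check of the Bowtie 2 index files may save some time. This sketch assumes a standard (non-large) Bowtie 2 index, which consists of six `.bt2` files sharing one prefix; `check_bt2_index` is a hypothetical helper, not a Bowtie command.

```shell
# Check that all six files of a standard Bowtie 2 index exist and are
# non-empty for the given index prefix. Returns non-zero otherwise.
# (check_bt2_index is a hypothetical helper written for this tutorial.)
check_bt2_index() {
    prefix=$1
    status=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "$prefix.$suffix" ]; then
            echo "missing or empty: $prefix.$suffix"
            status=1
        fi
    done
    return "$status"
}

# Usage, with the index prefix used by the TopHat command in this tutorial:
#   check_bt2_index /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome
```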
Map the reads to the genome with TopHat:
cd /mnt/work
tophat -p 4 \
-o tophat_all \
/mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
female_repl1_R1.qc.fq.gz,male_repl1_R1.qc.fq.gz,female_repl2_R1.qc.fq.gz,male_repl2_R1.qc.fq.gz \
female_repl1_R2.qc.fq.gz,male_repl1_R2.qc.fq.gz,female_repl2_R2.qc.fq.gz,male_repl2_R2.qc.fq.gz
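TopHat takes the R1 files as one comma-separated list and the R2 files as a second one, and the two lists have to name the samples in the same order. Typing them by hand is error-prone, so here is one way the lists could be generated from the sample names. The `read_list` function is a made-up helper; the file naming pattern is the one used in the command above.

```shell
# Print the comma-separated list of read files for one mate (R1 or R2),
# visiting the samples in the order given so that both lists line up.
# (read_list is a hypothetical helper written for this tutorial.)
read_list() {
    mate=$1
    shift
    list=""
    for sample in "$@"; do
        # Prepend a comma only when the list already has an entry.
        list="${list:+$list,}${sample}_${mate}.qc.fq.gz"
    done
    echo "$list"
}

# Usage with the sample names from the command above
# ($samples is left unquoted on purpose so it splits into words):
#   samples="female_repl1 male_repl1 female_repl2 male_repl2"
#   tophat -p 4 -o tophat_all \
#       /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
#       "$(read_list R1 $samples)" "$(read_list R2 $samples)"
```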
Questions:
Links:
Now that we’ve mapped the reads, let’s put them all together into exons and gene models:
cufflinks -o cuff_all tophat_all/accepted_hits.bam
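To get a rough feel for what Cufflinks assembled, you can count the records in its output GTF. The sketch below assumes the usual Cufflinks layout, where column 3 is either `transcript` or `exon` and column 9 carries a `gene_id "..."` attribute; `gtf_summary` is a helper name invented for this example.

```shell
# Summarise a GTF file: number of transcript records, number of exon
# records, and number of distinct gene_id values.
# (gtf_summary is a hypothetical helper written for this tutorial.)
gtf_summary() {
    awk -F '\t' '
        $3 == "transcript" { transcripts++ }
        $3 == "exon"       { exons++ }
        # Remember each distinct gene_id "..." attribute seen in column 9.
        match($9, /gene_id "[^"]+"/) { genes[substr($9, RSTART, RLENGTH)] = 1 }
        END {
            n = 0
            for (g in genes) n++
            printf "transcripts=%d exons=%d genes=%d\n", transcripts, exons, n
        }
    ' "$1"
}

# Usage:
#   gtf_summary cuff_all/transcripts.gtf
```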
Questions:
We already have some decent gene models; let’s merge our new and the old ones:
ls -1 cuff_all/transcripts.gtf > cuff_list.txt
cuffmerge -g /mnt/genome/Gallus_gallus/UCSC/galGal3/Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf \
-o cuffmerge_all \
-s /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
cuff_list.txt
Clean up the merged gene models with the remove-nostrand.py script:
curl -O http://2015-mar-semimodel.readthedocs.org/en/latest/_static/remove-nostrand.py
python remove-nostrand.py cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf
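If you are curious how many records the cleanup affects, you can inspect the strand column (column 7 of a GTF) yourself. The contents of remove-nostrand.py are not reproduced here; judging only by its name, it presumably drops records with no strand assigned. The sketch below is therefore an assumed stand-in for illustration, not the script itself, and both function names are made up.

```shell
# Count GTF records whose strand column (field 7) is ".", i.e. unset.
# Comment lines starting with "#" are ignored.
# (Hypothetical helper; NOT taken from remove-nostrand.py.)
count_unstranded() {
    awk -F '\t' '!/^#/ && $7 == "." { n++ } END { print n + 0 }' "$1"
}

# Print only the records whose strand is "+" or "-"; comment lines
# are passed through unchanged.
# (Assumed behaviour, for illustration only.)
keep_stranded() {
    awk -F '\t' '/^#/ || $7 == "+" || $7 == "-"' "$1"
}

# Usage:
#   count_unstranded cuffmerge_all/merged.gtf
#   count_unstranded cuffmerge_all/nostrand.gtf
```

Comparing the two counts gives a quick sanity check that the cleaned file no longer contains unstranded records.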
Questions:
To look at the actual DNA sequences of the transcripts, extract them with gffread:
gffread -w cuffmerge_all.fa \
-g /mnt/genome/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
cuffmerge_all/nostrand.gtf
Questions:
Take a look at the top of your FASTA file:
head -30 cuffmerge_all.fa
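Beyond eyeballing the first few records, a quick tally of how many sequences the file holds and how many bases they add up to can be useful. This is a small sketch using plain awk; `fasta_stats` is a hypothetical helper name.

```shell
# Count the sequences (header lines starting with ">") and the total
# number of sequence characters in a FASTA file.
# (fasta_stats is a hypothetical helper written for this tutorial.)
fasta_stats() {
    awk '
        /^>/ { n++; next }          # header line: one more sequence
        { bases += length($0) }     # sequence line: add its length
        END { printf "sequences=%d bases=%d\n", n, bases }
    ' "$1"
}

# Usage:
#   fasta_stats cuffmerge_all.fa
```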
Head on over to the chicken genome browser and try BLATing the sequence!
You can also get statistics for all of the different gene model files (.gtf) by running cuffcompare:
cuffcompare cuffmerge_all/nostrand.gtf \
/mnt/genome/Gallus_gallus/UCSC/galGal3/Annotation/Archives/archive-2014-05-23-16-03-55/Genes/genes.gtf \
cuffmerge_all/merged.gtf -o compare
and then looking at compare.stats:
less compare.stats