Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Where is transcriptome_hg38_gencodev23.fasta from? #170

Closed
wangshun1121 opened this issue Dec 13, 2018 · 3 comments
Closed

Where is transcriptome_hg38_gencodev23.fasta from? #170

wangshun1121 opened this issue Dec 13, 2018 · 3 comments

Comments

@wangshun1121
Copy link

wangshun1121 commented Dec 13, 2018

Dear Dr. Vivian:

In your Method introduction, you mentioned Kallisto index was made using following command:

sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta

Now, I am trying to make similar index using salmon. But I wonder where istranscriptome_hg38_gencodev23.fasta from? Is it gencode.v23.transcripts.fa.gz , or gencode.v23.pc_transcripts.fa.gz , or otherwise you made by yourself with other tools such as gffread?

This issue is essential, since most RNA seq data sets on TCGA are from ployA libraries. The reference file should include sequences with polyA such as coding genes, immunoglobulins, and even some pseudogenes, while sequences without polyA such as lncRNAs should be removed. gencode.v23.pc_transcripts.fa, in my opinion, is suitable for reference, rather than full transcript sets.

There is no transcriptome_hg38_gencodev23.fasta on GENCODE's ftp, then I wonder where is this file from?

@wangshun1121
Copy link
Author

gencode.v23.transcripts.fa.gz: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.transcripts.fa.gz

gencode.v23.pc_transcripts.fa.gz: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.pc_transcripts.fa.gz

@wangshun1121
Copy link
Author

You mentioned that:

Gencode v23 annotations were downloaded from Gencode. Comprehensive gene annotation (Regions=CHR) GTF was used to generate all additional reference input data.

However, using your Kallisto index, I tried to quantile transcript expression data. Here I'd like to share some transcripts are in Kallisto index, but not included in CHR comprehensive gene annotation:

ENST00000383212.4
ENST00000211372.9
ENST00000383108.6
and so on...

These transcripts can be seen in gencode.v23.chr_patch_hapl_scaff.annotation.gtf.gz

There are totally 15208 transcripts included in Kallisto index, but not included in gene annotation file.

@jvivian
Copy link
Collaborator

jvivian commented Dec 13, 2018

Dear @wangshun1121 ,

Thank you for your inquiry. I apologize this was not properly explained in the paper — I'll add a note in the documentation which reflects the following reply I made in an email a few months ago:

But there are also 15,208 extra transcripts in Xena set of identifiers while not in gencode v23 which I don’t understand how it is possible.

"When I was preparing for the large-scale compute I had a request to use the (comprehensive) superset of Gencode v23 which contains additional transcripts (second row) — in hindsight, I likely wouldn't have done this because it means the transcript results between RSEM and Kallisto are not 1:1 and it's confusing in general. To get the transcriptome to work with Kallisto I also had to remove the overlapping genes from the PAR locus (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415). The reference genome I used also had the alt alleles removed for mapping reasons.

If you check the transcripts that don't match the Gencode v23 CHR GTF file, they should almost all be alt alleles. "

There is no transcriptome_hg38_gencodev23.fasta on GENCODE's ftp, then I wonder where is this file from?

I made it myself with a BedToFasta tool that I've now replaced with TopHat2's "GTFtoFasta" tool, which you can see in the workflow's input generation here

I hope this answers your question and I apolgize for the confusion caused. Let me know if you have any other questions I can help with.

@jvivian jvivian closed this as completed Dec 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants