-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Where is transcriptome_hg38_gencodev23.fasta from? #170
Comments
gencode.v23.transcripts.fa.gz: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.transcripts.fa.gz gencode.v23.pc_transcripts.fa.gz: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.pc_transcripts.fa.gz |
You mentioned that:
However, using your Kallisto index, I tried to quantile transcript expression data. Here I'd like to share some transcripts are in Kallisto index, but not included in CHR comprehensive gene annotation:
These transcripts can be seen in There are totally 15208 transcripts included in Kallisto index, but not included in gene annotation file. |
Dear @wangshun1121 , Thank you for your inquiry. I apologize this was not properly explained in the paper — I'll add a note in the documentation which reflects the following reply I made in an email a few months ago:
"When I was preparing for the large-scale compute I had a request to use the (comprehensive) superset of Gencode v23 which contains additional transcripts (second row) — in hindsight, I likely wouldn't have done this because it means the transcript results between RSEM and Kallisto are not 1:1 and it's confusing in general. To get the transcriptome to work with Kallisto I also had to remove the overlapping genes from the PAR locus (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415). The reference genome I used also had the alt alleles removed for mapping reasons. If you check the transcripts that don't match the Gencode v23 CHR GTF file, they should almost all be alt alleles. "
I made it myself with a BedToFasta tool that I've now replaced with TopHat2's "GTFtoFasta" tool, which you can see in the workflow's input generation here I hope this answers your question and I apolgize for the confusion caused. Let me know if you have any other questions I can help with. |
Dear Dr. Vivian:
In your Method introduction, you mentioned Kallisto index was made using following command:
Now, I am trying to make similar index using salmon. But I wonder where is
transcriptome_hg38_gencodev23.fasta
from? Is itgencode.v23.transcripts.fa.gz
, orgencode.v23.pc_transcripts.fa.gz
, or otherwise you made by yourself with other tools such as gffread?This issue is essential, since most RNA seq data sets on TCGA are from ployA libraries. The reference file should include sequences with polyA such as coding genes, immunoglobulins, and even some pseudogenes, while sequences without polyA such as lncRNAs should be removed.
gencode.v23.pc_transcripts.fa
, in my opinion, is suitable for reference, rather than full transcript sets.There is no
transcriptome_hg38_gencodev23.fasta
on GENCODE's ftp, then I wonder where is this file from?The text was updated successfully, but these errors were encountered: