I've given your suggestion some thought. I think it would be tricky to keep track of the global coordinates during the variant introduction phase, because variants are generated more or less randomly and placed into the reference genome rather than in order from the leftmost (5') to the rightmost (3') position. That's why the values in the variant_id column of the "*.refseq2simseq.map.txt" file look random: each ID reflects the order in which the variant was generated during the simulation phase.
It seems to me that the most direct way to accomplish what you want would be to write a standalone script that regenerates the global coordinates by reconstructing the ref-to-sim sequence alignment from the "*.refseq2simseq.map.txt" file. But even with this extra script, what you get is still a pairwise alignment, which would need to be adjusted again if you wanted to align it further with another simulated genome (e.g., genome C in the example you mentioned). This would be particularly tricky when dealing with insertions.
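To make the idea of such a standalone script more concrete, here is a minimal sketch in Python. It does not parse the real "*.refseq2simseq.map.txt" file: the `(ref_start, ref_allele, sim_allele)` tuples, the function names, and the VCF-like convention (the reference allele at a 1-based start position is replaced by the simulated allele) are all assumptions made for illustration. It also only handles non-overlapping SNPs and indels, not inversions or translocations.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical record: (1-based ref start, ref allele, sim allele).
Variant = Tuple[int, str, str]
# One output row: (ref_pos, sim_pos); None marks a gap on that side.
Row = Tuple[Optional[int], Optional[int]]


def global_coordinate_map(ref_len: int, variants: Iterable[Variant]) -> Iterator[Row]:
    """Yield one row per alignment column between reference and simulated genome."""
    ref_pos, sim_pos = 1, 1
    # Variants may arrive in generation order, so sort them by position first.
    for start, ref_allele, sim_allele in sorted(variants):
        if start < ref_pos:
            raise ValueError(f"overlapping variant at ref position {start}")
        # Unchanged bases in front of the variant map one-to-one.
        while ref_pos < start:
            yield ref_pos, sim_pos
            ref_pos += 1
            sim_pos += 1
        # Bases covered by both alleles (SNPs, anchor bases) still pair up.
        shared = min(len(ref_allele), len(sim_allele))
        for _ in range(shared):
            yield ref_pos, sim_pos
            ref_pos += 1
            sim_pos += 1
        # Extra reference bases were deleted in the simulated genome.
        for _ in range(len(ref_allele) - shared):
            yield ref_pos, None
            ref_pos += 1
        # Extra simulated bases were inserted relative to the reference.
        for _ in range(len(sim_allele) - shared):
            yield None, sim_pos
            sim_pos += 1
    # Unchanged tail after the last variant.
    while ref_pos <= ref_len:
        yield ref_pos, sim_pos
        ref_pos += 1
        sim_pos += 1


def format_rows(rows: Iterable[Row], ref_chr: str = "Ref", sim_chr: str = "Sim") -> List[str]:
    """Render rows as 'ref_chr ref_pos sim_chr sim_pos', using '-' for gaps."""
    def show(pos: Optional[int]) -> str:
        return "-" if pos is None else str(pos)
    return [f"{ref_chr} {show(r)} {sim_chr} {show(s)}" for r, s in rows]


if __name__ == "__main__":
    # A 3-bp reference with a 1-bp deletion at position 2 and a 1-bp insertion after position 3.
    toy_variants = [(3, "G", "GT"), (1, "AC", "A")]
    print("\n".join(format_rows(global_coordinate_map(3, toy_variants))))
```

On the toy input this produces the four-column layout requested in this issue. The result is still only a pairwise map; chaining A to B to C would need a further composition step that keeps track of the gap columns.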
Therefore, it seems to me that the best solution is simply to use existing whole-genome alignment tools (e.g., Cactus or fsa) to generate the multi-genome alignment directly from all your simulated genomes. The noise introduced by the sequence alignment process (e.g., equivalent choices of gap placement) should not affect the result too much. Also, variant normalization (https://genome.sph.umich.edu/wiki/Variant_Normalization) will always be needed before comparing the simulated and called results, because different variant calling tools can represent the same variant in equivalent but different ways.
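As an illustration of why normalization matters, below is a small Python sketch of the usual trim-and-left-align idea for a single biallelic variant. The function name `normalize_variant` and its signature are hypothetical, the reference is assumed to be an uppercase string, and positions are 1-based. In practice an established tool such as `bcftools norm` or `vt normalize` is the safer choice; this is only meant to show that several representations of the same deletion collapse to one.

```python
from typing import Tuple


def normalize_variant(ref_seq: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Return a left-aligned, minimally padded (pos, ref, alt) for one biallelic variant."""
    if ref == alt:
        raise ValueError("ref and alt alleles are identical")
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            # Both alleles end with the same base: drop it.
            ref, alt = ref[:-1], alt[:-1]
        elif (not ref or not alt) and pos > 1:
            # An allele became empty: extend both by one base to the left.
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
        else:
            break
    if not ref or not alt:
        # Variant sits at the first base, so pad on the right instead.
        base = ref_seq[pos - 1 + len(ref)]
        ref, alt = ref + base, alt + base
    # Finally trim shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    reference = "GGGCACACACAGGG"
    # Three ways of writing the same 2-bp "CA" deletion inside the CA repeat.
    for variant in [(8, "CA", ""), (5, "ACA", "A"), (2, "GGCA", "GG")]:
        print(variant, "->", normalize_variant(reference, *variant))
```

All three inputs describe the same simulated genome, so a comparison between simulated and called variants is only meaningful once both sides have been brought to the same representation.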
Hi Jia-Xing,
Thank you for developing this amazing tool!
I’m currently using it to generate mutant bacterial genomes with both SNPs and insertions, as I’m developing an SNP calling tool and need to analyze the genomic coordinates between the reference and mutant genomes. I noticed there’s a file named "*.refseq2simseq.map.txt", but it only provides the coordinates of SNPs and indels, rather than the global coordinates.
Would it be possible for the tool to output all genome coordinates between the input and simulated genomes? For example, something like:
ref_chr ref_pos sim_chr sim_pos
Ref 1 Sim 1
Ref 2 Sim -
Ref 3 Sim 2
Ref - Sim 3
...
("-" here marks a gap, i.e., a position that exists in only one of the two genomes because of an insertion or deletion)
This feature would be incredibly helpful for establishing the ground truth for SNP calling, especially when simulating bacterial genomes along a phylogenetic tree (e.g., A is the SimuG input for B, and B for C, but I want to align C back to A).
Thank you very much for considering this request!