Global coordinates between input genomes and simulated genomes #19

liaoherui · 2024-12-12T19:35:33Z

Hi Jia-Xing,

Thank you for developing this amazing tool!

I’m currently using it to generate mutant bacterial genomes with both SNPs and insertions, as I’m developing an SNP calling tool and need to analyze the genomic coordinates between the reference and mutant genomes. I noticed there’s a file named "*.refseq2simseq.map.txt", but it only provides the coordinates of SNPs and indels, rather than the global coordinates.

Would it be possible for the tool to output all genome coordinates between the input and simulated genomes? For example, something like:

ref_chr ref_pos sim_chr sim_pos
Ref 1 Sim 1
Ref 2 Sim -
Ref 3 Sim 2
Ref - Sim 3
...

("-" here refers to the insertion)

This feature would be incredibly helpful about having the ground truth for SNP calling, especially when simulating bacterial genomes from a phylogenetic tree (e.g., A is the input of SimuG for B, B for C, but I want to align C back to A).

Thank you very much for considering this request!

yjx1217 · 2024-12-13T12:34:47Z

Hi @liaoherui

Thanks for trying out simuG for your project!

I've put some thoughts on your suggestions. I felt it will be tricky to keep tracking the global coordinates during the variant introduction phase as variants were more or less randomly generated and put into the reference genome rather than placed in order from leftmost (5') to the rightmost (3'). That's why in the "*.refseq2simseq.map.txt" file, the variant_id in the variant_id column seems random since the id actually reflect the generation order of these variants during the simulation phase.

It seems to me that the best way to accomplish what you want would be to write a standalone script to re-generate the global coordinates by reconstructing the ref-to-sim sequence alignment based on the "*.refseq2simseq.map.txt" file. But even with this extra script, what you will get is still a pairwise alignment, which will need to be adjusted again say you want to further align it with another simulated genome (e.g., genome C for the example that you mentioned). This will be particularly tricky when dealing with insertions.

Therefore, it seems to me that the best solution will be just using existing whole-genome alignment tools (e.g. Cactus or fsa) to generated the multi-genome alignment directly from all your simulated genomes. The noise introduced by the sequence alignment process (e.g., equivalent gap insertion choices) should not affect the result too much. Also, variant normalization ([https://genome.sph.umich.edu/wiki/Variant_Normalization]) will always be needed to account for equivalent ways of variant representations by different variant calling tools before comparing the simulated and called results.

Best,
Jia-Xing

liaoherui · 2024-12-13T21:26:06Z

Dear Jia-Xing,

Thanks a lot for the prompt reply!!! I will try the solutions you suggested. Again, really appreciate your precious time and valueable suggestions!

Best,
Herui

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Global coordinates between input genomes and simulated genomes #19

Global coordinates between input genomes and simulated genomes #19

liaoherui commented Dec 12, 2024 •

edited

Loading

yjx1217 commented Dec 13, 2024

liaoherui commented Dec 13, 2024

Global coordinates between input genomes and simulated genomes #19

Global coordinates between input genomes and simulated genomes #19

Comments

liaoherui commented Dec 12, 2024 • edited Loading

yjx1217 commented Dec 13, 2024

liaoherui commented Dec 13, 2024

liaoherui commented Dec 12, 2024 •

edited

Loading