Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global coordinates between input genomes and simulated genomes #19

Open
liaoherui opened this issue Dec 12, 2024 · 2 comments
Open

Global coordinates between input genomes and simulated genomes #19

liaoherui opened this issue Dec 12, 2024 · 2 comments

Comments

@liaoherui
Copy link

liaoherui commented Dec 12, 2024

Hi Jia-Xing,

Thank you for developing this amazing tool!

I’m currently using it to generate mutant bacterial genomes with both SNPs and insertions, as I’m developing an SNP calling tool and need to analyze the genomic coordinates between the reference and mutant genomes. I noticed there’s a file named "*.refseq2simseq.map.txt", but it only provides the coordinates of SNPs and indels, rather than the global coordinates.

Would it be possible for the tool to output all genome coordinates between the input and simulated genomes? For example, something like:

ref_chr ref_pos sim_chr sim_pos
Ref 1 Sim 1
Ref 2 Sim -
Ref 3 Sim 2
Ref - Sim 3
...

("-" here refers to the insertion)

This feature would be incredibly helpful about having the ground truth for SNP calling, especially when simulating bacterial genomes from a phylogenetic tree (e.g., A is the input of SimuG for B, B for C, but I want to align C back to A).

Thank you very much for considering this request!

@yjx1217
Copy link
Owner

yjx1217 commented Dec 13, 2024

Hi @liaoherui

Thanks for trying out simuG for your project!

I've put some thoughts on your suggestions. I felt it will be tricky to keep tracking the global coordinates during the variant introduction phase as variants were more or less randomly generated and put into the reference genome rather than placed in order from leftmost (5') to the rightmost (3'). That's why in the "*.refseq2simseq.map.txt" file, the variant_id in the variant_id column seems random since the id actually reflect the generation order of these variants during the simulation phase.

It seems to me that the best way to accomplish what you want would be to write a standalone script to re-generate the global coordinates by reconstructing the ref-to-sim sequence alignment based on the "*.refseq2simseq.map.txt" file. But even with this extra script, what you will get is still a pairwise alignment, which will need to be adjusted again say you want to further align it with another simulated genome (e.g., genome C for the example that you mentioned). This will be particularly tricky when dealing with insertions.

Therefore, it seems to me that the best solution will be just using existing whole-genome alignment tools (e.g. Cactus or fsa) to generated the multi-genome alignment directly from all your simulated genomes. The noise introduced by the sequence alignment process (e.g., equivalent gap insertion choices) should not affect the result too much. Also, variant normalization ([https://genome.sph.umich.edu/wiki/Variant_Normalization]) will always be needed to account for equivalent ways of variant representations by different variant calling tools before comparing the simulated and called results.

Best,
Jia-Xing

@liaoherui
Copy link
Author

Dear Jia-Xing,

Thanks a lot for the prompt reply!!! I will try the solutions you suggested. Again, really appreciate your precious time and valueable suggestions!

Best,
Herui

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants