I am currently running aligned_bam_to_cpg_scores (2.3.2) on HiFi long-read data from multiple individuals.
For each sample, I have created a combined.bed file, and both --pileup-mode and --modsites-mode are set to their default values (model, denovo).
Issue Encountered:
The number of CpG sites (the number of rows in each sample's combined.bed file) varies between samples.
On average, each individual has about 28.4 million CpG sites.
However, one specific sample has only about 27.2 million CpG sites, a difference of over one million.
The reasons I have considered for this are as follows:
1. Variant differences
1-1. Variants present in each sample may cause CpG sites to disappear or appear.
=> However, since CpG site information is still output at heterozygous SNPs (where one allele has a CpG and the other does not), I believe that the number of CpG sites per sample is not significantly affected by SNPs.
1-2. CpG sites cannot be output for insertion variants. (Is CpG status predicted at insertion variants? #40 (comment))
2. Issue with read data amount
The sample with fewer identified CpG sites had less total read data than the other samples (10×).
=> Is there a minimum depth required to reliably obtain CpG site information?
Or could there be other reasons for the differences in CpG site counts?
Additionally, I have a question regarding the modification score column in the combined.bed file.
Is the modification score meant to represent methylation probability?
I would like to know if the column is named "modification" because it might include information on modifications other than methylation.
Best regards,
Seoyeon Kim
(1-2) Your reduced CpG count is most likely caused by coverage. The model mode requires 4x coverage or it will not report the CpG at all. You could verify this by plotting, for this sample, the coverage at the CpGs that are reported in your other samples but missing here; I suspect the distribution will have a peak <= 4x.
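
The check suggested above could be sketched roughly as follows. This is a minimal illustration, not part of the tool: it assumes the first two columns of combined.bed are chromosome and 0-based start (the usual BED convention), and that per-position depth for the low-coverage sample is available as `samtools depth`-style lines (chromosome, 1-based position, depth). The function names `bed_sites` and `depth_histogram` are hypothetical helpers. Raw alignment depth may also differ from the coverage the tool counts internally, so treat the result as an approximation.

```python
from collections import Counter


def bed_sites(lines):
    """Collect (chrom, start) keys from BED-like lines (0-based start assumed)."""
    sites = set()
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start = line.split()[:2]
        sites.add((chrom, int(start)))
    return sites


def depth_histogram(missing, depth_lines):
    """Histogram of depth at the missing sites.

    depth_lines are assumed to look like `samtools depth` output:
    chrom, 1-based position, depth. Positions absent from the input
    are counted as depth 0.
    """
    depth = {}
    for line in depth_lines:
        if not line.strip():
            continue
        chrom, pos, d = line.split()[:3]
        key = (chrom, int(pos) - 1)  # convert to 0-based to match BED starts
        if key in missing:
            depth[key] = int(d)
    return Counter(depth.get(site, 0) for site in missing)


def missing_sites(reference_bed_lines, low_sample_bed_lines):
    """Sites reported in a reference sample but not in the low-count sample."""
    return bed_sites(reference_bed_lines) - bed_sites(low_sample_bed_lines)
```

In practice the line iterables would be open file handles (for example, a well-covered sample's combined.bed, the low-count sample's combined.bed, and a depth file for the low-count sample). If most of the missing sites fall below 4x in the resulting histogram, coverage would explain the gap; sites missing at higher depth would point to another cause, such as the variant effects listed in the question.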