Variation in CpG Site Counts Across Samples Using aligned_bam_to_cpg_scores #75

syeoeonn · 2025-01-08T17:43:32Z

Hello,

I am currently running aligned_bam_to_cpg_scores (2.3.2) on HiFi long-read data from multiple individuals.
For each sample, I have created a combined.bed file, with both --pileup-mode and --modsites-mode left at their default values (model and denovo, respectively).

Issue Encountered:
The number of CpG sites (the number of rows in each sample's combined.bed file) varies between samples.

On average, each sample has about 28.4 million CpG sites.
However, one specific sample has about 27.2 million CpG sites, a difference of over one million.

The reasons I have considered for this are as follows:

1. Variant differences
   1-1. Variants present in each sample may cause CpG sites to disappear or appear.
   => However, since CpG sites are still reported at heterozygous SNPs (where one allele has a CpG and the other does not), I believe that the number of CpG sites per sample is not significantly affected by SNPs.
   1-2. The limitation that CpG sites cannot be reported for insertion variants. (Is CpG status predicted at insertion variants? #40 (comment))
2. Amount of read data
   The sample with fewer identified CpG sites had less total read data than the other samples (10×).
   => Is there a minimum depth required to reliably obtain CpG site information?

Or could there be other reasons for the differences in CpG site counts?

Additionally, I have a question regarding the modification score column in the combined.bed file.
Is the modification score meant to represent methylation probability?
I would like to know if the column is named "modification" because it might include information on modifications other than methylation.

Best regards,
Seoyeon Kim

holtjma · 2025-01-08T18:30:45Z

Hi Seoyeon,

(1-2) For your reduced CpG count, it is most likely caused by coverage. The model mode requires 4x coverage or it will not report the CpG at all. You could verify this by plotting the coverage of the missing CpGs relative to your other samples, which I suspect will have a peak <= 4x.
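As a starting point for that check, here is a minimal sketch of how the missing CpGs could be listed by comparing two samples' combined.bed files. It assumes the files are tab-separated with the chromosome and start position in the first two columns and that any header lines begin with `#`; the function names are hypothetical and not part of pb-CpG-tools.

```python
import io


def load_sites(handle):
    """Return the set of (chrom, start) keys from a BED-like text stream."""
    sites = set()
    for line in handle:
        # Skip blank lines and header/comment lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start = line.split("\t")[:2]
        sites.add((chrom, int(start)))
    return sites


def missing_sites(reference_handle, query_handle):
    """Sites reported in the reference sample but absent from the query sample."""
    return sorted(load_sites(reference_handle) - load_sites(query_handle))


# Tiny in-memory demo; real use would pass open() handles to combined.bed files.
ref_bed = io.StringIO(
    "chr1\t100\t101\t85.0\nchr1\t200\t201\t10.0\nchr2\t50\t51\t99.0\n"
)
low_bed = io.StringIO("chr1\t100\t101\t80.0\n")
demo_missing = missing_sites(ref_bed, low_bed)
print(demo_missing)
```

The resulting positions could then be used to look up read depth in the low-coverage sample's BAM with whatever depth tool you prefer.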

(3) Yes, it represents the probability that a read is methylated at that site. For details on each mode's calculation, you can find more information here: https://github.com/PacificBiosciences/pb-CpG-tools?tab=readme-ov-file#output-modes-and-option-details. To my knowledge, the choice of the word "modification" was just a generic term and is not meant to indicate anything beyond CpG methylation.

Matt

syeoeonn · 2025-01-15T08:53:01Z

Hi,

I investigated the coverage of the missing CpG sites and found that 74% of them have a coverage of less than 4x.
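For anyone repeating this check, a fraction like this could be computed along the following lines. This is only a sketch, not the exact procedure used above: it assumes a three-column, tab-separated `chrom`, `pos`, `depth` table (the layout that `samtools depth` writes for a single BAM) restricted to the missing sites, and the helper names are hypothetical.

```python
def parse_depths(lines):
    """Yield the integer depth from 'chrom<TAB>pos<TAB>depth' lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield int(fields[2])


def fraction_below(depths, min_cov=4):
    """Fraction of sites whose depth is strictly below min_cov."""
    depths = list(depths)
    if not depths:
        return 0.0
    return sum(d < min_cov for d in depths) / len(depths)


# Made-up depths for eight hypothetical missing sites.
demo_lines = [
    "chr1\t101\t0", "chr1\t201\t2", "chr1\t305\t3", "chr1\t410\t4",
    "chr2\t51\t10", "chr2\t77\t1", "chr2\t90\t3", "chr3\t12\t25",
]
demo_fraction = fraction_below(parse_depths(demo_lines))
print(f"{demo_fraction:.1%} of missing sites are below 4x")
```

The threshold is strict (`< 4`) to match "less than 4x"; change the comparison if you want to include sites at exactly 4x.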

Thank you for your help!

Seoyeon Kim

holtjma · 2025-01-15T14:18:36Z

Going to close this for now, but feel free to re-open if you have any follow-up questions!

holtjma closed this as completed Jan 15, 2025
