Variation in CpG Site Counts Across Samples Using aligned_bam_to_cpg_scores #75

Closed
syeoeonn opened this issue Jan 8, 2025 · 3 comments


syeoeonn commented Jan 8, 2025

Hello,

I am currently running aligned_bam_to_cpg_scores (2.3.2) on HiFi long-read data from multiple individuals.
For each sample, I generated a combined.bed file with --pileup-mode and --modsites-mode left at their default values (model and denovo, respectively).

Issue encountered:
The number of CpG sites, measured as the number of rows in each sample's combined.bed file, varies between samples.

  • On average, each person has about 28.4 million CpG sites,
  • However, a specific sample has about 27.2 million CpG sites, showing a difference of over one million.
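For anyone reproducing this comparison, here is a minimal sketch of how the per-sample counts and the size of the gap could be computed. The function names are hypothetical and not part of pb-CpG-tools; the only assumption is that each non-empty, non-comment line of combined.bed describes one CpG site.

```python
def count_sites(lines):
    """Count data rows in a BED-like stream, skipping blank and '#' lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def relative_deficit(sample_count, cohort_mean):
    """Fraction by which one sample falls short of the cohort mean."""
    return (cohort_mean - sample_count) / cohort_mean


# Using the counts quoted above: 27.2 million sites against a mean of 28.4 million.
deficit = relative_deficit(27.2e6, 28.4e6)
print(f"{deficit:.1%}")  # about 4.2% fewer sites than average
```

With a real file, `count_sites(open("sample.combined.bed"))` would give the row count, where the file name is a placeholder.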

The reasons I have considered for this are as follows:

  1. Variant differences
    1-1. Variants present in each sample may cause CpG sites to disappear or appear.
    => However, since CpG information is still reported at heterozygous SNPs (where one allele has a CpG and the other does not), I believe SNPs do not significantly affect the number of CpG sites per sample.
    1-2. CpG sites cannot be reported at insertion variants. (Is CpG status predicted at insertion variants? #40 (comment))

  2. Amount of read data
    The sample with fewer identified CpG sites had less total read data than the other samples (10×).
    => Is there a minimum depth required to reliably obtain CpG site information?

Or could there be other reasons for the differences in CpG site counts?

  3. Additionally, I have a question regarding the modification score column in the combined.bed file.
    Does the modification score represent methylation probability?
    I would like to know whether the column is named "modification" because it might include modifications other than methylation.

Best regards,
Seoyeon Kim

Collaborator

holtjma commented Jan 8, 2025

Hi Seoyeon,

(1-2) Your reduced CpG count is most likely caused by coverage. The model mode requires 4x coverage at a site, or it will not report the CpG at all. You could verify this by plotting the coverage of the missing CpGs relative to your other samples, which I suspect will have a peak <= 4x.
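One way to start on this check is to list the sites that a well-covered sample reports but the low-coverage sample does not. The sketch below is a minimal, hypothetical helper (these function names are not part of pb-CpG-tools) and assumes only that the first two whitespace-separated columns of combined.bed are the contig name and the start coordinate, as in any BED file.

```python
import io


def load_sites(handle):
    """Return the set of (contig, start) pairs found in a BED-like stream."""
    sites = set()
    for line in handle:
        # Skip blank lines and the header-style lines some BED files carry.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start = line.split()[:2]
        sites.add((contig, int(start)))
    return sites


def missing_sites(reference_handle, query_handle):
    """Sites reported in the reference sample but absent from the query sample."""
    return load_sites(reference_handle) - load_sites(query_handle)


# Tiny in-memory example standing in for two combined.bed files.
well_covered = io.StringIO("chr1\t100\t101\t85.0\nchr1\t200\t201\t10.0\n")
low_coverage = io.StringIO("chr1\t100\t101\t80.0\n")
print(sorted(missing_sites(well_covered, low_coverage)))
```

Holding roughly 28 million tuples in a Python set takes several gigabytes of memory, so for whole-genome files a sorted-file comparison such as `bedtools intersect -v` is probably more practical. The read depth at the resulting positions then has to come from the low-coverage sample's BAM, since those sites are absent from its combined.bed.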

(3) Yes, it represents the probability that a read is methylated at that site. For details on each mode's calculation, you can find more information here: https://github.com/PacificBiosciences/pb-CpG-tools?tab=readme-ov-file#output-modes-and-option-details. To my knowledge, the choice of the word "modification" was just a generic term and is not meant to indicate anything beyond CpG methylation.

Matt

@syeoeonn
Author

Hi,

I investigated the coverage of the missing CpG sites and found that 74% of them have coverage below 4x.
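A check of this kind could be sketched as follows, assuming the per-site depth for the missing CpGs has been exported as three-column `contig position depth` text (the layout written by `samtools depth`). The helper names are hypothetical, and depth measured this way may not exactly match the coverage that aligned_bam_to_cpg_scores computes internally.

```python
def parse_depths(lines):
    """Yield the depth value from 'contig position depth' lines."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            yield int(fields[2])


def fraction_below(depths, min_coverage=4):
    """Fraction of sites whose depth is strictly below min_coverage."""
    depths = list(depths)
    if not depths:
        return 0.0
    return sum(d < min_coverage for d in depths) / len(depths)


# Made-up depths at four missing sites; three of them fall below 4x.
example = ["chr1\t101\t2", "chr1\t201\t3", "chr1\t301\t7", "chr1\t401\t0"]
print(fraction_below(parse_depths(example)))
```

The default of 4 mirrors the 4x requirement for model mode mentioned earlier in the thread.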

Thank you for your help!

Seoyeon Kim

Collaborator

holtjma commented Jan 15, 2025

Going to close this for now, but feel free to re-open if you have any follow-up questions!

@holtjma holtjma closed this as completed Jan 15, 2025