Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make bed file for reads processed by primrose #11

Open
oushujun opened this issue May 20, 2022 · 8 comments
Open

Make bed file for reads processed by primrose #11

oushujun opened this issue May 20, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@oushujun
Copy link

Hello,

I am trying to find out the methylation status of haplotype-specific variants. These variants may not be assembled in the haploid genome (no mapping) and thus I need to go back to the reads to measure relative richness. Can you help to develop some codes to convert the CpG status of each read into a bed file, so that calling methylation is entirely based on --min-passes and independent of genome assembly (no mapping needed)?

Making a bed file for all reads could be slow and big, alternatively, can it take a subset of reads (i.e., unmapped or unreliability mapped) and generate bed-formatted output?

The bed file could be like:

#read_id start end status pass
m54333U_210725_040653/7/ccs 189 190 methylated 4
m54333U_210725_040899/7/ccs 198 199 unmethylated 7
...

Thank you!
Shujun

@ctsa ctsa added the enhancement New feature or request label May 27, 2022
@ctsa
Copy link
Member

ctsa commented May 27, 2022

Thanks for describing your use case. Would you already have a BAM file with MM/ML tags for your unmapped reads? If so, it sounds like you're looking for more convenient access to the methylation data than the MM/ML tags themselves -- is this accurate?

@oushujun
Copy link
Author

Yes this is accurate. Is that an easy way to translate MM/ML tags to the bed format?

@ctsa
Copy link
Member

ctsa commented May 27, 2022

The information you want is embedded in the MM/ML tags. It's not trivial to convert it into the bed format you describe, but this would give you the most control over probability thresholds for the meth/unmath states you'd like to include. I'm not sure if or when we would be able to help out with such a converter but its helpful to capture the use case here as a first step.

@dportik
Copy link
Member

dportik commented Jun 3, 2022

Hi @oushujun,
Can you describe in more detail what this information will be used for?

It is possible to write code to dump the values per read into bed format, but the size of the output file will balloon very quickly because we are looking at sites per read rather than sites across the reference.

The output you suggested contains a binary methylated or unmethylated state, but this would be a simplification of a continuous probability score. We perform a correction in the model-based pileup to improve on the single molecule accuracy. Taking values from individual reads is less desirable. If you could describe how you intend to use these individual values, that might help us think of a better way to implement a feature like this.

@oushujun
Copy link
Author

oushujun commented Jun 4, 2022

Hello @dportik,

Thinking about uncertainties in a genome assembly, including assembly errors, heterozygosity, unassembled highly repetitive regions, etc, going back to the read level will be helpful to detect methylation status of these regions.

@oushujun oushujun closed this as completed Jun 4, 2022
@oushujun oushujun reopened this Jun 4, 2022
@oushujun
Copy link
Author

oushujun commented Jun 4, 2022

Sorry, I accidentally closed the issue.

Using binary states for methylations is a simplification. If possible you may also use a continuous probability score. We may not able to apply the model-based pileup to improve the call accuracy on read level, but it's still way "better than nothing."

The file could be very large for all reads, that's why if the script can take in a list of reads, it will alleviate this issue and make the analysis more specific.

@mrvollger
Copy link

mrvollger commented May 13, 2023

@oushujun my tool can do this (and it is pretty fast) if you are okay working outside of the pacbio toolset. See ft extract at https://github.com/fiberseq/fibertools-rs

You can work on subsets of reads by filtering the input bam file if it has an index. e.g.:

samtools view -@ 8 -u input.bam chr20:1-10000000 | ft extract -r --cpg - | bgzip -@ 8 > chr20.cpg.bed12.gz

@oushujun
Copy link
Author

@mrvollger Thank you! It looks very powerful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants