-
Notifications
You must be signed in to change notification settings - Fork 28
v5 freq ht generation #720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Create annotations directory structure for v5 - Add main frequency calculation script with differential analysis approach - Support for both gnomAD and All of Us datasets - Framework for identifying samples to remove due to relatedness/ancestry changes - Age histogram calculation functionality - Resource management and pipeline structure
…ry change detection - Add coverage data integration using AN from coverage computation - Implement ancestry change detection between v4 and v5 - Add comprehensive documentation in README.md - Enhance sample identification logic for both gnomAD and All of Us datasets - Add error handling for missing resources - Support for ancestry change frequency calculations - Improve resource management and pipeline structure
…ling and AoU processing - Refactor process_gnomad_dataset to handle consent sample withdrawals by subtracting frequencies from v4 freq table - Implement efficient AoU processing using variant_data + all-sites AN approach - Add comprehensive utility functions and checkpoints for performance - Integrate FAF, grpmax, and age histogram calculations - Add group membership resource integration for both gnomAD and AoU datasets - Include robust error handling and logging throughout pipeline
…ataset is being used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only really read through mt_hists_fields
, _prepare_consent_vds
, and _calculate_consent_frequencies
, but happy to review more if helpful. the adjustment order of adj -> sex ploidy adjustment -> homalt hotfix looks like the same as v3's, so that LGTM.
one thing I didn't realize until reviewing this PR is that we didn't adjust the quality histograms between v3 and v4 for the genomes, which makes me think we shouldn't adjust these for v5 either. maybe we should discuss this at a meeting?
) | ||
|
||
|
||
def mt_hists_fields(mt: hl.MatrixTable) -> hl.StructExpression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move this function (minus the high ab het) into the coverage/AN PR? the qual hists needs to be calculated on the dense MT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually also, it doesn't look like we adjusted the qual hists for v4 from v3:
def get_histograms(ht: hl.Table, v3_sites_ht: hl.Table) -> hl.Table: |
""" | ||
logger.info("Loading and preparing VDS for consent withdrawal samples...") | ||
|
||
vds = get_gnomad_v4_genomes_vds( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I should move the code for get_gnomad_v5_genomes_vds
out of the coverage/AN PR so you can also use it here?
consent_samples_list = consent_samples_ht.s.collect() | ||
|
||
logger.info("Filtering VDS to consent withdrawal samples...") | ||
vds = hl.vds.filter_samples(vds, consent_samples_list, keep=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consent_samples_list = consent_samples_ht.s.collect() | |
logger.info("Filtering VDS to consent withdrawal samples...") | |
vds = hl.vds.filter_samples(vds, consent_samples_list, keep=True) | |
logger.info("Filtering VDS to consent withdrawal samples...") | |
vds = hl.vds.filter_samples(vds, consent_samples_ht, keep=True) |
filter_samples also accepts a Table here which would avoid the collect
) | ||
# For genomes, fixed_homalt_model is always False since we apply v3-style correction to all samples | ||
# (following v3 and v4 genomes approach - no GATK version-based differentiation) | ||
vmt = vmt.annotate_cols(fixed_homalt_model=hl.bool(False)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this script is only going to change the gnomad genomes (and this will always be False), I vote we remove this field and update high_ab_het
to no longer expect it
) | ||
|
||
vds = hl.vds.VariantDataset(vds.reference_data, vmt) | ||
vds = vds.checkpoint(new_temp_file("consent_samples_vds", "vds")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you add this checkpoint because of the sample filtering above?
vmt = vds.variant_data | ||
vmt = vmt.annotate_rows(v4_af=v4_freq_ht[vmt.row_key].freq[0].AF) | ||
|
||
# This follows the v3/v4 genomes workflow for adj and sex adjusted genotypes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this comment also mention that the homalt hotfix is applied after this also for consistency with v3, even though that should actually happen first?
the correct order is homalt hot fix -> adjust sex ploidy -> annotate adj. if I read this right, it looks like we might have done adjust sex ploidy -> annotate adj -> homalt hot fix for v4?
ab_cutoff = 0.9 | ||
ab_expr = vmt.AD[1] / vmt.DP | ||
vmt = vmt.select_entries( | ||
"AD", | ||
"DP", | ||
"GQ", | ||
"_het_non_ref", | ||
"adj", | ||
GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype), | ||
_het_ab=ab_expr, | ||
_high_ab_het_ref=(ab_expr > ab_cutoff) & ~vmt._het_non_ref, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove these (het_ab, high_ab_het_ref) if we move the hist code into the coverage/AN PR since hom_alt_depletion_fix
will recalculate them
vmt = vmt.select_entries(
"AD",
"DP",
"GQ",
"_het_non_ref",
"adj",
GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
)
consent_freq_ht.AC[i] > 0, | ||
consent_freq_ht.AC[i] | ||
/ hl.float32(866 * 2), # consent_ans_ht[consent_freq_ht.key].AN[i], | ||
0.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need to explicitly set AF to 0.0 with this if_else? could you do something like hl.float64(consent_freq_ht.AC[i] / consent_freq_ht.AN[i])
?
return consent_freq_ht.checkpoint(new_temp_file("consent_freq", "ht")) | ||
|
||
|
||
def _subtract_consent_frequencies_and_histograms( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the histogram subtraction be moved to the coverage/AN code? should we even adjust the qual hists (https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/annotations/generate_freq_genomes.py#L1135)?
4. Calculating frequencies and age histograms for consent withdrawal samples | ||
5. Subtracting both frequencies and age histograms from v4 frequency HT | ||
6. Only overwriting fields that were actually updated in the final output | ||
7. Computing FAF, grpmax, gen_anc_faf_max, and inbreeding coefficient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also move inbreeding into the coverage/AN code?
No description provided.