Skip to content

Conversation

mike-w-wilson
Copy link
Contributor

No description provided.

- Create annotations directory structure for v5
- Add main frequency calculation script with differential analysis approach
- Support for both gnomAD and All of Us datasets
- Framework for identifying samples to remove due to relatedness/ancestry changes
- Age histogram calculation functionality
- Resource management and pipeline structure
…ry change detection

- Add coverage data integration using AN from coverage computation
- Implement ancestry change detection between v4 and v5
- Add comprehensive documentation in README.md
- Enhance sample identification logic for both gnomAD and All of Us datasets
- Add error handling for missing resources
- Support for ancestry change frequency calculations
- Improve resource management and pipeline structure
…ling and AoU processing

- Refactor process_gnomad_dataset to handle consent sample withdrawals by subtracting frequencies from v4 freq table
- Implement efficient AoU processing using variant_data + all-sites AN approach
- Add comprehensive utility functions and checkpoints for performance
- Integrate FAF, grpmax, and age histogram calculations
- Add group membership resource integration for both gnomAD and AoU datasets
- Include robust error handling and logging throughout pipeline
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only really read through mt_hists_fields, _prepare_consent_vds, and _calculate_consent_frequencies, but happy to review more if helpful. the adjustment order of adj -> sex ploidy adjustment -> homalt hotfix looks like the same as v3's, so that LGTM.

one thing I didn't realize until reviewing this PR is that we didn't adjust the quality histograms between v3 and v4 for the genomes, which makes me think we shouldn't adjust these for v5 either. maybe we should discuss this at a meeting?

)


def mt_hists_fields(mt: hl.MatrixTable) -> hl.StructExpression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move this function (minus the high ab het) into the coverage/AN PR? the qual hists needs to be calculated on the dense MT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually also, it doesn't look like we adjusted the qual hists for v4 from v3:

def get_histograms(ht: hl.Table, v3_sites_ht: hl.Table) -> hl.Table:
. maybe we should leave the quality hists as they are since they already do not reflect the 76,215 genomes in v4 vs the 76,156 genomes in v3?

"""
logger.info("Loading and preparing VDS for consent withdrawal samples...")

vds = get_gnomad_v4_genomes_vds(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I should move the code for get_gnomad_v5_genomes_vds out of the coverage/AN PR so you can also use it here?

Comment on lines +146 to +149
consent_samples_list = consent_samples_ht.s.collect()

logger.info("Filtering VDS to consent withdrawal samples...")
vds = hl.vds.filter_samples(vds, consent_samples_list, keep=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
consent_samples_list = consent_samples_ht.s.collect()
logger.info("Filtering VDS to consent withdrawal samples...")
vds = hl.vds.filter_samples(vds, consent_samples_list, keep=True)
logger.info("Filtering VDS to consent withdrawal samples...")
vds = hl.vds.filter_samples(vds, consent_samples_ht, keep=True)

filter_samples also accepts a Table here which would avoid the collect

)
# For genomes, fixed_homalt_model is always False since we apply v3-style correction to all samples
# (following v3 and v4 genomes approach - no GATK version-based differentiation)
vmt = vmt.annotate_cols(fixed_homalt_model=hl.bool(False))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this script is only going to change the gnomad genomes (and this will always be False), I vote we remove this field and update high_ab_het to no longer expect it

)

vds = hl.vds.VariantDataset(vds.reference_data, vmt)
vds = vds.checkpoint(new_temp_file("consent_samples_vds", "vds"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add this checkpoint because of the sample filtering above?

vmt = vds.variant_data
vmt = vmt.annotate_rows(v4_af=v4_freq_ht[vmt.row_key].freq[0].AF)

# This follows the v3/v4 genomes workflow for adj and sex adjusted genotypes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this comment also mention that the homalt hotfix is applied after this also for consistency with v3, even though that should actually happen first?

the correct order is homalt hot fix -> adjust sex ploidy -> annotate adj. if I read this right, it looks like we might have done adjust sex ploidy -> annotate adj -> homalt hot fix for v4?

Comment on lines +190 to +201
ab_cutoff = 0.9
ab_expr = vmt.AD[1] / vmt.DP
vmt = vmt.select_entries(
"AD",
"DP",
"GQ",
"_het_non_ref",
"adj",
GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
_het_ab=ab_expr,
_high_ab_het_ref=(ab_expr > ab_cutoff) & ~vmt._het_non_ref,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove these (het_ab, high_ab_het_ref) if we move the hist code into the coverage/AN PR since hom_alt_depletion_fix will recalculate them

    vmt = vmt.select_entries(
        "AD",
        "DP",
        "GQ",
        "_het_non_ref",
        "adj",
        GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
    )

consent_freq_ht.AC[i] > 0,
consent_freq_ht.AC[i]
/ hl.float32(866 * 2), # consent_ans_ht[consent_freq_ht.key].AN[i],
0.0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to explicitly set AF to 0.0 with this if_else? could you do something like hl.float64(consent_freq_ht.AC[i] / consent_freq_ht.AN[i])?

return consent_freq_ht.checkpoint(new_temp_file("consent_freq", "ht"))


def _subtract_consent_frequencies_and_histograms(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the histogram subtraction be moved to the coverage/AN code? should we even adjust the qual hists (https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/annotations/generate_freq_genomes.py#L1135)?

4. Calculating frequencies and age histograms for consent withdrawal samples
5. Subtracting both frequencies and age histograms from v4 frequency HT
6. Only overwriting fields that were actually updated in the final output
7. Computing FAF, grpmax, gen_anc_faf_max, and inbreeding coefficient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also move inbreeding into the coverage/AN code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants