v5 freq ht generation #720


Draft

mike-w-wilson wants to merge 66 commits into main from mw/v5_freq_calc

+1,403 −0

Contributor

mike-w-wilson commented Sep 30, 2025

No description provided.

mike-w-wilson added 30 commits

July 24, 2025 13:42


          Add v5 frequency calculation script framework

8c13f9a

- Create annotations directory structure for v5
- Add main frequency calculation script with differential analysis approach
- Support for both gnomAD and All of Us datasets
- Framework for identifying samples to remove due to relatedness/ancestry changes
- Age histogram calculation functionality
- Resource management and pipeline structure


          Enhance v5 frequency calculation with coverage integration and ancestry change detection

aec9318

- Add coverage data integration using AN from coverage computation
- Implement ancestry change detection between v4 and v5
- Add comprehensive documentation in README.md
- Enhance sample identification logic for both gnomAD and All of Us datasets
- Add error handling for missing resources
- Support for ancestry change frequency calculations
- Improve resource management and pipeline structure


          Add utility functions for coverage integration

705c452


          Rewrite v5 frequency calculation to use v4 frequency HT as base

6f30ff2


          Update documentation to reflect correct v5 frequency calculation approach

491143d


          Update v5 frequency calculation to use VDS approach

40d2560


          Fix TODOs and uncomment function calls in v5 frequency script

bbabdb3


          Update frequency script to our style to increase readability

d53cf8e


          Merge branch 'main' into mw/v5_freq_calc

370693b


          Implement v5 frequency generation with gnomAD consent withdrawal handling and AoU processing

e98ad90

- Refactor process_gnomad_dataset to handle consent sample withdrawals by subtracting frequencies from v4 freq table
- Implement efficient AoU processing using variant_data + all-sites AN approach
- Add comprehensive utility functions and checkpoints for performance
- Integrate FAF, grpmax, and age histogram calculations
- Add group membership resource integration for both gnomAD and AoU datasets
- Include robust error handling and logging throughout pipeline
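
The consent-withdrawal subtraction described in this commit comes down to per-variant arithmetic on call statistics. The following is a minimal sketch in plain Python rather than Hail, assuming each frequency entry carries `AC`, `AN`, and `homozygote_count`. `CallStats` and `subtract_call_stats` are hypothetical names, not functions from this PR, the clamp at zero is only an illustrative guard, and the counts are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallStats:
    """Hypothetical stand-in for one entry of a frequency array."""

    AC: int
    AN: int
    homozygote_count: int

    @property
    def AF(self) -> Optional[float]:
        # AF is undefined when no alleles were called at the site.
        return self.AC / self.AN if self.AN > 0 else None


def subtract_call_stats(base: CallStats, removed: CallStats) -> CallStats:
    """Remove the withdrawn samples' contribution from the base counts."""
    return CallStats(
        AC=max(base.AC - removed.AC, 0),
        AN=max(base.AN - removed.AN, 0),
        homozygote_count=max(base.homozygote_count - removed.homozygote_count, 0),
    )


# Made-up counts for a single variant.
v4 = CallStats(AC=120, AN=152_312, homozygote_count=3)
consent = CallStats(AC=2, AN=1_732, homozygote_count=0)
v5 = subtract_call_stats(v4, consent)
print(v5, v5.AF)
```

AF is recomputed from the adjusted AC and AN rather than subtracted, since frequencies are not additive across sample sets.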


          Remove redundant functions, add proper AN calc

305b9a0


          Move to sparse aggs for consent ACs,homalt

f71f81e


          Add resources and remove unused imports and functions, clarify what dataset is being used

57d1910


          Only do a single pass of the data for freq field aggs


          Drop v5 downsamplings from constants

b2711d6


          Update annotation resources

9652a2e


          Update annotation constants

028f62a


          Correct VDS imported

125da25


          Add remove hard filtered samples false to get around unfound file

63f0549


          Update test args for data test or runtime test

9c4796f


          Update process dataset functions to use utility functions, increasing readability

a626890


          Merge remote-tracking branch 'origin/main' into mw/v5_freq_calc

f7e09c6


          Correct group membership ht call

18aab39


          Correct group membership ht calls

709c462


          Fix filter_partitions call, int to a list

ba767e9


          Properly handle GATK versions in hom alt depletion correction for gnomad vs aou

23653a1


          Mock AN for testing

ba4ba85


          Apply v3 fix to freq hom alt depletion

b31c912


          Update freq calc

c605c0e


          Correct fold in ac hom alt calc

a6b1b91

mike-w-wilson added 29 commits

October 1, 2025 15:56


          int64 -> int32 in freq struct

8a29a1c


          int64 -> int32 for hom alt in freq struct

624393f


          reorder freq struct for join

2f2d4e0


          Spread out global indexing to avoid chain error

f3d4edb


          More global rearrangement for merging

69f6a73


          More global rearrangement for merging

c68450a


          Typo in global declaration

245c324


          Another global attempt

efe795a


          Another global attempt

d830564


          Do not create intermediate HT with new expressions

e979fd1


          Change to select to get around mismatch

04848fb


          Use hl.literal since index_globals appears to not work

43c9dad


          Add show for merge testing

ba3562a


          Set negatives to 0 to investigate negatives

ff0661f


          Drop meta print

0cecd0a


          Filter to consent variants

8fe908a


          Drop second meta print

780da86


          Copy v3/v4 genomes sex ploidy, adj order

7ef8502


          Add notes for consent hom alt fix approach

3f3c91d


          Reference same vmt for adj, hom alt pass

020b16c


          Fix adj annotation

bc0f4cd


          Filter to common sites before doing any work in freq or hists

bcacd84


          Process hists and freqs together

2fc7b24


          Remove unused functions for AoU freq

c89185b


          Fix genotype call in age hists

11e31d1


          Fix gt call in age hists, second attempt

009e209


          Age histogram does not expect an integer...

7c8cf39


          Remove unused overwrite

267888b


          Updated README to new workflow

321f345

ch-kr reviewed


Contributor

ch-kr left a comment

I only really read through mt_hists_fields, _prepare_consent_vds, and _calculate_consent_frequencies, but happy to review more if helpful. the adjustment order of adj -> sex ploidy adjustment -> homalt hotfix looks the same as v3's, so that LGTM.

one thing I didn't realize until reviewing this PR is that we didn't adjust the quality histograms between v3 and v4 for the genomes, which makes me think we shouldn't adjust these for v5 either. maybe we should discuss this at a meeting?

gnomad_qc/v5/annotations/generate_frequency.py

		)


		def mt_hists_fields(mt: hl.MatrixTable) -> hl.StructExpression:

Contributor

ch-kr Oct 3, 2025

should we move this function (minus the high ab het) into the coverage/AN PR? the qual hists need to be calculated on the dense MT

Contributor

ch-kr Oct 3, 2025

actually also, it doesn't look like we adjusted the qual hists for v4 from v3:

gnomad_qc/gnomad_qc/v4/annotations/generate_freq_genomes.py

Line 1135 in a6e0e7f

def get_histograms(ht: hl.Table, v3_sites_ht: hl.Table) -> hl.Table:

maybe we should leave the quality hists as they are since they already do not reflect the 76,215 genomes in v4 vs the 76,156 genomes in v3?

gnomad_qc/v5/annotations/generate_frequency.py

+                  """
+                  logger.info("Loading and preparing VDS for consent withdrawal samples...")
+                  vds = get_gnomad_v4_genomes_vds(

Contributor

ch-kr Oct 3, 2025

maybe I should move the code for get_gnomad_v5_genomes_vds out of the coverage/AN PR so you can also use it here?

gnomad_qc/v5/annotations/generate_frequency.py

Comment on lines +146 to +149

+                  consent_samples_list = consent_samples_ht.s.collect()
+                  logger.info("Filtering VDS to consent withdrawal samples...")
+                  vds = hl.vds.filter_samples(vds, consent_samples_list, keep=True)

Contributor

ch-kr Oct 3, 2025

Suggested change

      
-                  consent_samples_list = consent_samples_ht.s.collect()
                   logger.info("Filtering VDS to consent withdrawal samples...")
-                  vds = hl.vds.filter_samples(vds, consent_samples_list, keep=True)
+                  vds = hl.vds.filter_samples(vds, consent_samples_ht, keep=True)

filter_samples also accepts a Table here, which would avoid the collect

gnomad_qc/v5/annotations/generate_frequency.py

+                  )
+                  # For genomes, fixed_homalt_model is always False since we apply v3-style correction to all samples
+                  # (following v3 and v4 genomes approach - no GATK version-based differentiation)
+                  vmt = vmt.annotate_cols(fixed_homalt_model=hl.bool(False))

Contributor

ch-kr Oct 3, 2025

since this script is only going to change the gnomad genomes (and this will always be False), I vote we remove this field and update high_ab_het to no longer expect it

gnomad_qc/v5/annotations/generate_frequency.py

+                  )
+                  vds = hl.vds.VariantDataset(vds.reference_data, vmt)
+                  vds = vds.checkpoint(new_temp_file("consent_samples_vds", "vds"))

Contributor

ch-kr Oct 3, 2025

did you add this checkpoint because of the sample filtering above?

gnomad_qc/v5/annotations/generate_frequency.py

+                  vmt = vds.variant_data
+                  vmt = vmt.annotate_rows(v4_af=v4_freq_ht[vmt.row_key].freq[0].AF)
+                  # This follows the v3/v4 genomes workflow for adj and sex adjusted genotypes.

Contributor

ch-kr Oct 3, 2025

should this comment also mention that the homalt hotfix is applied after this, for consistency with v3, even though it should actually happen first?

the correct order is homalt hot fix -> adjust sex ploidy -> annotate adj. if I read this right, it looks like we might have done adjust sex ploidy -> annotate adj -> homalt hot fix for v4?
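
To see why the order of these steps can change the result, here is a simplified plain-Python model (not the Hail code in this PR, and it leaves out the adj annotation). `homalt_hotfix` and `adjust_sex_ploidy` are hypothetical stand-ins. The 0.9 allele-balance cutoff comes from the diff, while the 0.01 AF cutoff and the exact conditions are assumptions made for illustration.

```python
from typing import Optional, Tuple

Genotype = Optional[Tuple[int, ...]]  # None models a missing call


def homalt_hotfix(
    gt: Genotype,
    ab: float,
    af: float,
    ab_cutoff: float = 0.9,  # value used in the diff
    af_cutoff: float = 0.01,  # assumed value, for illustration only
) -> Genotype:
    """Recode a ref/alt het with very high allele balance at a common site as hom alt."""
    if gt == (0, 1) and ab > ab_cutoff and af > af_cutoff:
        return (1, 1)
    return gt


def adjust_sex_ploidy(gt: Genotype, haploid_region: bool) -> Genotype:
    """In a haploid region, set het calls to missing and collapse hom calls to one allele."""
    if gt is None or not haploid_region:
        return gt
    if len(set(gt)) > 1:
        return None
    return (gt[0],)


# XY sample on non-PAR chrX, called 0/1 with AB = 0.95 at a site with AF = 0.05.
gt, ab, af = (0, 1), 0.95, 0.05

hotfix_first = adjust_sex_ploidy(homalt_hotfix(gt, ab, af), haploid_region=True)
ploidy_first = homalt_hotfix(adjust_sex_ploidy(gt, haploid_region=True), ab, af)
print(hotfix_first, ploidy_first)
```

In this model, applying the hotfix first keeps the call as a haploid alt, while adjusting ploidy first sets it to missing before the hotfix can see it.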

gnomad_qc/v5/annotations/generate_frequency.py

Comment on lines +190 to +201

+                  ab_cutoff = 0.9
+                  ab_expr = vmt.AD[1] / vmt.DP
+                  vmt = vmt.select_entries(
+                      "AD",
+                      "DP",
+                      "GQ",
+                      "_het_non_ref",
+                      "adj",
+                      GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
+                      _het_ab=ab_expr,
+                      _high_ab_het_ref=(ab_expr > ab_cutoff) & ~vmt._het_non_ref,
+                  )

Contributor

ch-kr Oct 3, 2025

we can remove these (het_ab, high_ab_het_ref) if we move the hist code into the coverage/AN PR since hom_alt_depletion_fix will recalculate them

    vmt = vmt.select_entries(
        "AD",
        "DP",
        "GQ",
        "_het_non_ref",
        "adj",
        GT=adjusted_sex_ploidy_expr(vmt.locus, vmt.GT, vmt.sex_karyotype),
    )

gnomad_qc/v5/annotations/generate_frequency.py

+                                  consent_freq_ht.AC[i] > 0,
+                                  consent_freq_ht.AC[i]
+                                  / hl.float32(866 * 2),  # consent_ans_ht[consent_freq_ht.key].AN[i],
+                                  0.0,

Contributor

ch-kr Oct 3, 2025

do you need to explicitly set AF to 0.0 with this if_else? could you do something like hl.float64(consent_freq_ht.AC[i] / consent_freq_ht.AN[i])?
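
The hard-coded denominator in the diff is 866 × 2 = 1,732 alleles, which treats every one of the 866 samples as having a diploid call at every site. Here is a small plain-Python sketch (not Hail) of how a per-site AN changes the result, and of the choice that has to be made when AN is zero. `allele_frequency` is a hypothetical helper and the site counts are made up.

```python
import math


def allele_frequency(ac: int, an: int) -> float:
    """Return AC / AN, using NaN when no alleles were called at the site."""
    return ac / an if an > 0 else math.nan


N_CONSENT_SAMPLES = 866  # taken from the hard-coded denominator in the diff
FIXED_AN = N_CONSENT_SAMPLES * 2

# Made-up site: 3 alt alleles, but only 800 samples have a called genotype.
ac, site_an = 3, 800 * 2

af_fixed = allele_frequency(ac, FIXED_AN)
af_site = allele_frequency(ac, site_an)
af_uncalled = allele_frequency(0, 0)
print(af_fixed, af_site, af_uncalled)
```

With the fixed denominator, AF is underestimated wherever some samples lack a call, which is one reason to prefer the per-site AN that the commented-out expression in the diff points to.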

gnomad_qc/v5/annotations/generate_frequency.py

		return consent_freq_ht.checkpoint(new_temp_file("consent_freq", "ht"))


		def _subtract_consent_frequencies_and_histograms(

Contributor

ch-kr Oct 3, 2025

should the histogram subtraction be moved to the coverage/AN code? should we even adjust the qual hists (https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/annotations/generate_freq_genomes.py#L1135)?
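
If the histograms are adjusted, the subtraction is bin-wise. This is a minimal plain-Python sketch, assuming both histograms share the same bin edges and use the field names of Hail's `hl.agg.hist` result (`bin_edges`, `bin_freq`, `n_smaller`, `n_larger`). `Hist` and `subtract_hist` are hypothetical names, and the counts are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hist:
    """Hypothetical stand-in for a histogram struct."""

    bin_edges: List[float]
    bin_freq: List[int]
    n_smaller: int
    n_larger: int


def subtract_hist(base: Hist, removed: Hist) -> Hist:
    """Subtract the removed samples' counts from each bin and from both tails."""
    if base.bin_edges != removed.bin_edges:
        raise ValueError("histograms must share bin edges")
    return Hist(
        bin_edges=base.bin_edges,
        bin_freq=[b - r for b, r in zip(base.bin_freq, removed.bin_freq)],
        n_smaller=base.n_smaller - removed.n_smaller,
        n_larger=base.n_larger - removed.n_larger,
    )


edges = [30.0, 40.0, 50.0, 60.0]
v4_hist = Hist(bin_edges=edges, bin_freq=[5, 9, 4], n_smaller=2, n_larger=1)
consent_hist = Hist(bin_edges=edges, bin_freq=[1, 0, 2], n_smaller=0, n_larger=1)
v5_hist = subtract_hist(v4_hist, consent_hist)
print(v5_hist)
```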

gnomad_qc/v5/annotations/generate_frequency.py

+. Calculating frequencies and age histograms for consent withdrawal samples
+. Subtracting both frequencies and age histograms from v4 frequency HT
+. Only overwriting fields that were actually updated in the final output
+. Computing FAF, grpmax, gen_anc_faf_max, and inbreeding coefficient

Contributor

ch-kr Oct 3, 2025

should we also move inbreeding into the coverage/AN code?

