Merge pull request #43 from kids-first/feature/refactor-annot
💫 Refactor Annotation
migbro authored Feb 15, 2024
2 parents b6b4f31 + 0afe96d commit e1f0ae6
Showing 21 changed files with 389 additions and 1,556 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "kf-annotation-tools"]
path = kf-annotation-tools
url = https://github.com/kids-first/kf-annotation-tools
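The `.gitmodules` entry above registers `kf-annotation-tools` as a submodule at the repository root. As a minimal sketch (not part of this commit), the Python below reads such an entry with the standard-library `configparser`; the helper name `read_submodules` and the embedded `GITMODULES_TEXT` constant are illustrative assumptions, and the parsing relies on `.gitmodules` being close enough to INI syntax for this simple case.

```python
import configparser

# Copy of the .gitmodules content added in this commit (embedded for illustration).
GITMODULES_TEXT = """\
[submodule "kf-annotation-tools"]
path = kf-annotation-tools
url = https://github.com/kids-first/kf-annotation-tools
"""


def read_submodules(text):
    """Return {submodule name: {"path": ..., "url": ...}} parsed from .gitmodules text.

    Hypothetical helper: assumes each section is named `submodule "<name>"`
    and carries both a `path` and a `url` key.
    """
    # Disable interpolation so a literal '%' in a URL would not be misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "kf-annotation-tools"
        kind, _, quoted_name = section.partition(" ")
        if kind != "submodule":
            continue
        name = quoted_name.strip('"')
        submodules[name] = {
            "path": parser[section]["path"],
            "url": parser[section]["url"],
        }
    return submodules


if __name__ == "__main__":
    for name, info in read_submodules(GITMODULES_TEXT).items():
        print(f"{name}: {info['path']} <- {info['url']}")
```

Note that a plain `git clone` leaves the submodule directory empty; running `git submodule update --init` (or cloning with `--recurse-submodules`) populates it, which matters because the documentation links below point into `kf-annotation-tools/docs/`.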
3 changes: 2 additions & 1 deletion README.md
@@ -19,7 +19,8 @@ Complete documentation can be found for the main workflow and its subworkflows he
- [Germline Variant Workflow](./docs/GERMLINE_VARIANT_README.md)
- [CNV Variant Workflow](./docs/GERMLINE_CNV_README.md)
- [SNV Variant Workflow](./docs/GERMLINE_SNV_README.md)
- [SNV Annotation Workflow](./docs/GERMLINE_SNV_ANNOT_README.md)
- Leverages the git submodule kf-annotation-tools for variant annotation with VEP 105 and gnomAD
- Details linked in the readme.
- [SV Variant Workflow](./docs/GERMLINE_SV_README.md)

### Other Workflows
9 changes: 5 additions & 4 deletions docs/GATK_GERMLINE_README.md
@@ -1,7 +1,7 @@
# Kids First DRC Single Sample Genotyping Workflow
Kids First Data Resource Center Single Sample Genotyping Workflow. This workflow closely mirrors the [Kids First DRC Joint Genotyping Workflow](https://github.com/kids-first/kf-jointgenotyping-workflow/blob/master/workflow/kfdrc_jointgenotyping_refinement_workflow.cwl).
Kids First Data Resource Center Single Sample Genotyping Workflow. This workflow closely mirrors the [Kids First DRC Joint Genotyping Workflow](https://github.com/kids-first/kf-jointgenotyping-workflow/blob/master/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl).
While the Joint Genotyping Workflow is meant to be used with trios, this workflow is meant for processing single samples.
The key difference in this pipeline is a change in filtering between when the final VCF is gathered by GATK GatherVcfCloud and when it is annotated by VEP bcftools (see [Kids First DRC Germline SNV Annotation Workflow docs](https://github.com/kids-first/kf-germline-workflow/blob/master/docs/GERMLINE_SNV_ANNOT_README.md) ).
The key difference in this pipeline is a change in filtering between when the final VCF is gathered by GATK GatherVcfCloud and when it is annotated by VEP bcftools (see [Kids First DRC Germline SNV Annotation Workflow docs](https://github.com/kids-first/kf-annotation-tools/blob/v1.1.0/docs/GERMLINE_SNV_ANNOT_README.md) ).
Unlike the Joint Genotyping Workflow, a germline-oriented [GATK hard filtering process](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants) is performed and CalculateGenotypePosteriors has been removed.
While somatic samples can be run through this workflow, be wary that the filtering process is specifically tuned for germline data.

@@ -19,7 +19,7 @@ Single 6 GB gVCF on spot instances: 420 minutes & $4.00
1. We recommend using GRCh38 as the reference genome for the analysis; positions in the gVCF should also be GRCh38.
1. Reference locations:
- https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
- kfdrc bucket: s3://kids-first-seq-data/broad-references/
- kfdrc bucket: s3://kids-first-seq-data/broad-references/, s3://kids-first-seq-data/pipeline-references/
- cavatica: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
1. Suggested inputs:
- Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
@@ -36,7 +36,8 @@ Single 6 GB gVCF on spot instances: 420 minutes & $4.00
- wgs_evaluation_regions.hg38.interval_list
- homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz, from ftp://ftp.ensembl.org/pub/release-105/variation/indexed_vep_cache/, then indexed using `convert_cache.pl`
See germline annotation docs linked above.
- gnomad_3.1.1.vwb_subset.vcf.gz
- gnomad_3.1.1.custom.echtvar.zip
1. Optional inputs:
- clinvar_20220507_chr.vcf.gz
- dbNSFP4.3a_grch38.gz
- CADDv1.6-38-gnomad.genomes.r3.0.indel.tsv.gz
2 changes: 1 addition & 1 deletion docs/GERMLINE_CNV_README.md
@@ -74,6 +74,6 @@ https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-
- [Common Workflow Language reference implementation (cwltool)](https://github.com/common-workflow-language/cwltool/)

## References
- KFDRC AWS s3 bucket: s3://kids-first-seq-data/broad-references/
- KFDRC AWS s3 bucket: s3://kids-first-seq-data/broad-references/, s3://kids-first-seq-data/pipeline-references/
- Cavatica: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
202 changes: 0 additions & 202 deletions docs/GERMLINE_SNV_ANNOT_README.md

This file was deleted.

16 changes: 10 additions & 6 deletions docs/GERMLINE_SNV_README.md
@@ -8,7 +8,7 @@ The Kids First Data Resource Center (KFDRC) Single Nucleotide Variant (SNV)
Workflow is a common workflow language (CWL) implementation to generate
SNV calls from an aligned reads BAM or CRAM file. The workflow makes use of
GATK, Freebayes, and Strelka2 callers then performs annotation using VEP,
gnomAD, and ClinVar.
gnomAD.

## Relevant Softwares and Versions

@@ -44,7 +44,7 @@ variation across the samples under analysis.
### GATK Single Sample Germline Variant Discovery

For GATK we use our [Kids First DRC Single Sample Genotyping
Workflow](./docs/GATK_GERMLINE_README.md). This workflow calls variants using a
Workflow](./GATK_GERMLINE_README.md). This workflow calls variants using a
gVCF that is made unless the user provides one themselves.

### Strelka2
@@ -70,8 +70,8 @@ documentation](https://github.com/Illumina/strelka/blob/v2.9.x/docs/userGuide/RE
### Annotation

Variants from all three callers are annotated using the [Kids First DRC
Germline SNV Annotation Workflow](./docs/GERMLINE_SNV_ANNOT_README.md).
Generally, this workflow annotates the workflow using VEP, gnomAD, and ClinVar.
Germline SNV Annotation Workflow](../kf-annotation-tools/docs/GERMLINE_SNV_ANNOT_README.md).
Generally, this workflow annotates the variants using VEP and gnomAD.
For more information on the specific annotations, please see the documentation.

## Input Files
@@ -97,9 +97,13 @@ For more information on the specific annotations, please see the documentation.
NA128 NA12878 0 0 2 2
```
- Annotation

Recommended:
- `gnomad_annotation_vcf`: gnomAD VCF used for annotation
- `clinvar_annotation_vcf`: ClinVar VCF used for annotation
- `vep_cache`: TAR.GZ cache from ensembl/local converted cache

Optional:
- `clinvar_annotation_vcf`: ClinVar VCF used for annotation
- `dbnsfp`: VEP-formatted plugin file, index, and readme file containing dbNSFP annotations
- `cadd_indels`: VEP-formatted plugin file and index containing CADD indel annotations
- `cadd_snvs`: VEP-formatted plugin file and index containing CADD SNV annotations
@@ -132,6 +136,6 @@ For more information on the specific annotations, please see the documentation.
- [Common Workflow Language reference implementation (cwltool)](https://github.com/common-workflow-language/cwltool/)

## References
- KFDRC AWS s3 bucket: s3://kids-first-seq-data/broad-references/
- KFDRC AWS s3 bucket: s3://kids-first-seq-data/broad-references/, s3://kids-first-seq-data/pipeline-references/
- Cavatica: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/