gwas-qc

A custom pipeline for harmonizing REGENIE/SAIGE summary stats from a variety of bio-banks.

How to use

Environment & Dependency Set Up

If the steps below do not work please contact [email protected] for assistance.

git clone https://github.com/pozdeyevlab/gwasqc.git
cd gwasqc
conda env create -f environment.yml
conda activate gwas_qc
poetry install
# To test correct install try
python modules/harmonize.py --help

Necessary Input files

meta analysis map file: See details below in the map file section for specific format of this file. Unfortunately, any user introduced errors are likely to orginate from improper formatting in this file.
gnomAD reference files: A directory of tab separated files named 'gnomad_ref_chr.tsv' which contain reference loci, reference and alternate alleles, and allele frequencies according to sex and ancestry. These files can be generates using the pipeline found here. here
gnomAD flagged variant files: A directory of tab separated files named 'flagged_variants_chr.tsv'. These files can be generated following the pipeline found here. here
config.yaml: See details below in config section.

After setting up everything above please test that the dag looks correct with --dry-run before proceeding to full run.

# adjust cores to your system or use 'all' to run in series use 1 core
# test
snakemake --cores 10 --configfile config.yaml --dry-run

# run
snakemake --cores 10 --configfile config.yaml

Input & Output

Input Files

Map file with the following column names (case sensitive):

Column Name	Description
BIOBANK	name of biobank (must be one word, we apologize for the onconvenience but 'Some_Bank' will yield buggy results instead please use 'SomeBank'!)
PHENOTYPE	disease phenotype
SEX	male, female or both - any other string will cause an error
ANCESTRY	EAS, EUR, HIS, AMR.. etc If the ancestry is unknown or the gwas is multi-ethnic leave the cell blank If the provided ancestry is not available in the reference data then mixed ancestry will be used by default
GWAS_SOFTWARE	saige or regenie If saige then summary stats are assumed to be tab separated If regenie then summary stats are assumed to be space separated
CHROM	column name in summary stat containing the chromosome
POS	column name in summary stat containing the position of the variant
Effect_Allele	column name in summary stat containing the effect allele
Non_Effect_Allele	column name in summary stat containing the non-effect allele
Effect_AF	column name in summary stat containing the effect allele frequency
BETA	column name in summary stat containing the beta value
SE	column name in summary stat containing the standard error
PVAL	column name in summary stat containing the p-value
ID	column name in summary stat containing the variant ID assumed to be in chr:pos:ref:alt format
IMPUTE	column name in summary stat containing the imputation values (Optional)
Total_N	column name in summary stat containing the total N
Case_N	column name in summary stat containing the number of case samples
Control_N	column name in summary stat containing the number of control samples
PATH	path to summary stat file
Case_Count	number of cases reported from biobank
Control_Count	number of controls reported from biobank

Config file with the following key-pair values:

Key	Value
map_file	"path/to/map/file”
gnomad_ref_dir	“path/to/gnomad/reference/file/directory” pipeline for generating reference files can be found here
gnomad_flag_dir	"path/to/gnomad/an_flagged/file/directory" some variants in gnomad have a quality warning when the alternate allele is found in less than half of all participants, these variants have been flagged for qc
output_path	“where/to/write/output/files/”

Output Files

Outputs:

Output Type	Path
Aligned variants	`<output_path>/<biobank>_<phenotype>_<sex>_<ancestry>_aligned_to_gnomad.tsv`
Un-Aligned variants	`<output_path>/<biobank>_<phenotype>_<sex>_<ancestry>_not_aligned_in_gnomad.tsv`
Variants without a matching reference position	`<output_path>/<biobank>_<phenotype>_<sex>_<ancestry>_no_position_in_gnomad.tsv`
Un-usable variants	`<output_path>/<biobank>_<phenotype>_<sex>_<ancestry>_invalid_variants.tsv` If alternate allele is invalid ie `CN0` then those variants are removed from analysis
Plots	`<output_path>/plots/<biobank>_<phenotype>_<sex>_<ancestry>.png` scatterplots, histograms, and bar charts to help decide qc cut-offs

Pipeline Overview

For each summary stat file per chromosome:
1. Read the chromosome specific gnomAD reference file into memory, scan only for positions that are present in the gwas summary file. Write variants without a matching position in gnomAD to a scratch file.
2. For summary stat variants with a matching position in gnomAD:
  1. Enforce variant id format ::<non_effect_allele>:<effect_allele>
    1. Variants with missing alleles or structural variants are thrown out.
  2. Create flags for the following:
    - p-value is 0
    - -1e6 < BETA < 1e6
    - -1e6 < SE < 1e6
    - If imputation data is available, imputation is less than imputation threshold (default: 0.3)
    - Variant alleles are palindromic
    - Variant alleles are palindromic and effect allele frequency is between 0.4 - 0.6
    - Effect and non effect alleles match the order found in the variant ID
3. Run modules/align_palindromic_variants.py on palindromic variants and modules/align_non_palindromic_variants.py on non palindromic variants. The reason for this distinction is that palindromic variants are not checked for transcribed matches, as that would be redundant.
4. For non palindromic variants
  1. Check for an exact match
    - Example: 1:100:A:G & 1:100:A:G
  2. Check for an inverse match
    - Example: 1:100:G:A & 1:100:A:G
    - IMPORTANT: The AF & Beta become 1 - AF & -1 * BETA
  3. For variants that align BOTH by exact and inverse match, find the absolute difference in allele frequency (AF) between the gnomAD reported AF and the gwas reported AF. The method that procuces the smallest difference in AF is saved as the true match. These counts are saved to the log file.
  4. If there is not an exact or inverse match then the varaint is tested for an exact transcribed match
    - Example: 1:100:T:C & 1:100:A:G
  5. If there is not an exact, inverse, or transcribed exact match then the variant is tested for an inverse transcribed match
    - Example: 1:100:C:T & 1:100:A:G
    - IMPORTANT:The AF & Beta become 1 - AF & -1 * BETA
5. For palindromic variants
  1. Check for an exact match
    - Example: 1:100:A:T & 1:100:A:T
  2. Check for an inverse match
    - Example: 1:100:T:A & 1:100:A:T
    - IMPORTANT: The AF & Beta become 1 - AF & -1 * BETA
  3. For variants that align BOTH by exact and inverse match, find the absolute difference in allele frequency (AF) between the gnomAD reported AF and the gwas reported AF. The method that procuces the smallest difference in AF is saved as the true match. These counts are saved to the log file.
6. Write the aligned, unaligned, and missing position variants to three respective output files

Per input gwas summary stat create the following plots:

* Scatterplot: colored by alignment method
* Scatterplot: colored by palinfromic or not palindromic
* Scatterplot: colored by chisq p-val > 0.05 or chisq p-val < 0.05
* Scatterplot: Plot of aligned variants with all possible filters applied
* Histogram: abs(Study_AF - Reference_AF)
* Bar plot: abs(Study_AF - Reference_AF) binned into intervals of 0.01 (0 - 1.0)

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
modules		modules
.gitignore		.gitignore
QC_Pipeline.jpg		QC_Pipeline.jpg
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
environment.yml		environment.yml
pyproject.toml		pyproject.toml
scratch.py		scratch.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

gwas-qc

How to use

Environment & Dependency Set Up

Necessary Input files

Input & Output

Input Files

Map file with the following column names (case sensitive):

Config file with the following key-pair values:

Output Files

Outputs:

Pipeline Overview

About

Releases

Packages

Languages

CCPM-IO/gwasqc

Folders and files

Latest commit

History

Repository files navigation

gwas-qc

How to use

Environment & Dependency Set Up

Necessary Input files

Input & Output

Input Files

Map file with the following column names (case sensitive):

Config file with the following key-pair values:

Output Files

Outputs:

Pipeline Overview

About

Resources

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages