Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can only run a few base positions on chromosome X #102

Open
milena-andreuzo opened this issue Jun 10, 2024 · 3 comments
Open

Can only run a few base positions on chromosome X #102

milena-andreuzo opened this issue Jun 10, 2024 · 3 comments

Comments

@milena-andreuzo
Copy link

Hello! I'm new to SHAPEIT and am trying to use it to phase my array data.

Initially I installed SHAPEIT5 from conda on a docker image and ran phase_common on a few chromosomes already (such as chr1, chr19, chr22) using AWS instances.
Documentation says that for running on chromosome X the only difference from the other chromosomes is the presence of a text file containing the haploid samples with --haploids argument. Despite that, I haven't been able to run phasing on chromosome X. Describing my first test:

  • Docker image with installation from bioconda;
  • Instance has 32 Gb memory and 4 vcpus;
  • I am using 1000G reference panel, pre-processed with bcftools:
    • Remove CNVs/SVs;
    • Keep only PASS variants;
    • Left-normalize and decompose multiallelic;
    • Calculate tags AC and AN.
  • Male samples from the reference panel were also added to the haploids.txt file.
  • Using the provided genetic map;
  • Input vcf is the Dragen Array vcf output files merged to get a multisample data from 24 samples and also processed for left-normalization, decompose and tags AC+AN with bcftools.

Resulting log:

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 5.1.1 / release = 2023-05-16
  * Run date      : 06/06/2024 - 15:12:30

Files:
  * Input         : [sample.vcf.gz]
  * Reference     : [reference_panels/GRCh37/chrX.vcf.gz]
  * Haploids      : [haploids.txt]
  * Genetic Map   : [gmaps/GRCh37/chrX.gmap.gz]
  * Output        : [sample.X.phased.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 4 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:
305 Segmentation fault      (core dumped) 

I don't believe it is a memory issue, since chromosome 1 and the others were processed on the same conditions without any problems.

I tried many solutions already... some of them:

  • 128 Gb memory, 16 vcpus (because, well, what if).
  • haploids.txt containing only haploids for the main input file.
  • Removing PAR similar to what was done here, on both input file and reference panel.
  • Running without genetic map.
  • Made a new docker image, installing from the binaries available in the GitHub release.
  • Thought that maybe SHAPEIT can't deal with chromosome containing letters, so renamed chromosome on all files to '23'.
  • Tried the same chunks as here.

None of the above worked. I could however get a result when running on one or two positions... This worked:

SHAPEIT5_phase_common --input sample.vcf.gz --reference reference_panels/GRCh37/chrX.vcf.gz --region X:38009121,X:106244767 --map gmaps/GRCh37/chrX.gmap.gz --haploids haploids.txt --output sample.X.phased.bcf

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected]
  * Version       : 5.1.1 / commit = 5.1.1 / release = 2023-05-16
  * Run date      : 10/06/2024 - 19:40:39

Files:
  * Input         : [sample.vcf.gz]
  * Reference     : [reference_panels/GRCh37/chrX.vcf.gz]
  * Haploids      : [haploids.txt]
  * Genetic Map   : [gmaps/GRCh37/chrX.gmap.gz]
  * Output        : [sample.X.phased.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:
  * VCF/BCF scanning done (1.83s)
      + Variants [#sites=1 / region=X:38009121,X:106244767]
         - 1 sites removed in main panel [not in reference panel]
      + Samples [#target=24 / #reference=2504]
  * VCF/BCF parsing done (1.40s)
      + Genotypes [0/0=91.667%, 0/1=8.333%, 1/1=0.000%, ./.=0.000%, 0|1=0.000%]
      + Reference haplotypes [0=98.383%, 1=1.617%]
  * Haploid parsing [n=1747] (0.00s)
  * Haploid sample mapping (0.00s)
      + #haploids = 7 / #diploids = 17
      + Hets set as missing: n=0 (0.000%)

Setting up genetic map:
  * GMAP parsing [n=89589] (0.57s)
  * cM interpolation [s=0 / i=1] (0.00s)
  * Region length [1 bp / 0.0 cM]
  * HMM parameters [Ne=15000 / Error=0.0001 / #rare=0]

Initializing data structures:
  * Impute monomorphic [n=0] (0.00s)
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * PBWT parameters auto setting : [modulo = 0.049 / depth = 6]
  * PBWT initialization [#eval=1 / #select=1 / #chunk=1] (0.00s)
  * PBWT phasing sweep (0.00s)
  * Build genotype graphs [seg=24] (0.00s)

Burn-in iteration [1/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.4+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [2/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [3/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.3+/-1.6 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [4/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [5/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Pruning iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * Trimming [pc=0.00%]

Burn-in iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Pruning iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * Trimming [pc=0.00%]

Burn-in iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Pruning iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * Trimming [pc=0.00%]

Main iteration [1/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [2/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [3/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [4/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [5/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.3+/-1.6 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Finalization:
  * HAP solving (0.00s)
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * VCF/BCF writing [N=24 / L=1] (0.01s)
  * Indexing files
  * Total running time = 3 seconds

I think I may be missing some silly detail... Ideally I would get the result for the full chromosome to get through with imputation... So I ran out of ideas. Can you help me? Thank you!

@dtaliun
Copy link

dtaliun commented Jun 11, 2024

Hi @milena-andreuzo, When you run chrX non-PAR in chunks, did all the chunks fail or only certain ones?

@milena-andreuzo
Copy link
Author

Hi @dtaliun! Tested the chunks here on chrX non-PAR (main and reference).

Regions X:1-27440787, X:24026048-50656185 and X:47743520-88764754 went straight to that same segmentation fault error.
The others, X:78359646-115258184, X:112836207-139441710 and X:137290831-1000000000 returned sites removed from the main panel because they are not on reference panel and so had no variants to be phased:

Reading genotype data:
  * VCF/BCF scanning done (0.06s)
      + Variants [#sites=0 / region=X:137290831-1000000000]
         - 4762 sites removed in main panel [not in reference panel]

ERROR: No variants to be **phased!**

All those chunks do have variants in the input file.

@milena-andreuzo
Copy link
Author

If I were to run chrX on shapeit5 v1.0.0/v5.0.0, what pre-processing steps would you suggest?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants