Using latest 1k Genome dataset for building model #54
Hi Daniel,
Assuming 1 and 2 are possible, I see a path forward. Let me know what you think; I'm curious to hear your thoughts on how to solve number 1 above. Number 2 should be straightforward enough. Kevin
Hi Kevin,
The best file type would be a multi-sample VCF: each record in the VCF would list genotypes for each of the 3,202 genomes at that locus. Breaking these files out by chromosome would work well. Do you know if they will host the files? However, going back to this comment:
If I had the files above, before going through the effort of training a model on these reanalyzed genomes, I would compare the genotypes at the 128 AISNP locations between the original call set and the reanalyzed call set. If there is little variance between the calls in the original set and the calls in the reanalyzed set at those AISNP loci, then I don't see any benefit to training a model on the new data. An interesting quasi-experiment would be to run the existing pretrained model on the reanalyzed genotypes. If the performance is good, then you might not need to retrain a model at all.
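Once a multi-sample VCF is in hand, the comparison described above can be sketched as a simple concordance check. The function and sample data below are invented for illustration and are not part of ezancestry:

```python
# Hypothetical sketch: measure genotype concordance at the AISNP loci
# between the original 1kG call set and the DRAGEN reanalysis.
# `original` and `reanalyzed` map (sample_id, locus) -> genotype string.

def genotype_concordance(original, reanalyzed):
    """Fraction of shared (sample, locus) keys with identical genotypes."""
    shared = original.keys() & reanalyzed.keys()
    if not shared:
        return 0.0
    matches = sum(original[k] == reanalyzed[k] for k in shared)
    return matches / len(shared)

# Toy example with made-up calls at two AISNP loci:
orig = {("HG01082", "rs1426654"): "1|1", ("HG01082", "rs16891982"): "0|1"}
redo = {("HG01082", "rs1426654"): "1|1", ("HG01082", "rs16891982"): "1|1"}
print(genotype_concordance(orig, redo))  # → 0.5
```

If the concordance at the AISNP loci is near 1.0, retraining is unlikely to pay off.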
Hi Kevin,
I tried a couple of queries with what I thought would be the minimum fields needed for comparison. However, I had limited success. Here's what I did:
SELECT
*
FROM
var_nested
WHERE
(chrom, pos) IN (
('chr1', 101709563),
('chr1', 151122489),
('chr1', 159174683),
('chr2', 7968275),
('chr2', 17362568),
('chr2', 17901485),
('chr2', 109513601),
('chr2', 109579738),
('chr2', 136707982),
('chr2', 158667217),
('chr3', 121459589),
('chr4', 38815502),
('chr4', 100239319),
('chr4', 100244319),
('chr4', 105375423),
('chr5', 33951693),
('chr5', 170202984),
('chr5', 6845035),
('chr6', 136482727),
('chr6', 90518278),
('chr7', 28172586),
('chr8', 31896592),
('chr8', 110602317),
('chr8', 122124302),
('chr8', 145639681),
('chr9', 127267689),
('chr10', 94921065),
('chr11', 61597212),
('chr11', 113296286),
('chr12', 112211833),
('chr12', 112241766),
('chr13', 34847737),
('chr13', 41715282),
('chr13', 42579985),
('chr13', 49070512),
('chr13', 111827167),
('chr14', 99375321),
('chr15', 28197037),
('chr15', 28365618),
('chr15', 36220035),
('chr15', 45152371),
('chr15', 48426484),
('chr16', 89730827),
('chr17', 40658533),
('chr17', 41056245),
('chr17', 48726132),
('chr17', 53568884),
('chr17', 62987151),
('chr18', 35277622),
('chr18', 40488279),
('chr18', 67578931),
('chr18', 67867663),
('chr19', 4077096),
('chr20', 62159504),
('chr22', 41697338)
)

This returned two records (note that I've truncated the gts field here):
Then I thought to try querying by rsid, which didn't need a liftover...

SELECT
chrom,
pos,
"gt.alleles"
FROM
var_partby_samples
WHERE
rsid IN (
'rs3737576', 'rs7554936', 'rs2814778',
'rs798443', 'rs1876482', 'rs1834619',
'rs3827760', 'rs260690', 'rs6754311',
'rs10497191', 'rs12498138', 'rs4833103',
'rs1229984', 'rs3811801', 'rs7657799',
'rs16891982', 'rs7722456', 'rs870347',
'rs3823159', 'rs192655', 'rs917115',
'rs1462906', 'rs6990312', 'rs2196051',
'rs1871534', 'rs3814134', 'rs4918664',
'rs174570', 'rs1079597', 'rs2238151',
'rs671', 'rs7997709', 'rs1572018',
'rs2166624', 'rs7326934', 'rs9522149',
'rs200354', 'rs1800414', 'rs12913832',
'rs12439433', 'rs735480', 'rs1426654',
'rs459920', 'rs4411548', 'rs2593595',
'rs17642714', 'rs4471745', 'rs11652805',
'rs2042762', 'rs7226659', 'rs3916235',
'rs4891825', 'rs7251928', 'rs310644',
'rs2024566'
)

But this query returned 0 records... Can you think of any way to modify these queries so they return more results? Depending on how you want to implement ezancestry, I still think you could probably use the pretrained models as-is.
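As an aside, long IN-lists like the ones above are tedious to hand-edit. A small helper can generate the query string from the AISNP rsid list; the function below is a hypothetical sketch, not part of ezancestry or Athena:

```python
# Hypothetical helper: format a list of rsids into an Athena/Presto-style
# IN clause, quoting each id, to avoid hand-editing long query strings.

def rsid_in_clause(table, rsids, columns=("chrom", "pos", '"gt.alleles"')):
    ids = ",\n    ".join(f"'{r}'" for r in rsids)
    cols = ", ".join(columns)
    return f"SELECT {cols}\nFROM {table}\nWHERE rsid IN (\n    {ids}\n)"

sql = rsid_in_clause("var_partby_samples", ["rs3737576", "rs7554936"])
print(sql)
```

The same pattern works for the (chrom, pos) tuple lists.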
Hi Kevin,

_"s3://1000genomes-dragen-3.7.6/data/cohorts/gvcf-genotyper-dragen-3.7.6/hg38/3202-samples-cohort/

The folder contains the results of gVCF Genotyping (joint calling) on the entire cohort of 3,202 samples in the NYGC 1KGP data set. The analysis was performed using DRAGEN v3.7.6 and our hg38-graph-based reference hash table. The results are broken up into per-chromosome multisample VCFs (e.g., 3202_samples_cohort_gg_chr1.vcf.gz)."_

Does this make accessing the needed data for training updated models easier?
Hey Daniel, I was able to query Athena, will compare the genotypes this weekend!

SELECT * FROM var_nested
WHERE (chrom, pos) IN
(('chr1', 101244007),
('chr1', 151150013),
('chr1', 159204893),
('chr2', 7828144),
('chr2', 17181301),
('chr2', 17720218),
('chr2', 108897145),
('chr2', 108963282),
('chr2', 135950412),
('chr2', 157810705),
('chr3', 121740742),
('chr4', 38813881),
('chr4', 99318162),
('chr4', 99323162),
('chr4', 104454266),
('chr5', 33951588),
('chr5', 170775980),
('chr5', 6844922),
('chr6', 136161589),
('chr6', 89808559),
('chr7', 28132967),
('chr8', 32039076),
('chr8', 109590088),
('chr8', 121112062),
('chr8', 144414297),
('chr9', 124505410),
('chr10', 93161308),
('chr11', 61829740),
('chr11', 113425564),
('chr12', 111774029),
('chr12', 111803962),
('chr13', 34273600),
('chr13', 41141146),
('chr13', 42005849),
('chr13', 48496376),
('chr13', 111174820),
('chr14', 98908984),
('chr15', 27951891),
('chr15', 28120472),
('chr15', 35927834),
('chr15', 44860173),
('chr15', 48134287),
('chr16', 89664419),
('chr17', 42506515),
('chr17', 42904228),
('chr17', 50648771),
('chr17', 55491523),
('chr17', 64991033),
('chr18', 37697659),
('chr18', 42908314),
('chr18', 69911695),
('chr18', 70200427),
('chr19', 4077098),
('chr20', 63528151),
('chr22', 41301334))
Thanks for the update! I hope this exploration opens up some exciting options, for either this project or others you are working on :)
Hi Kevin,
Hi @dbrami -- sorry for the delayed response, I came back to look at this and am finally getting somewhere.
Hi Kevin,

Let me know if there's a next step, i.e., if updating the models is warranted at some point :)

(dnafinger) (base) ➜ VCFs head -n 2000 HG01082.hard-filtered.vcf | grep ':PS' | head
chrM 146 . T C . PASS DP=9298;MQ=219.97;LOD=32665.90;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:32665.90:0,9212:1.000:0,3975:0,5237:9212:0,0,3775,5437:0,0,4154,5058:146
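For reference, a record like the one above can be unpacked with a few lines of standard-library Python. This is a minimal sketch assuming the usual tab-delimited VCF 4.x column layout; real-world parsing is better left to a dedicated VCF library:

```python
# Minimal sketch: parse the FORMAT and sample columns of a VCF data line
# to recover per-field values such as GT (genotype) and PS (phase set).

def parse_vcf_sample(line):
    """Map FORMAT keys (column 9) to the first sample's values (column 10)."""
    cols = line.rstrip("\n").split("\t")
    fmt_keys = cols[8].split(":")
    sample_vals = cols[9].split(":")
    return dict(zip(fmt_keys, sample_vals))

# The chrM record shown above, reassembled as a tab-delimited VCF line:
record = "\t".join([
    "chrM", "146", ".", "T", "C", ".", "PASS",
    "DP=9298;MQ=219.97;LOD=32665.90;FractionInformativeReads=0.991",
    "GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS",
    "0|1:32665.90:0,9212:1.000:0,3975:0,5237:9212:0,0,3775,5437:0,0,4154,5058:146",
])
fields = parse_vcf_sample(record)
print(fields["GT"], fields["PS"])  # → 0|1 146
```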
Hi @dbrami I will assume that NaNs are 0|0s, train a model on the DRAGEN data, and then compare the performance of the two models with 5-fold CV. I'll try to do that this weekend.
Hey @dbrami I've updated the analysis here. The 5-fold CV performance (log loss) is better when using the 1kG data compared to DRAGEN. It looks like the DRAGEN SNPs had less variation at several sites. You should be able to run the notebook; I added the annotated DRAGEN AISNPs data file. I'll merge the analysis.
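As a rough illustration of the comparison described above (not ezancestry's actual code; the data, shapes, and classifier below are invented), the NaN → 0|0 fill and 5-fold CV log-loss scoring could be sketched as:

```python
# Illustrative sketch: fill missing genotypes with 0 alt alleles (the
# "assume NaN is 0|0" choice above), then score a simple classifier
# with 5-fold CV log loss on synthetic AISNP-like data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 55)).astype(float)  # 55 AISNPs, 0/1/2 alt-allele counts
X[rng.random(X.shape) < 0.05] = np.nan                # sprinkle in missing calls
y = rng.integers(0, 5, size=300)                      # 5 superpopulation labels
X = np.nan_to_num(X, nan=0.0)                         # NaN -> 0|0 -> 0 alt alleles

scores = cross_val_score(
    LogisticRegression(max_iter=500), X, y, cv=5, scoring="neg_log_loss"
)
print(scores.mean())  # mean negative log loss; closer to 0 is better
```

Running the same scoring on both training sets gives a like-for-like comparison of the 1kG and DRAGEN call sets.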
Hi,
I'm exploring adding ethnic background inference to a sample QC pipeline I'm working on, and this tool seems to check all the boxes.
I was involved in making the Illumina DRAGEN re-analysis of the latest addition of samples to the 1000 Genomes Project available on AWS:
DRAGEN reanalysis of the 1000 Genomes Dataset now available on the Registry of Open Data
Although I can't find a complete aggregate BED file similar to the ones pulled by your fetch script, do you think your pipeline could easily be modified to create a model using this updated data set?
Would love to hear your thoughts on the cost/benefit of this approach.
Thanks,
Daniel Brami