browser/about/acofone/ac-of-one-part-one.md
Lines changed: 2 additions & 2 deletions
@@ -8,6 +8,6 @@ The vast majority of variants being discovered within large population datasets
Precision medicine research implicitly requires exploring individual genetic variants within the human genome. Because every individual’s genome harbors extensive unique variation, the vast majority of variants in any large-scale study – and especially those novel variants – tend to be extremely rare. In fact, our expectation is that the majority of unique variants in any large genomic dataset will be present in at most 1 or 2 participants. It is therefore paramount that information about this critical class of variation is distributed to the research community.

-This paper from the Exome Aggregation Consortium (ExAC) highlights the fact that most variants from a large, diverse dataset will be rare1. Figure 1c, copied below, shows that more than 50% of the variants in the exome are singletons, present in only 1 individual. Similarly, unpublished data from the gnomAD version 4.1 exomes (restricting analysis to high-quality variants in canonical transcripts) also shows that most variants discovered across over 730k individuals are rare.
+[This paper](https://www.nature.com/articles/nature19057) from the Exome Aggregation Consortium (ExAC) highlights the fact that most variants from a large, diverse dataset will be rare<sup>1</sup>. Figure 1c, copied below, shows that more than 50% of the variants in the exome are singletons, present in only 1 individual. Similarly, unpublished data from the gnomAD version 4.1 exomes (restricting analysis to high-quality variants in canonical transcripts) also shows that most variants discovered across over 730k individuals are rare.
browser/about/acofone/ac-of-one-part-three.md
Lines changed: 6 additions & 4 deletions
@@ -1,3 +1,5 @@
+<br />
+
## Rare variants also tend to have the largest effect sizes

Biologically, the process of negative selection tends to decrease the frequency of functional damaging variants, which means that variants with the largest effect sizes are more likely to be rare. In other words, it is often the very rare variants that are of most scientific interest to researchers.
@@ -12,16 +14,16 @@ In theory, if one obfuscates or randomizes the counts/frequencies of variants, i
## Precedents through NIH and global initiatives

-It is standard in the genomics field to display exact allele counts on public browsers as seen in gnomAD (gnomad.broadinstitute.org), All of Us (https://databrowser.researchallofus.org/snvsindels) and UK Biobank (genebass.org), as it
-presents a very low risk of re-identification. Ultimately, the analysis and reporting of rare variants in research manuscripts present minimal added risk once those variants and their frequencies are already present in a public browser (see below). And there is precedent for allowing such analysis in numerous scientific programs (many of which are NIH-funded). For example, the NHGRI-funded Clinical Genome Resource (ClinGen), working closely with policy leaders at NIH and the GA4GH, published guidance to laboratories stating that submission of classified variants, associated with phenotype, was allowable without consent and even if limited to a single observation, given the low risk to individuals and large benefit to science and medicine5 . This approach was endorsed by leaders in the UK in a publication6 documenting agreement with these principles and noting allowance under the General Data Protection Regulation (GDPR). Furthermore, the ability to publish results of rare variant associations has also been adopted by the UK Biobank, leading to widespread benefit to science and medicine without any demonstrated risk. Hundreds of studies of rare variant associations have been published in the past decade based on data released by the UK Biobank without barriers to analysis and publication.
+It is standard in the genomics field to display exact allele counts on public browsers as seen in gnomAD ([gnomad.broadinstitute.org](https://gnomad.broadinstitute.org)), _All of Us_ ([https://databrowser.researchallofus.org/snvsindels](https://databrowser.researchallofus.org/snvsindels)) and UK Biobank ([genebass.org](https://genebass.org)), as it
+presents a very low risk of re-identification. Ultimately, the analysis and reporting of rare variants in research manuscripts present minimal added risk once those variants and their frequencies are already present in a public browser (see below). And there is precedent for allowing such analysis in numerous scientific programs (many of which are NIH-funded). For example, the NHGRI-funded Clinical Genome Resource (ClinGen), working closely with policy leaders at NIH and the GA4GH, published [guidance](https://pubmed.ncbi.nlm.nih.gov/29437798/) to laboratories stating that submission of classified variants, associated with phenotype, was allowable without consent and even if limited to a single observation, given the low risk to individuals and large benefit to science and medicine<sup>5</sup>. This approach was endorsed by leaders in the UK in a [publication](https://pubmed.ncbi.nlm.nih.gov/31886409/)<sup>6</sup> documenting agreement with these principles and noting allowance under the General Data Protection Regulation (GDPR). Furthermore, the ability to publish results of rare variant associations has also been adopted by the UK Biobank, leading to widespread benefit to science and medicine without any demonstrated risk. Hundreds of studies of rare variant associations have been published in the past decade based on data released by the UK Biobank without barriers to analysis and publication.
<br />

## Re-identification risks

-A handful of well-cited publications have shown that, in theory, information about genomic variants is vulnerable to several types of attacks. For instance, it has been shown that the presence/absence of a set of alleles over the genome could allow for a user to probabilistically claim that an individual’s record is in a database (or what is often referred to as a membership inference attack)7 or that their relative is in the database.8 Another risk to participants, in the case where a rare variant is published along with its associated phenotype, is that it could allow for direct linkage to a known genomic record9, thus providing the data user with novel information about a participant. However, these types of attacks all assume worst-case adversarial situations, which is not likely to be the case in a well-governed setting. It is worth adding that in most of the attack scenarios, the user would learn only that a certain individual is a participant in a large biobank, a fact unlikely to lead to harm. Furthermore, the multi-stage attack described in [9] required having access to linked databases that are not even available anymore.
+A handful of well-cited publications have shown that, in theory, information about genomic variants is vulnerable to several types of attacks. For instance, it has been shown that the presence/absence of a set of alleles over the genome could allow for a user to probabilistically claim that an individual’s record is in a database (or what is often referred to as a membership inference attack)<sup>7</sup> or that their relative is in the database.<sup>8</sup> Another risk to participants, in the case where a rare variant is published along with its associated phenotype, is that it could allow for direct linkage to a known genomic record<sup>9</sup>, thus providing the data user with novel information about a participant. However, these types of attacks all assume worst-case adversarial situations, which **is not likely to be the case in a well-governed setting**. It is worth adding that in most of the attack scenarios, the user would learn only that a certain individual is a participant in a large biobank, a fact unlikely to lead to harm. Furthermore, the multi-stage attack described in [9] required having access to linked databases that are not even available anymore.

-In addition to new papers10 arguing that most genomic data can effectively be shared with minimal re-identification risk, we can see in practice that this is indeed the case: countless rare variants have been published in high-profile scientific journals and public databases like ClinVar, and the only known “attacks” have come from the handful of theoretical publications referenced here. For example, over 3.5 million unique variants, classified for pathogenicity towards a specific disease, have been submitted to ClinVar, of which over 75% have been identified by only a single laboratory. In NIH-funded studies of Mendelian disorders, autism, schizophrenia, cardiovascular disease, and many other human disease phenotypes, there have been millions of rare variants identified from genome sequencing and published as novel disease and trait associations that have set these fields in new directions toward understanding disease etiology and the pursuit of targeted therapeutics. To the best of our knowledge, no participants or patients were harmed, whereas trailblazing science and genetic diagnoses have been achieved.
+In addition to new papers<sup>10</sup> arguing that most genomic data can effectively be shared with minimal re-identification risk, we can see in practice that this is indeed the case: countless rare variants have been published in high-profile scientific journals and public databases like [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), and the only known “attacks” have come from the handful of theoretical publications referenced here. For example, over 3.5 million unique variants, classified for pathogenicity towards a specific disease, have been submitted to ClinVar, of which over 75% have been identified by only a single laboratory. In NIH-funded studies of Mendelian disorders, autism, schizophrenia, cardiovascular disease, and many other human disease phenotypes, there have been millions of rare variants identified from genome sequencing and published as novel disease and trait associations that have set these fields in new directions toward understanding disease etiology and the pursuit of targeted therapeutics. To the best of our knowledge, no participants or patients were harmed, whereas trailblazing science and genetic diagnoses have been achieved.

As such, we firmly believe that a policy against sharing low allele counts is protecting against situations that aren’t practical and really only serves to hinder scientific advances.
This is particularly true for structural variants (SVs), as demonstrated in Figure 1 from the gnomAD SV resource.2 Here, we see in panels 1g and 1h that >70% of all SVs observed had allele counts less than 10, and the proportion of singletons is strongly correlated with SV size.
+<br />

-# GRAPHS SHOULD GO HERE - 1
+This is particularly true for structural variants (SVs), as demonstrated in Figure 1 from the gnomAD SV resource.<sup>2</sup> Here, we see in panels 1g and 1h that >70% of all SVs observed had allele counts less than 10, and the proportion of singletons is strongly correlated with SV size.