
Commit 790a8ea

Merge pull request #40 from airr-community/development
In preparation for release
2 parents 48027d3 + f275db3 commit 790a8ea

29 files changed, +823 -112 lines changed

.gitignore

+2
@@ -0,0 +1,2 @@
+# emacs backup
+*~

.travis.yml

+2-2
@@ -2,6 +2,6 @@ language: python
 python:
 - 3.6
 install:
-- pip install pyyaml
+- pip install pyyaml pandas xlrd deepdiff
 script:
-- scripts/ensure-consistency.py
+- scripts/check-consistency.py

AIRR_Minimal_Standard_Data_Elements.tsv

+83-83
Large diffs are not rendered by default.

MiAIRR-Elements_NCBI_mapping.xls

39 KB
Binary file not shown.
Three further binary files not shown.

NCBI_implementation/README.md

+4-4
@@ -1,4 +1,4 @@
-![Image](https://github.com/airr-community/airr-standards/raw/master/Images/miairr_logo.png)
+![Image](https://github.com/airr-community/airr-standards/raw/master/images/miairr_logo.png)
 
 _Minimum information about an Adaptive Immune Receptor Repertoire Sequencing Experiment_
 
@@ -20,13 +20,13 @@ elements within these sets are defined
 [here](https://github.com/airr-community/airr-standards/blob/master/AIRR_Minimal_Standard_Data_Elements.tsv). The
 association between these AIRR sets, the associated data elements, and each of the NCBI repositories is shown below:
 
-![Image](https://github.com/airr-community/airr-standards/raw/master/Images/MiAIRR_data_elements_NCBI_targets.png)
+![Image](https://github.com/airr-community/airr-standards/raw/master/images/MiAIRR_data_elements_NCBI_targets.png)
 
 Submission of AIRR sequencing data and metadata to NCBI's public data repositories consists of five sequential steps:
 
 1. Submit study information to [NCBI BioProject](https://submit.ncbi.nlm.nih.gov/subs/bioproject/) using the NCBI web interface.
-2. Submit sample-level information to the [NCBI BioSample repository](https://submit.ncbi.nlm.nih.gov/subs/biosample/) using the [AIRR-BioSample templates](https://github.com/airr-community/airr-standards/raw/master/NCBI_implementation/NCBI%20Templates/AIRR_BioSample_v1.0.xls).
-3. Submit raw sequencing data to [NCBI SRA](https://submit.ncbi.nlm.nih.gov/subs/sra/) using the [AIRR-SRA data templates](https://github.com/airr-community/airr-standards/raw/master/NCBI_implementation/NCBI%20Templates/AIRR_SRA_v1.0.xls).
+2. Submit sample-level information to the [NCBI BioSample repository](https://submit.ncbi.nlm.nih.gov/subs/biosample/) using the [AIRR-BioSample templates](https://github.com/airr-community/airr-standards/raw/master/NCBI_implementation/templates_XLS/AIRR_BioSample_v1.0.xls).
+3. Submit raw sequencing data to [NCBI SRA](https://submit.ncbi.nlm.nih.gov/subs/sra/) using the [AIRR-SRA data templates](https://github.com/airr-community/airr-standards/raw/master/NCBI_implementation/templates_XLS/AIRR_SRA_v1.0.xls).
 4. Generate a DOI for the protocol describing how raw sequencing data were processed using [Zenodo](https://zenodo.org) or an equivalent DOI-granting service.
 5. Submit processed sequencing data with sequence-level annotations to [GenBank](https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/) using AIRR feature tags.
 

@@ -0,0 +1,40 @@
+AIRR Formats WG field name	NCBI BioSample attribute	Keyword relation	Mandatory BioSample attribute	Note
+study_id	bioproject_accession	MAPPED	FALSE	Reference to the associated BioProject record
+subject_id	isolate	MAPPED	TRUE
+synthetic	synthetic	AIRR_CUSTOM	FALSE
+organism	organism	IDENTICAL	TRUE
+sex	sex	IDENTICAL	TRUE
+age	age	IDENTICAL	TRUE	To be IDENTICAL, `age` MUST be age of subject at sampling time point. In contrast, MiAIRR also allows other reference time points for `age`
+age_event	age_event	AIRR_CUSTOM	FALSE	Value for this field MUST be `sampling` to be consistent with BioSample's `age` definition. See `age`
+ancestry_population	population	MAPPED	FALSE	BioSample attributes `(super_)population_*` were not used as they encode keywords from the Coriell Institute, whose suitability for MiAIRR has not yet been fully evaluated
+ethnicity	ethnicity	IDENTICAL	FALSE
+race	race	IDENTICAL	FALSE
+strain_name	strain	MAPPED	FALSE	BioSample has separate attributes for `strain` and `breed`. MiAIRR has only one keyword (`strain_name`) for this information
+linked_subjects	linked_subjects	AIRR_CUSTOM	FALSE	BioSample attributes `family_*` were not used as they suggest a restriction to genetic relationship
+link_type	link_type	AIRR_CUSTOM	FALSE	BioSample attributes `family_*` were not used as they suggest a restriction to genetic relationship
+study_group_description	study_group_description	AIRR_CUSTOM	FALSE
+disease_diagnosis	disease	MAPPED	FALSE
+disease_length	disease_length	AIRR_CUSTOM	FALSE
+disease_stage	disease_stage	IDENTICAL	FALSE
+prior_therapies	prior_therapies	AIRR_CUSTOM	FALSE
+immunogen	immunogen	AIRR_CUSTOM	FALSE
+intervention	treatment	MAPPED	FALSE
+medical_history	medical_history	AIRR_CUSTOM	FALSE
+sample_id	sample_name	MAPPED	TRUE	BioSample attribute `bio_material` has an overlapping meaning, however it is not required for submission
+sample_type	sample_type	IDENTICAL	FALSE
+tissue	tissue	IDENTICAL	TRUE
+anatomic_site	anatomic_site	AIRR_CUSTOM	FALSE
+disease_state_sample	health_state	MAPPED	FALSE
+collection_time_point_relative	collection_time_point_relative	AIRR_CUSTOM	FALSE	BioSample attribute `collection_date` was not used as it defines an absolute date
+collection_time_point_reference	collection_time_point_reference	AIRR_CUSTOM	FALSE
+biomaterial_provider	biomaterial_provider	IDENTICAL	TRUE
+tissue_processing	tissue_processing	AIRR_CUSTOM	FALSE
+cell_subset	cell_type	MAPPED	FALSE
+cell_phenotype	cell_phenotype	AIRR_CUSTOM	FALSE
+single_cell	single_cell	AIRR_CUSTOM	FALSE
+cell_number	cell_number	AIRR_CUSTOM	FALSE
+cells_per_reaction	cells_per_reaction	AIRR_CUSTOM	FALSE
+cell_storage	cell_storage	AIRR_CUSTOM	FALSE
+cell_quality	cell_quality	AIRR_CUSTOM	FALSE
+cell_isolation	cell_isolation	AIRR_CUSTOM	FALSE
+cell_processing_protocol	cell_processing_protocol	AIRR_CUSTOM	FALSE
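
As a rough illustration of how the BioSample mapping table above can be consumed, the sketch below translates MiAIRR field names into BioSample attribute names and leaves names without a mapping entry unchanged. The in-memory `MAPPING_TSV` string and the helpers `load_mapping` and `to_biosample` are hypothetical names introduced for this example; only the column headers and the three sample rows come from the table.

```python
import csv
import io

# Assumption: a three-row excerpt of the mapping table, held in memory
# instead of being read from the real file.
MAPPING_TSV = (
    "AIRR Formats WG field name\tNCBI BioSample attribute\t"
    "Keyword relation\tMandatory BioSample attribute\tNote\n"
    "subject_id\tisolate\tMAPPED\tTRUE\n"
    "organism\torganism\tIDENTICAL\tTRUE\n"
    "cell_subset\tcell_type\tMAPPED\tFALSE\n"
)


def load_mapping(handle):
    """Build a {MiAIRR field name: BioSample attribute} dictionary."""
    reader = csv.DictReader(handle, dialect="excel-tab")
    return {
        row["AIRR Formats WG field name"]: row["NCBI BioSample attribute"]
        for row in reader
    }


def to_biosample(names, mapping):
    """Translate field names, keeping any name that has no mapping entry."""
    return {mapping.get(name, name) for name in names}


mapping = load_mapping(io.StringIO(MAPPING_TSV))
translated = to_biosample({"subject_id", "cell_subset", "synthetic"}, mapping)
print(sorted(translated))
```

Falling back to the original name means that fields not listed in the excerpt pass through untouched, so only genuinely renamed attributes need an entry for the translation to work.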
+34
@@ -0,0 +1,34 @@
+AIRR Formats WG field name	NCBI SRA attribute	Keyword relation	Mandatory SRA attribute	Note
+study_id	bioproject_accession	MAPPING	TRUE
+sample_id	sample_name	MAPPING	TRUE
+nucleic_acid_processing_id	library_ID	MAPPING	TRUE
+NULL	title	DATABASE_SPECIFIC	TRUE
+NULL	library_strategy	DATABASE_SPECIFIC	TRUE
+NULL	library_source	DATABASE_SPECIFIC	TRUE
+NULL	library_selection	DATABASE_SPECIFIC	TRUE
+NULL	library_layout	DATABASE_SPECIFIC	TRUE
+NULL	platform	DATABASE_SPECIFIC	TRUE
+sequencing_platform	instrument_model	MAPPING	TRUE	SRA splits this information into `platform` and `instrument_model`, however the controlled vocabulary of the latter one also often contains the `platform` information. Therefore preference was given to a 1:1 mapping using `instrument_model`
+library_generation_protocol	design_description	MAPPING	TRUE
+NULL	filetype	DATABASE_SPECIFIC	TRUE
+NULL	filename	DATABASE_SPECIFIC	TRUE
+NULL	filename2	DATABASE_SPECIFIC	FALSE
+NULL	filename3	DATABASE_SPECIFIC	FALSE
+NULL	filename4	DATABASE_SPECIFIC	FALSE
+NULL	assembly	DATABASE_SPECIFIC	FALSE
+template_class	template_class	AIRR_CUSTOM	FALSE	SRA keyword `library_source` is related to this field, but makes a number of distinctions (bulk vs. single-cell) that are incompatible with the current definition of `template_class`
+template_quality	template_quality	AIRR_CUSTOM	FALSE
+template_amount	template_amount	AIRR_CUSTOM	FALSE
+library_generation_method	library_generation_method	AIRR_CUSTOM	FALSE	SRA keyword `library_strategy` is related to this field, but uses a controlled vocabulary that is not fine-grained enough to provide the required information of MiAIRR `library_generation_method` (e.g. mode of cDNA generation, UMI, etc.)
+library_generation_kit_version	library_generation_kit_version	AIRR_CUSTOM	FALSE
+pcr_target_locus	pcr_target_locus	AIRR_CUSTOM	FALSE
+forward_pcr_primer_target_location	forward_pcr_primer_target_location	AIRR_CUSTOM	FALSE
+reverse_pcr_primer_target_location	reverse_pcr_primer_target_location	AIRR_CUSTOM	FALSE
+complete_sequences	complete_sequences	AIRR_CUSTOM	FALSE
+physical_linkage	physical_linkage	AIRR_CUSTOM	FALSE
+total_reads_passing_qc_filter	total_reads_passing_qc_filter	AIRR_CUSTOM	FALSE
+read_length	read_length	AIRR_CUSTOM	FALSE
+sequencing_facility	sequencing_facility	AIRR_CUSTOM	FALSE
+sequencing_run_id	sequencing_run_id	AIRR_CUSTOM	FALSE
+sequencing_run_date	sequencing_run_date	AIRR_CUSTOM	FALSE
+sequencing_kit	sequencing_kit	AIRR_CUSTOM	FALSE
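
The SRA table above distinguishes three keyword relations (`MAPPING`, `DATABASE_SPECIFIC`, `AIRR_CUSTOM`). A minimal sketch of grouping SRA attributes by relation follows, for instance to list the attributes that have no MiAIRR counterpart. `SRA_TSV` is a hand-copied four-row excerpt and `attributes_by_relation` is a hypothetical helper; neither is part of the commit.

```python
import csv
import io
from collections import defaultdict

# Assumption: a four-row excerpt of the SRA mapping table, held in memory.
SRA_TSV = (
    "AIRR Formats WG field name\tNCBI SRA attribute\t"
    "Keyword relation\tMandatory SRA attribute\tNote\n"
    "study_id\tbioproject_accession\tMAPPING\tTRUE\n"
    "NULL\ttitle\tDATABASE_SPECIFIC\tTRUE\n"
    "NULL\tfilename2\tDATABASE_SPECIFIC\tFALSE\n"
    "template_class\ttemplate_class\tAIRR_CUSTOM\tFALSE\n"
)


def attributes_by_relation(handle):
    """Group SRA attribute names by their 'Keyword relation' value."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, dialect="excel-tab"):
        groups[row["Keyword relation"]].append(row["NCBI SRA attribute"])
    return dict(groups)


groups = attributes_by_relation(io.StringIO(SRA_TSV))
# Attributes that exist only on the SRA side (MiAIRR field name is NULL).
print(groups["DATABASE_SPECIFIC"])
```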
Binary file not shown.

README.md

+1-1
@@ -1,4 +1,4 @@
-![Image](https://github.com/airr-community/airr-standards/raw/master/Images/miairr_logo.png)
+![Image](https://github.com/airr-community/airr-standards/raw/master/images/miairr_logo.png)
 
 _Minimum information about an Adaptive Immune Receptor Repertoire Sequencing Experiment_
 
Seven files renamed without changes.

scripts/check-consistency.py

+118
@@ -0,0 +1,118 @@
+#! /usr/bin/env python
+
+import sys
+from collections import Counter
+
+import yaml
+import csv
+from deepdiff import DeepDiff
+
+object_map = { '1 / study': 'MiAIRR_Study',
+               '1 / subject': 'MiAIRR_Subject',
+               '1 / diag. & intervent.': 'MiAIRR_Diagnosis',
+               '2 / sample': 'MiAIRR_Sample',
+               '3 / process (cell)': 'MiAIRR_CellProcessing',
+               '3 / process (nucl. acid)': 'MiAIRR_NucleicAcidProcessing',
+               '5 / process (comput.)': 'MiAIRR_SoftwareProcessing',
+               '6 / data (proc. seq.)': 'MiAIRR_Rearrangement' }
+
+with open('AIRR_Minimal_Standard_Data_Elements.tsv', 'r') as ip:
+    dictReader = csv.DictReader(ip, dialect='excel-tab')
+    miairr_elements = [line for line in dictReader]
+
+with open('AIRR_Minimal_Standard_Data_Elements.tsv', 'r') as ip:
+    # header line present
+    assert next(ip).split()[0] == 'MiAIRR'
+
+    table = [line.split('\t')[6].strip() for line in ip]
+    # handle the exceptional 4 / data line
+    assert table.count('') == 1
+    _ = table.pop(table.index(''))
+
+with open('specs/definitions.yaml', 'r') as ip:
+    definitions = yaml.safe_load(ip)
+    properties = [property
+                  for obj in definitions.values()
+                  for property in obj['properties']
+                  if obj.get('discriminator') == 'MiAIRR']
+
+failed = False
+
+# check for uniqueness of fields in AIRR_Minimal_Standard_Data_Elements.tsv
+if len(table) != len(set(table)):
+    print('Duplicate entries found in AIRR_Minimal_Standard_Data_Elements.tsv', file=sys.stderr)
+    for k, v in Counter(table).items():
+        if v > 1:
+            print(f'{k:30} found {v} times in tsv when it should be unique\n', file=sys.stderr)
+    failed = True
+
+# check for differences in fields between specs/definitions.yaml and
+# AIRR_Minimal_Standard_Data_Elements.tsv
+for key in object_map.keys():
+    elements = [element['AIRR Formats WG field name'] for element in miairr_elements
+                if element['MiAIRR data set / subset'] == key]
+    definition = definitions.get(object_map[key])
+    if not definition:
+        print(f'{object_map[key]} not found in definitions.yaml.\n', file=sys.stderr)
+        failed = True
+        continue
+
+    properties = [property for property in definition['properties']]
+    if set(elements) != set(properties):
+        print(f'{object_map[key]} does not match TSV', file=sys.stderr)
+        for field in set(properties) - set(elements):
+            print(f'{field:30} is found in yaml but not tsv for {object_map[key]}', file=sys.stderr)
+        for field in set(elements) - set(properties):
+            print(f'{field:30} is found in tsv but not yaml for {object_map[key]}', file=sys.stderr)
+        failed = True
+
+# check that MiAIRR object definitions contained
+# within AIRR definition
+for definition in definitions.keys():
+    if definitions[definition].get('discriminator') == 'MiAIRR':
+        name = definition.split('_')[1]
+        if not definitions.get(name):
+            print(f'{name} corresponding to {definition} not found in definitions.yaml', file=sys.stderr)
+            failed = True
+            continue
+
+        for prop in definitions[definition]['properties']:
+            if not definitions[name]['properties'].get(prop):
+                print(f'{prop} in {definition} object is not in {name} object.', file=sys.stderr)
+                failed = True
+                continue
+            ddiff = DeepDiff(definitions[definition]['properties'][prop], definitions[name]['properties'][prop], ignore_order=True)
+            if ddiff:
+                print(f'{prop} in {definition} object is not the same object in {name}.', file=sys.stderr)
+                print(ddiff, file=sys.stderr)
+                failed = True
+
+# check consistency with NCBI XML definitions, per @BusseChristian's pseudocode
+# in https://github.com/airr-community/airr-standards/issues/20
+import pandas as pd
+
+miairr_table = pd.read_csv('AIRR_Minimal_Standard_Data_Elements.tsv', sep='\t', header=0, index_col=None)
+miairr_biosample_rows = miairr_table.iloc[:, 0].isin(["1 / subject", "1 / diag. & intervent.", "2 / sample", "3 / process (cell)"])
+miairr_identifiers = set(miairr_table[miairr_biosample_rows].iloc[:, 6])
+miairr_identifiers.add('study_id') # manually add
+miairr_mapping = {}
+with open('NCBI_implementation/mapping_MiAIRR_BioSample.tsv', 'r') as ip:
+    dictReader = csv.DictReader(ip, dialect='excel-tab')
+    for line in dictReader:
+        miairr_mapping[line['AIRR Formats WG field name']] = line['NCBI BioSample attribute']
+mapped_identifiers = set([miairr_mapping.get(name, name) for name in miairr_identifiers])
+
+ncbi_biosample = pd.read_excel('NCBI_implementation/templates_XLS/AIRR_BioSample_v1.0.xls', skiprows=13)
+ncbi_identifiers = set([x.lstrip('*') for x in ncbi_biosample.columns])
+
+if mapped_identifiers != ncbi_identifiers:
+    print('AIRR_Minimal_Standard_Data_Elements.tsv does not match AIRR_BioSample_v1.0.xls', file=sys.stderr)
+    for field in set(mapped_identifiers) - set(ncbi_identifiers):
+        print(f'{field:30} is found in MiAIRR table tsv but not in NCBI Biosample template xls', file=sys.stderr)
+    for field in set(ncbi_identifiers) - set(mapped_identifiers):
+        print(f'{field:30} is found in NCBI Biosample template xls but not in MiAIRR table tsv', file=sys.stderr)
+    failed = True
+
+if failed:
+    print('consistency checks failed', file=sys.stderr)
+    sys.exit(1)
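
The script above applies the same pattern several times: build two sets of field names, then report what is present on only one side. A minimal sketch of that pattern as a reusable function might look as follows. `compare_fields` and its label arguments are hypothetical and not part of the commit; the output is sorted so that messages come out in a stable order, which the original loops over unordered sets do not guarantee.

```python
def compare_fields(left, right, left_label, right_label):
    """Return one message per field name found on only one side."""
    left, right = set(left), set(right)
    messages = []
    # Names present in the first collection but missing from the second.
    for field in sorted(left - right):
        messages.append(
            f"{field:30} is found in {left_label} but not in {right_label}"
        )
    # Names present in the second collection but missing from the first.
    for field in sorted(right - left):
        messages.append(
            f"{field:30} is found in {right_label} but not in {left_label}"
        )
    return messages


messages = compare_fields(["age", "sex"], ["sex", "tissue"], "tsv", "yaml")
for message in messages:
    print(message)
```

An empty return value means the two sides agree, so a caller could set a failure flag with `failed = failed or bool(messages)` and print the messages to `sys.stderr`.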
