A software package for array-based molecular blood group typing.
The software infers blood group genotypes measured on an Illumina custom array from the nucleotide to the phenotypic blood group level. Input files are FinalReport files exported from Illumina GenomeStudio. BloodTypingArray determines the blood group alleles either directly by using a SNP-to-blood genotype dictionary or indirectly based on a machine learning approach using TensorFlow [REF1].
There are two ways to run the software. The most convenient way is to use a conda environment with all dependencies resolved. To do so, you must have conda installed; then copy and paste the following commands into a terminal. If your OS is already set up for development and you do not want to use our preconfigured conda environment, you can instead follow the step-by-step installation below.
# Clone the repository
git clone [email protected]:ikmb/BloodTypingArray.git
# Create conda environment
cd BloodTypingArray/
# The following command creates an environment called bloodtypingarray-1.0
conda env create -f environment.yml
conda activate bloodtypingarray-1.0
# Create executables
./make_executables.sh
You can now jump to the test section of this README.
All executables generated in the following steps are usually located in a sub-folder dist/Release/GNU-Linux/ of the respective project. The last part (GNU-Linux) may differ on your OS, so adapt the paths accordingly. Lines 21 and 22 of the file ./DeepBloodArray/classifyFinalReports.py may also require adaptation.
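Because the platform folder may differ, the lookup can be scripted instead of hard-coding the path. The following is a minimal sketch, not part of the package: the helper name `find_executable` is hypothetical, and the case-insensitive match is an assumption made because this README spells the binary both as bloodArray and bloodarray.

```python
from pathlib import Path


def find_executable(project_dir, name):
    """Return the first file called `name` (case-insensitive) found under
    <project_dir>/dist/Release/<platform>/, or None if nothing matches."""
    release_dir = Path(project_dir) / "dist" / "Release"
    # The platform folder is "GNU-Linux" in this README but may differ per OS,
    # so every sub-folder of dist/Release/ is searched.
    for candidate in sorted(release_dir.glob("*/*")):
        if candidate.is_file() and candidate.name.lower() == name.lower():
            return candidate
    return None
```

For example, `find_executable("bloodArray", "bloodArray")` would be called from the repository root after the build.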
- Install the required tools
sudo apt-get install build-essential
Please make sure you also have the static versions of glibc and libstdc++ installed. Which packages you must install depends on the OS you are running:
# CentOS
sudo yum install glibc-static libstdc++-static -y
# Ubuntu
sudo apt-get install libc6-dev
# ...
- Build MyTools
MyTools is a collection of useful C++ classes/methods and must be compiled before the other C++ projects.
cd MyTools/
make all
cd ..
The library can be found under: dist/Release/GNU-Linux/
- Build bloodArray
BloodArray reads FinalReport files exported from Illumina GenomeStudio and returns a tab-delimited table with the inferred blood group alleles. All output goes to stdout; log messages go to stderr. This is the experimental direct caller part of the software package. If you are interested in this part of the software, you should take the logic from the phenotype() functions and reimplement it in a language of your choice.
cd bloodArray/
make all
cd ..
The executable can be found under: dist/Release/GNU-Linux/
Run like:
bloodArray FinalReport1.txt [FinalReport2.txt ... FinalReportN.txt]
- Build FinalReportToEvoker
FinalReportToEvoker generates evoker file(s) from FinalReport-files.
cd FinalReportToEvoker/
make all
cd ..
The executable can be found under: dist/Release/GNU-Linux/
The Evoker file format usually consists of binary plink [REF2] files plus an extra file with the Allele-AB intensities in binary format. We need these intensities for the TensorFlow classifier. With FinalReportToEvoker we generate a fam file (sample annotation), a bim file (SNP annotation) and a bnt file (the Allele-AB intensities). We do not generate the bed file (genotypes in binary plink format), as we do not need the genotypes.
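To inspect the sample and SNP annotation after conversion, the two text files can be read with a few lines of Python. This is a minimal sketch that assumes FinalReportToEvoker writes the standard six-column, whitespace-separated PLINK layout for fam and bim files; that is not verified against the tool. The names `FamRow`, `BimRow`, `read_fam` and `read_bim` are hypothetical. The bnt file is not handled here because its binary layout is not described in this README.

```python
from typing import NamedTuple


class FamRow(NamedTuple):
    family_id: str
    sample_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str


class BimRow(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: str
    position: int
    allele_1: str
    allele_2: str


def read_fam(path):
    """Read a PLINK .fam file (assumed layout: six whitespace-separated columns)."""
    with open(path) as handle:
        return [FamRow(*line.split()) for line in handle if line.strip()]


def read_bim(path):
    """Read a PLINK .bim file (assumed layout: six whitespace-separated columns)."""
    rows = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip empty lines
            chrom, snp_id, dist, pos, a1, a2 = line.split()
            rows.append(BimRow(chrom, snp_id, dist, int(pos), a1, a2))
    return rows
```

With the file names used in the test section, `read_fam("out.fam")` and `read_bim("out.bim")` would list the converted samples and SNPs.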
NOTE: FinalReportToEvoker does not validate its parameters. If something goes wrong, it crashes without a meaningful message. Please run like:
FinalReportToEvoker consider_those.csv OUTPUTFAMFILE OUTPUTBIMFILE OUTPUTBNTFILE FINALREPORTFILE1 [FINALREPORTFILE2 ... FINALREPORTFILEN]
consider_those.csv is a text file with two comma-separated columns: one with the antigen/allele and the other with the required probe_set_ids. This file can be found in the folder DeepBloodArray.
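The mapping in consider_those.csv can be loaded for inspection as follows. This is a minimal sketch under explicit assumptions: the README does not state the column order, whether a header row exists, or whether each row holds a single probe_set_id, so the sketch assumes antigen first, probe_set_id second, no header, and one pair per row. The column indices are parameters so they can be swapped, and the function name `load_probe_sets` is hypothetical.

```python
import csv
from collections import defaultdict


def load_probe_sets(lines, antigen_col=0, probe_col=1):
    """Group probe_set_ids by antigen/allele.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Column order is an assumption; adjust antigen_col/probe_col if needed.
    """
    probes_by_antigen = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        antigen = row[antigen_col].strip()
        probes_by_antigen[antigen].append(row[probe_col].strip())
    return dict(probes_by_antigen)
```

A typical call would be `with open("DeepBloodArray/consider_those.csv", newline="") as handle: probes = load_probe_sets(handle)`.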
DeepBloodArray is a Python project containing script files that are used to infer blood group alleles for the blood groups Rh and MNS. It needs an appropriate environment and is mainly based on the following two script files:
- trainAndEvaluate.py trains a new classifier
- classifyFinalReports.py takes a final report as input and returns a json file with the blood group alleles
The folder Models contains pre-trained models.
- Install miniconda3
- environment setup:
conda create -n MyEnvName
conda activate MyEnvName
conda install -c conda-forge tensorflow scikit-learn pandas -y
The neural network was constructed using TensorFlow's Keras API. The architecture consists of three stacked layers: an input layer with 9 neurons, a hidden layer with 6 neurons, and an output layer with one neuron. ReLU is used as the activation function throughout the network, except for the output neuron, which uses a sigmoid activation that produces values between zero and one. The input shape is determined by the number of SNPs used for training and therefore varies between antigens. The model was trained for 100 epochs with RMSprop as the optimizer and binary cross-entropy as the loss function. Because the problem and the objective function are framed as binary classification, the model's predictions should be interpreted as class probabilities.
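To make the layer sizes and activations concrete, here is a dependency-free sketch of the forward pass and the loss. It is written in plain Python rather than with Keras, it is not the code in trainAndEvaluate.py, and the weights are random placeholders instead of trained values. It assumes that "input layer with 9 neurons" means a first fully connected layer of 9 units fed by the SNP intensities. The value `n_snps = 4` and the example intensities are hypothetical.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Numerically stable form; the result always lies between 0 and 1.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W * inputs + b) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights and zero biases (not trained values)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(intensities, layers):
    """Forward pass: 9 ReLU units, 6 ReLU units, 1 sigmoid unit."""
    hidden = dense(intensities, *layers[0], relu)
    hidden = dense(hidden, *layers[1], relu)
    return dense(hidden, *layers[2], sigmoid)[0]


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


rng = random.Random(0)
n_snps = 4  # hypothetical; the real value depends on the antigen
layers = [init_layer(n_snps, 9, rng), init_layer(9, 6, rng), init_layer(6, 1, rng)]
p = predict([0.1, 0.8, 0.3, 0.5], layers)  # hypothetical intensities
print(f"probability of the positive class: {p:.3f}")
```

Because the last activation is a sigmoid and the loss is binary cross-entropy, the single output can be read directly as the probability that the antigen is present; a threshold such as 0.5 would turn it into a call.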
cd BloodTypingArray
bloodArray/dist/Release/GNU-Linux/bloodarray DeepBloodArray/test/FinalReport1.txt
returns:
Sample_ID filename ABO Rh Lutheran Kell Duffy Kidd Diego Yt Scianna Dombrock Colton Landsteiner-Wiener CROM Knops JR LAN Vel Indian MNS Rh
Sample_ID filename ABO RH LU KEL FY JK DI YT SC DO CO LW CROM KN JR LAN VEL Indian MNS RH
Sample_ID filename ABO RHD BCAM KEL ACKR1 SLC14A1 SLC4A1 ACHE ERMAP ART4 AQP1 ICAM4 CD55 CR1 ABCG2 ABCB6 SMIM1 CD44 GYPA,GYPB RHCE
Sample_ID filename 001 004 005 006 008 009 010 011 013 014 015 016 021 022 032 033 034 023 002 004
pseudoID FinalReport1.txt A D. Lu(a-b+),Au(a-b+),Lu8+Lu14- kk,Kp(a-b+),Js(a-b+) Fy(a-b+) Jk(a-b+) Di(a-b+),Wr(a-b+) Yt(a+b-) Sc1+Sc2- Do(a+b+),Hy+,Jo+ Co(a+b-), LW(a+b-) Cr(a+),Tc(a+b-c-) Kn(a+b-),McC(a+b-),Vil- Jr(a+) Lan+ Vel+ #N/A
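The table above can be post-processed in Python. The sketch below assumes the fields are tab-separated (as stated for bloodArray) and that the output always starts with the four header rows seen in this example: system name, ISBT symbol, gene, and ISBT number. Rows are keyed by the gene row because the name "Rh" appears twice in the first header row. The function name `parse_blood_array_output` is hypothetical, and the demo uses only the first four columns of the example.

```python
def parse_blood_array_output(text, n_header_rows=4, key_row=2):
    """Split bloodArray output into header rows and one dict per sample.

    key_row selects which header row provides the dict keys
    (2 = gene symbols, which are unique in the example output).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    headers = [line.split("\t") for line in lines[:n_header_rows]]
    keys = headers[key_row]
    samples = []
    for line in lines[n_header_rows:]:
        values = line.split("\t")
        # Pad short rows so every column key gets a value.
        values += [""] * (len(keys) - len(values))
        samples.append(dict(zip(keys, values)))
    return headers, samples


demo = "\n".join([
    "Sample_ID\tfilename\tABO\tRh",
    "Sample_ID\tfilename\tABO\tRH",
    "Sample_ID\tfilename\tABO\tRHD",
    "Sample_ID\tfilename\t001\t004",
    "pseudoID\tFinalReport1.txt\tA\tD.",
])
headers, samples = parse_blood_array_output(demo)
print(samples[0]["ABO"], samples[0]["RHD"])
```

In practice the text would come from the captured stdout of the bloodArray call.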
FinalReportToEvoker/dist/Release/GNU-Linux/finalreporttoevoker DeepBloodArray/consider_those.csv out.fam out.bim out.bnt DeepBloodArray/test/FinalReport1.txt
This should generate the three output files out.fam, out.bim and out.bnt.
The classifier test uses the two executables tested before and runs the classifier. Finally, it generates a JSON output.
# If your conda environment is not activated, please activate it with
conda activate MyEnvName
Run the script:
python3 DeepBloodArray/classifyFinalReports.py
This should create the result file data_sampleID.json.
python3 DeepBloodArray/trainAndEvaluate.py --output /your/output/directory
This should run the training of the classifiers for different antigens ['c','C','e','E','M','N','s','S']
The output directory will contain the trained models (*.mdl) and several plots: an overview plot of the score distribution of the validation samples (20% of all samples), and scatter plots for every SNP showing the allele_AB intensities, color-coded by the group to which each sample belongs. The training data provided here is reduced to HGDP individuals only. To reproduce the published work, you need to request the German cohort data, which are subject to controlled-access data protection, from the PopGen 2.0 Network (P2N) biobank (access token: P2N_859BH) and add them before running the training (see next section). The trained models provided in this repository were trained with HGDP and P2N samples.
The raw data are stored in rawData_HGDP_only.zip and this archive contains 334 FinalReport files (exported raw data in text format). These are HGDP samples only. To receive the North German samples, please send a request to "PopGen 2.0 Netzwerk (P2N)", [email protected] quoting the access token P2N_859BH.
- Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker P, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng X. TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016. pp. 265-283.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559-75. doi: 10.1086/519795. Epub 2007 Jul 25. PMID: 17701901; PMCID: PMC1950838.