This is the pre-processed data used in the experiments.
Several versions of the ChemProt, DDI, and GAD datasets exist. Here, we adopt the versions from the Biomedical Language Understanding and Reasoning Benchmark (BLURB), a recent and widely used benchmark. We also use the EU-ADR data from BioBERT.
- The ChemProt, DDI, and GAD datasets are each split into train/validation/test sets, while EU-ADR contains 10-fold sets for cross-validation.
- "EU-ADR_BioBERT (train & validation)" is used for the evaluation with different relation context sizes (detailed in Appendix D of our paper).
- In all of the data, target entities are anonymized with predefined tags, including `@GENE$`, `@CHEMICAL$`, `@DRUG$`, and `@DISEASE$`.
- In ChemProt and DDI, additional tags, `@CHEM-GENE$` and `@DRUG-DRUG$`, are used for overlapping entities. When entity markers are used, `@CHEM-GENE$` and `@DRUG-DRUG$` are surrounded by the `[E1-E2]` tag.
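As a quick illustration of the anonymization scheme, the sketch below locates the entity tags in a sentence. The regular expression, the helper name `find_entity_tags`, and the example sentence are assumptions made for illustration; they are not part of the released data or code.

```python
import re

# Tags listed in this README. The pattern itself is an illustrative assumption.
# Longer tags come first only for readability; the trailing "\$" already
# prevents "@CHEM-GENE$" from being confused with "@CHEMICAL$".
ENTITY_TAG = re.compile(r"@(?:CHEM-GENE|DRUG-DRUG|GENE|CHEMICAL|DRUG|DISEASE)\$")


def find_entity_tags(sentence: str) -> list[str]:
    """Return the anonymized entity tags in order of appearance."""
    return ENTITY_TAG.findall(sentence)


# Hypothetical sentence, not taken from any of the datasets.
example = "@CHEMICAL$ inhibits the activity of @GENE$ in vitro."
print(find_entity_tags(example))
```

A check like this can help confirm that every sample carries the expected tags before training.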
The table below shows the statistics of the biomedical relation extraction datasets.
Dataset | Train | Dev | Test | Total |
---|---|---|---|---|
ChemProt | 18,035 | 11,268 | 15,745 | 45,048 |
DDI | 25,296 | 2,496 | 5,716 | 33,508 |
GAD | 4,261 | 535 | 534 | 5,330 |
EU-ADR | *NA* | *NA* | *NA* | 355 |
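The totals in the table can be re-derived from the split sizes. The following is a minimal sketch that recomputes each total and the share of every split; the dictionary layout and the helper name `split_summary` are illustrative assumptions, and the numbers are copied from the table above.

```python
# Split sizes copied from the table above (EU-ADR is omitted because it
# has no fixed train/dev/test split).
splits = {
    "ChemProt": {"train": 18035, "dev": 11268, "test": 15745},
    "DDI": {"train": 25296, "dev": 2496, "test": 5716},
    "GAD": {"train": 4261, "dev": 535, "test": 534},
}


def split_summary(sizes):
    """Return (total, {split: percentage of total}) for one dataset."""
    total = sum(sizes.values())
    shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}
    return total, shares


for dataset, sizes in splits.items():
    total, shares = split_summary(sizes)
    print(f"{dataset}: total={total:,} shares(%)={shares}")
```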
We adopt the unified version of the PPI benchmark datasets (AIMed, BioInfer, HPRD50, IEPA, LLL) provided by Pyysalo et al. (2008), which has been used by state-of-the-art models.
- In the datasets, the PPI relations are labeled as either positive or negative.
- The data contains 10-fold sets for cross validation.
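For readers unfamiliar with the protocol, 10-fold cross-validation holds out one fold for testing in each round and trains on the remaining nine. The sketch below assumes the folds have already been loaded into Python lists; the function name `cross_validation_rounds` and the toy data are hypothetical, and the actual file layout of the released folds is not described here and may differ.

```python
def cross_validation_rounds(folds):
    """Yield (round_index, train_items, test_items), holding out one fold per round."""
    for i, held_out in enumerate(folds):
        # Training data is the concatenation of every fold except the held-out one.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield i, train, list(held_out)


# Hypothetical toy data: 20 example IDs dealt into 10 folds of 2 items each.
folds = [[k, k + 10] for k in range(10)]

for i, train, test in cross_validation_rounds(folds):
    print(f"round {i}: {len(train)} training items, {len(test)} test items")
```

Reported scores are then typically averaged over the ten rounds.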
The table below shows the statistics of the five PPI benchmark corpora for the positive and negative classes.
Corpus | Positive | Negative |
---|---|---|
AIMed | 1,000 | 4,834 |
BioInfer | 2,534 | 7,132 |
HPRD50 | 163 | 270 |
IEPA | 335 | 482 |
LLL | 164 | 166 |
TOTAL | 4,196 | 12,884 |
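The class balance differs considerably between corpora, which matters when comparing scores across them. This small sketch computes the share of positive samples per corpus from the counts in the table above; the helper name `positive_share` and the data layout are illustrative assumptions.

```python
# (positive, negative) counts copied from the table above.
counts = {
    "AIMed": (1000, 4834),
    "BioInfer": (2534, 7132),
    "HPRD50": (163, 270),
    "IEPA": (335, 482),
    "LLL": (164, 166),
}


def positive_share(positive, negative):
    """Fraction of samples in a corpus that are labeled positive."""
    return positive / (positive + negative)


for corpus, (pos, neg) in counts.items():
    print(f"{corpus}: {positive_share(pos, neg):.1%} positive")

# Column totals, which should match the TOTAL row of the table.
total_pos = sum(pos for pos, _ in counts.values())
total_neg = sum(neg for _, neg in counts.values())
print(f"TOTAL: {total_pos:,} positive, {total_neg:,} negative")
```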
Our PPI annotations with interaction types (enzyme, structural, or negative) are an expanded version of the five PPI benchmark corpora and the BioCreative VI protein interaction dataset (Track 4: Mining protein interactions and mutations for precision medicine (PM)).
- The data is provided as 10-fold sets for cross-validation.
- You can find the annotation rules and comments here.
The table below shows the corpus statistics. In all corpora, annotation was carried out within sentence boundaries, as in the five PPI benchmark corpora. The significant reduction in negative samples from the original data is explained in Section III-A3 of our paper (TODO: add a link).
Corpus | Enzyme | Structural | Negative |
---|---|---|---|
BioCreative VI | 378 | 83 | 0 |
AIMed | 548 | 182 | 1,371 |
BioInfer | 604 | 1,465 | 2,148 |
HPRD50 | 103 | 34 | 87 |
IEPA | 271 | 2 | 224 |
LLL | 163 | 0 | 0 |
TOTAL | 2,067 | 1,766 | 3,830 |
<< Annotation process diagram >>
If you use the Typed PPI data for your research, please cite the following paper.
@inproceedings{park2022extracting,
title={Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information},
author={Park, Gilchan and McCorkle, Sean and Soto, Carlos and Blaby, Ian and Yoo, Shinjae},
booktitle={2022 IEEE International Conference on Big Data (Big Data)},
pages={2052--2061},
year={2022},
organization={IEEE}
}