Datasets

This section describes the pre-processed data used in the experiments.

Biomedical Relation Extraction benchmark data

Several versions of the ChemProt, DDI, and GAD datasets exist. Here, we adopt the versions from the Biomedical Language Understanding and Reasoning Benchmark (BLURB), a recent and widely used benchmark. We also use the EU-ADR data used in BioBERT.

The ChemProt, DDI, and GAD datasets each consist of train, validation, and test sets, while EU-ADR contains 10-fold sets for cross-validation.
"EU-ADR_BioBERT (train & validation)" is used to evaluate different relation context sizes (detailed in Appendix D of our paper).
In all datasets, target entities are anonymized with predefined tags, including @GENE$, @CHEMICAL$, @DRUG$, and @DISEASE$.
In ChemProt and DDI, additional tags, @CHEM-GENE$ and @DRUG-DRUG$, are used for overlapping entities. When entity markers are used, @CHEM-GENE$ and @DRUG-DRUG$ are surrounded by the [E1-E2] tag.
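
The tag scheme above can be inspected with a small script. The following is a minimal sketch, not the preprocessing code used in the paper: the helper names (`count_tags`, `has_overlapping_entities`) and the example sentence are hypothetical, and only the tag strings themselves come from this README.

```python
import re
from collections import Counter

# Tags listed in this README. Everything else here is an illustrative assumption.
ENTITY_TAGS = ("@GENE$", "@CHEMICAL$", "@DRUG$", "@DISEASE$")
OVERLAP_TAGS = ("@CHEM-GENE$", "@DRUG-DRUG$")

# Escape each tag because "$" is a regex metacharacter.
TAG_PATTERN = re.compile("|".join(re.escape(t) for t in OVERLAP_TAGS + ENTITY_TAGS))


def count_tags(sentence: str) -> Counter:
    """Count how often each anonymization tag occurs in a sentence."""
    return Counter(TAG_PATTERN.findall(sentence))


def has_overlapping_entities(sentence: str) -> bool:
    """Return True if the sentence contains a tag for overlapping entities."""
    return any(tag in sentence for tag in OVERLAP_TAGS)


# Hypothetical sentence written for this example, not taken from the data.
example = "@CHEMICAL$ inhibits the activity of @GENE$ in vitro."
tag_counts = count_tags(example)
print(tag_counts)
print(has_overlapping_entities(example))
```

A check like this can help confirm that each sample contains the expected pair of target tags before training.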

The table below shows the statistics of the biomedical relation extraction datasets.

Dataset	Train	Dev	Test	Total
ChemProt	18,035	11,268	15,745	45,048
DDI	25,296	2,496	5,716	33,508
GAD	4,261	535	534	5,330
EU-ADR	NA	NA	NA	355

PPI benchmark data

We adopt the unified version of the PPI benchmark datasets (AIMed, BioInfer, HPRD50, IEPA, LLL) provided by Pyysalo et al. (2008), which has been used by state-of-the-art models.

In these datasets, each PPI relation is labeled as either positive or negative.
The data contains 10-fold sets for cross-validation.
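
As a rough illustration of how 10-fold sets are typically consumed, the sketch below averages a score over folds. It assumes each fold has already been loaded as a (train, test) pair; the file layout is not described here, so loading is left out. `cross_validate`, `dummy_scorer`, and the toy folds are hypothetical stand-ins, not part of this repository.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# One fold = (training samples, test samples). The sample type is an assumption.
Fold = Tuple[Sequence[str], Sequence[str]]


def cross_validate(
    folds: Sequence[Fold],
    train_and_score: Callable[[Sequence[str], Sequence[str]], float],
) -> Tuple[float, List[float]]:
    """Run the scorer on every fold and return (mean score, per-fold scores)."""
    scores = [train_and_score(train, test) for train, test in folds]
    return mean(scores), scores


def dummy_scorer(train: Sequence[str], test: Sequence[str]) -> float:
    """Toy stand-in for model training and evaluation: the test-set fraction."""
    return len(test) / (len(train) + len(test))


# Two toy folds for demonstration; the real data provides ten.
toy_folds = [(["a", "b", "c"], ["d"]), (["a", "b", "d"], ["c"])]
avg, per_fold = cross_validate(toy_folds, dummy_scorer)
print(avg, per_fold)
```

In practice, `train_and_score` would be replaced by a function that fine-tunes a model on the training split and returns a metric such as F1 on the test split.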

The table below shows the statistics of the five PPI benchmark corpora for the positive and negative classes.

Corpus	Positive	Negative
AIMed	1,000	4,834
BioInfer	2,534	7,132
HPRD50	163	270
IEPA	335	482
LLL	164	166
TOTAL	4,196	12,884
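
The counts above show that most corpora are skewed toward the negative class. The short sketch below recomputes the totals and the positive-class fraction per corpus from the figures in the table; the variable names are chosen for this example only.

```python
# (positive, negative) counts copied from the table above.
ppi_counts = {
    "AIMed": (1000, 4834),
    "BioInfer": (2534, 7132),
    "HPRD50": (163, 270),
    "IEPA": (335, 482),
    "LLL": (164, 166),
}

# Column totals, which should match the TOTAL row.
total_pos = sum(pos for pos, _ in ppi_counts.values())
total_neg = sum(neg for _, neg in ppi_counts.values())

# Fraction of positive samples in each corpus.
positive_ratio = {
    name: pos / (pos + neg) for name, (pos, neg) in ppi_counts.items()
}

for name, ratio in positive_ratio.items():
    print(f"{name}: {ratio:.3f}")
print(f"TOTAL: {total_pos} positive, {total_neg} negative")
```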

Typed PPI data

Our PPI annotations with interaction types (enzyme, structural, or negative) are an expanded version of the five PPI benchmark corpora and the BioCreative VI protein interaction dataset (Track 4: Mining protein interactions and mutations for precision medicine (PM)).

The data is split into 10 folds for cross-validation.
You can find the annotation rules and comments here.

The table below displays the corpus statistics. In all corpora, annotation was carried out within sentence boundaries, as in the five PPI benchmark corpora. The significant reduction in negative samples relative to the original data is explained in Section III-A3 of our paper (TODO: add a link).

Corpus	Enzyme	Structural	Negative
BioCreative VI	378	83	0
AIMed	548	182	1,371
BioInfer	604	1,465	2,148
HPRD50	103	34	87
IEPA	271	2	224
LLL	163	0	0
TOTAL	2,067	1,766	3,830

<< Annotation process diagram >>

Citation

If you use the Typed PPI data for your research, please cite the following paper.

@inproceedings{park2022extracting,
  title={Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information},
  author={Park, Gilchan and McCorkle, Sean and Soto, Carlos and Blaby, Ian and Yoo, Shinjae},
  booktitle={2022 IEEE International Conference on Big Data (Big Data)},
  pages={2052--2061},
  year={2022},
  organization={IEEE}
}
