Datasets

The pre-processed data for the experiments.

Biomedical Relation Extraction benchmark data

Several versions of the ChemProt, DDI, and GAD datasets exist. Here, we adopt the versions from the Biomedical Language Understanding and Reasoning Benchmark (BLURB), a recent and widely used benchmark. We also use the EU-ADR data from BioBERT.

  • The ChemProt, DDI, and GAD datasets are each split into train/validation/test sets, while EU-ADR contains 10-fold sets for cross-validation.
  • "EU-ADR_BioBERT (train & validation)" is used to evaluate different relation context sizes (detailed in Appendix D of our paper).
  • In all datasets, target entities are anonymized with predefined tags, including @GENE$, @CHEMICAL$, @DRUG$, and @DISEASE$.
  • In ChemProt and DDI, additional tags, @CHEM-GENE$ and @DRUG-DRUG$, are used for overlapping entities. When entity markers are used, @CHEM-GENE$ and @DRUG-DRUG$ are surrounded by the [E1-E2] tag.
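
The anonymization described above can be illustrated with a minimal sketch. This is not the preprocessing code used to build the data: the function name `anonymize`, the character-offset span format, and the example sentence are all assumptions made for illustration. Only the tag names come from the list above. An overlapping entity pair would be passed as a single span with a combined tag such as @CHEM-GENE$.

```python
def anonymize(sentence, entities):
    """Replace entity mentions with placeholder tags.

    `entities` is a list of (start, end, tag) character spans that do not
    overlap each other. This span format is a hypothetical one chosen for
    the example, not the format used by the datasets.
    """
    # Replace from right to left so that earlier offsets stay valid.
    for start, end, tag in sorted(entities, reverse=True):
        sentence = sentence[:start] + tag + sentence[end:]
    return sentence


# Hypothetical example sentence with one chemical and one gene mention.
example = "Aspirin inhibits COX-1 in platelets."
spans = [(0, 7, "@CHEMICAL$"), (17, 22, "@GENE$")]

print(anonymize(example, spans))
# prints: @CHEMICAL$ inhibits @GENE$ in platelets.
```

Processing spans from the end of the sentence backwards avoids recomputing offsets after each replacement.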

The table below shows the statistics of the biomedical relation extraction datasets.

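| Dataset  |  Train |    Dev |   Test |  Total |
|----------|-------:|-------:|-------:|-------:|
| ChemProt | 18,035 | 11,268 | 15,745 | 45,048 |
| DDI      | 25,296 |  2,496 |  5,716 | 33,508 |
| GAD      |  4,261 |    535 |    534 |  5,330 |
| EU-ADR   |     NA |     NA |     NA |    355 |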

PPI benchmark data

We adopt the unified version of the PPI benchmark datasets (AIMed, BioInfer, HPRD50, IEPA, LLL) provided by Pyysalo et al. (2008), which has been used by state-of-the-art (SOTA) models.

  • In the datasets, each PPI relation is labeled as either positive or negative.
  • The data contains 10-fold sets for cross validation.
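
For readers unfamiliar with the protocol, the sketch below shows how 10-fold cross-validation partitions a dataset in general. It does not reproduce the folds shipped with this data, which are already pre-split; the function `k_fold_splits` and the contiguous-index partitioning are assumptions for illustration only.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous index ranges; the first (n_samples % k) folds
    receive one extra sample so that every sample is tested exactly once.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example with 330 samples: each fold tests on 33 and trains on 297.
for fold_id, (train, test) in enumerate(k_fold_splits(330)):
    print(fold_id, len(train), len(test))
```

Each sample appears in exactly one test fold, and the reported score is typically the average over the ten folds.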

The table below shows the statistics of the five PPI benchmark corpora for the positive and negative classes.

| Corpus   | Positive | Negative |
|----------|---------:|---------:|
| AIMed    |    1,000 |    4,834 |
| BioInfer |    2,534 |    7,132 |
| HPRD50   |      163 |      270 |
| IEPA     |      335 |      482 |
| LLL      |      164 |      166 |
| TOTAL    |    4,196 |   12,884 |
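
The class balance differs considerably across the corpora, which matters when comparing scores between them. As a small sketch, the positive-class ratio per corpus can be computed from the counts above; the dictionary name `counts` and the helper `positive_ratio` are hypothetical and are not part of the released data.

```python
# (positive, negative) counts copied from the table above.
counts = {
    "AIMed": (1000, 4834),
    "BioInfer": (2534, 7132),
    "HPRD50": (163, 270),
    "IEPA": (335, 482),
    "LLL": (164, 166),
}


def positive_ratio(positive, negative):
    """Return the fraction of samples that belong to the positive class."""
    return positive / (positive + negative)


ratios = {name: positive_ratio(p, n) for name, (p, n) in counts.items()}

# Column totals, which should match the TOTAL row of the table.
total_positive = sum(p for p, _ in counts.values())
total_negative = sum(n for _, n in counts.values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
print(f"TOTAL: {positive_ratio(total_positive, total_negative):.3f}")
```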

Typed PPI data

Our PPI annotations with interaction types (enzyme, structural, or negative) are an expanded version of the five PPI benchmark corpora and the BioCreative VI protein interaction dataset (Track 4: Mining protein interactions and mutations for precision medicine (PM)).

  • The data is provided as 10-fold sets for cross-validation.
  • You can find the annotation rules and comments here.

The table below displays the corpora statistics. In all corpora, annotation was carried out within sentence boundaries, as in the five PPI benchmark corpora. The significant reduction in negative samples relative to the original data is explained in Section III-A3 of our paper (TODO: add a link).

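| Corpus         | Enzyme | Structural | Negative |
|----------------|-------:|-----------:|---------:|
| BioCreative VI |    378 |         83 |        0 |
| AIMed          |    548 |        182 |    1,371 |
| BioInfer       |    604 |      1,465 |    2,148 |
| HPRD50         |    103 |         34 |       87 |
| IEPA           |    271 |          2 |      224 |
| LLL            |    163 |          0 |        0 |
| TOTAL          |  2,067 |      1,766 |    3,830 |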

Annotation process diagram

Citation

If you use the Typed PPI data for your research, please cite the following paper.

@inproceedings{park2022extracting,
  title={Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information},
  author={Park, Gilchan and McCorkle, Sean and Soto, Carlos and Blaby, Ian and Yoo, Shinjae},
  booktitle={2022 IEEE International Conference on Big Data (Big Data)},
  pages={2052--2061},
  year={2022},
  organization={IEEE}
}