
pyvers

Python package for data processing and training of claim verification models. This package was developed as part of an ML engineering capstone project.

Claim verification is a task in natural language processing (NLP) with applications ranging from fact-checking to verifying the accuracy of scientific citations. The models used in this package are based on the transformer deep-learning architecture.

Features

  • Data modules for loading claim verification datasets: SciFact, Citation-Integrity, Hugging Face NLI datasets, and a built-in toy dataset
  • Fine-tuning of transformer models with PyTorch Lightning
  • Zero-shot classification of claim-evidence pairs with pretrained NLI models

Installation

Run these commands in the root directory of the repository.

  • The first command installs the requirements.
  • The second command installs the pyvers package in development mode.
    • Remove the -e for a standard installation.
pip install -r requirements.txt
pip install -e .
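
To verify that the installation worked (an optional sanity check, not part of the original instructions), try importing the package:

python -c "import pyvers; print(pyvers.__name__)"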

Loading data

pyvers.data.FileDataModule

  • This class loads data from local files in JSON Lines (jsonl) format.
  • Supported datasets include SciFact and Citation-Integrity.
  • The schema for the data files is described here.
  • Get data files for SciFact and Citation-Integrity with labels used in pyvers here.
  • The data module can also shuffle together training data from multiple datasets, as shown in the example below.
from pyvers.data import FileDataModule
# Set the model used for the tokenizer
model_name = "bert-base-uncased"

# Load data from one dataset
dm = FileDataModule("data/scifact", model_name)

# Shuffle training data from two datasets
dm = FileDataModule(["data/scifact", "data/citint"], model_name)

# Get some tokenized data
dm.setup("fit")
next(iter(dm.train_dataloader()))
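
To see what a tokenized batch contains, a sketch like the following can help. It assumes batches are dictionaries of tensors with standard Hugging Face fields such as input_ids and attention_mask; the exact structure returned by FileDataModule may differ.

# Print the name and shape of each tensor in one training batch
# (assumes dict-like batches of tensors)
batch = next(iter(dm.train_dataloader()))
for key, value in batch.items():
    print(key, tuple(value.shape))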

pyvers.data.NLIDataModule

  • This class loads natural language inference (NLI) datasets from Hugging Face, such as facebook/anli.

from pyvers.data import NLIDataModule
model_name = "bert-base-uncased"

# Load data from HuggingFace datasets
dm = NLIDataModule("facebook/anli", model_name)

# Get some tokenized data
dm.prepare_data()
dm.setup("fit")
next(iter(dm.train_dataloader()))

pyvers.data.ToyDataModule

  • This is a small, handmade toy dataset of claim-evidence pairs.
  • There are no data files; the dataset is hard-coded in the class definition.
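
A minimal sketch of loading it, assuming it follows the same setup/dataloader interface as the other data modules (see the examples above and below):

from pyvers.data import ToyDataModule

# No data files needed; the examples are built into the class
dm = ToyDataModule("bert-base-uncased")
dm.setup("fit")
print(next(iter(dm.train_dataloader())))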

Fine-tuning example

This takes about a minute on a CPU.

# Import required modules
import pytorch_lightning as pl
from pyvers.data import ToyDataModule
from pyvers.model import PyversClassifier

# Initialize data and model
dm = ToyDataModule("bert-base-uncased")
model = PyversClassifier(dm.model_name)

# Train model
trainer = pl.Trainer(enable_checkpointing=False, max_epochs=20)
trainer.fit(model, datamodule=dm)

# Test model
trainer.test(model, datamodule=dm)

# Show predictions
predictions = trainer.predict(model, datamodule=dm)
print(predictions)

This is what we get (results vary between runs):

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        AUROC Macro        │          0.963            │
│      AUROC Weighted       │          0.963            │
│         Accuracy          │           88.9            │
│         F1 Macro          │           88.6            │
│         F1 Micro          │           88.9            │
│          F1_NEI           │          100.0            │
│         F1_REFUTE         │           80.0            │
│        F1_SUPPORT         │           85.7            │
└───────────────────────────┴───────────────────────────┘

[['SUPPORT', 'SUPPORT', 'SUPPORT', 'NEI', 'NEI', 'NEI', 'REFUTE', 'REFUTE', 'SUPPORT']]

# Ground-truth labels are:
# [['SUPPORT', 'SUPPORT', 'SUPPORT', 'NEI', 'NEI', 'NEI', 'REFUTE', 'REFUTE', 'REFUTE']]
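
Checking the predictions against the ground truth by hand, 8 of the 9 toy examples are classified correctly, which matches the 88.9 accuracy reported above. A small sketch of that comparison, with the label lists copied from the output:

# Compare predicted labels with ground truth (lists copied from above)
predicted = ['SUPPORT', 'SUPPORT', 'SUPPORT', 'NEI', 'NEI', 'NEI', 'REFUTE', 'REFUTE', 'SUPPORT']
truth = ['SUPPORT', 'SUPPORT', 'SUPPORT', 'NEI', 'NEI', 'NEI', 'REFUTE', 'REFUTE', 'REFUTE']
matches = sum(p == t for p, t in zip(predicted, truth))
print(f"{matches}/{len(truth)} correct = {100 * matches / len(truth):.1f}% accuracy")  # 8/9 = 88.9%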

Zero-shot example

This uses a DeBERTa model trained on MultiNLI, Fever-NLI and Adversarial-NLI (ANLI) for zero-shot classification of claim-evidence pairs.

import pytorch_lightning as pl
from pyvers.model import PyversClassifier
from pyvers.data import ToyDataModule
dm = ToyDataModule("MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")
model = PyversClassifier(dm.model_name)
trainer = pl.Trainer()
dm.setup(stage="test")
predictions = trainer.predict(model, datamodule=dm)
print(predictions)
# [['SUPPORT', 'SUPPORT', 'SUPPORT', 'REFUTE', 'REFUTE', 'REFUTE', 'REFUTE', 'REFUTE', 'REFUTE']]

The pretrained model successfully distinguishes between SUPPORT and REFUTE on the toy dataset but misclassifies NEI as REFUTE. This can be improved with fine-tuning.
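
Fine-tuning the zero-shot model on the toy data follows the same pattern as the fine-tuning example above (a sketch only; the epoch count is an arbitrary choice here):

# Fine-tune the pretrained model so it learns the NEI class
trainer = pl.Trainer(enable_checkpointing=False, max_epochs=20)
trainer.fit(model, datamodule=dm)
print(trainer.predict(model, datamodule=dm))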

When using a pretrained model for zero-shot classification, check the model's mapping between labels and IDs.

from transformers import AutoConfig

model_name = "answerdotai/ModernBERT-base"
config = AutoConfig.from_pretrained(model_name, num_labels=3)
print(config.to_dict()["id2label"])
# {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
config = AutoConfig.from_pretrained(model_name, num_labels=3)
print(config.to_dict()["id2label"])
# {0: 'entailment', 1: 'neutral', 2: 'contradiction'}

Because the pretrained DeBERTa model uses labels consistent with the NLI categories listed below, we would choose it rather than ModernBERT for zero-shot classification; ModernBERT's config contains only placeholder labels. However, fine-tuning either model for text classification should work (see this page for information on fine-tuning ModernBERT).

Label to ID mapping

ID   pyvers    Fever*            MultiNLI, ANLI
0    SUPPORT   SUPPORTS          entailment
1    NEI       NOT ENOUGH INFO   neutral
2    REFUTE    REFUTES           contradiction

* Text labels only
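
Reading the table as code, output from a model that emits the NLI label names can be translated to pyvers labels with a simple dictionary (an illustrative mapping, not part of the pyvers API):

# Translate NLI-style labels to pyvers labels, following the table above
NLI_TO_PYVERS = {
    "entailment": "SUPPORT",
    "neutral": "NEI",
    "contradiction": "REFUTE",
}
print(NLI_TO_PYVERS["entailment"])  # SUPPORT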
