pyvers is a Python package for data processing and training of claim verification models. This package was developed as part of an ML engineering capstone project.
Claim verification is a task in natural language processing (NLP) with applications ranging from fact-checking to verifying the accuracy of scientific citations. The models used in this package are based on the transformer deep-learning architecture.
- Data Modules
  - Support for local files and HuggingFace datasets.
  - Consistent label encoding for different natural language inference (NLI) datasets (see below).
  - Supports shuffling training data from multiple datasets for improved model generalization.
- Trainer
  - Training and data modules implemented with PyTorch Lightning.
  - Use any pretrained sequence classification model from HuggingFace.
  - Logger is configured to plot training and validation loss on the same graph in TensorBoard.
Run these commands in the root directory of the repository:

```sh
pip install -r requirements.txt
pip install -e .
```

- The first command installs the requirements.
- The second command installs the pyvers package in development mode. Remove the `-e` for a standard installation.
- The `FileDataModule` class loads data from local data files in JSON Lines (jsonl) format.
- Supported datasets include SciFact and Citation-Integrity.
- The schema for the data files is described here; an illustrative record is sketched below.
- Get data files for SciFact and Citation-Integrity with the labels used in pyvers here.
- The data module can be used to shuffle training data from both datasets.
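Each line of a jsonl file holds one claim-evidence record. For illustration, a record might look like this (the field names here are hypothetical, not the package's actual schema; see the linked description for the real format):

```python
import json

# Hypothetical claim-evidence record (field names are illustrative;
# the actual schema is described in the link above)
record = {
    "claim": "Mitochondria are the powerhouse of the cell.",
    "evidence": "Mitochondria generate most of the cell's supply of ATP.",
    "label": "SUPPORT",  # one of SUPPORT, NEI, REFUTE (see the label table below)
}
print(json.dumps(record))
```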
```python
from pyvers.data import FileDataModule

# Set the model used for the tokenizer
model_name = "bert-base-uncased"

# Load data from one dataset
dm = FileDataModule("data/scifact", model_name)

# Shuffle training data from two datasets
dm = FileDataModule(["data/scifact", "data/citint"], model_name)

# Get some tokenized data
dm.setup("fit")
next(iter(dm.train_dataloader()))
```
- The `NLIDataModule` class loads data from selected HuggingFace datasets.
- Supported datasets are copenlu/fever_gold_evidence, facebook/anli, and nyu-mll/multi_nli.
```python
from pyvers.data import NLIDataModule

model_name = "bert-base-uncased"

# Load data from HuggingFace datasets
dm = NLIDataModule("facebook/anli", model_name)

# Get some tokenized data
dm.prepare_data()
dm.setup("fit")
next(iter(dm.train_dataloader()))
```
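The other supported datasets load the same way; for example, `NLIDataModule("nyu-mll/multi_nli", model_name)` uses MultiNLI instead of ANLI.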
- The `ToyDataModule` class provides a small handmade toy dataset.
- There are no data files; the dataset is hard-coded in the class definition.

The following example trains a classifier on the toy dataset. This takes about a minute on a CPU.
```python
# Import required modules
import pytorch_lightning as pl
from pyvers.data import ToyDataModule
from pyvers.model import PyversClassifier

# Initialize data and model
dm = ToyDataModule("bert-base-uncased")
model = PyversClassifier(dm.model_name)

# Train model
trainer = pl.Trainer(enable_checkpointing=False, max_epochs=20)
trainer.fit(model, datamodule=dm)

# Test model
trainer.test(model, datamodule=dm)

# Show predictions
predictions = trainer.predict(model, datamodule=dm)
print(predictions)
```
This is what we get (results vary between runs):
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        AUROC Macro        │           0.963           │
│      AUROC Weighted       │           0.963           │
│         Accuracy          │           88.9            │
│         F1 Macro          │           88.6            │
│         F1 Micro          │           88.9            │
│          F1_NEI           │           100.0           │
│         F1_REFUTE         │           80.0            │
│        F1_SUPPORT         │           85.7            │
└───────────────────────────┴───────────────────────────┘
```

```
[['SUPPORT', 'SUPPORT', 'SUPPORT', 'NEI', 'NEI', 'NEI', 'REFUTE', 'REFUTE', 'SUPPORT']]
# Ground-truth labels are:
# [['SUPPORT', 'SUPPORT', 'SUPPORT', 'NEI', 'NEI', 'NEI', 'REFUTE', 'REFUTE', 'REFUTE']]
```
The following example uses a DeBERTa model trained on MultiNLI, Fever-NLI, and Adversarial-NLI (ANLI) for zero-shot classification of claim-evidence pairs.
```python
import pytorch_lightning as pl
from pyvers.model import PyversClassifier
from pyvers.data import ToyDataModule

dm = ToyDataModule("MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")
model = PyversClassifier(dm.model_name)
trainer = pl.Trainer()

dm.setup(stage="test")
predictions = trainer.predict(model, datamodule=dm)
print(predictions)
# [['SUPPORT', 'SUPPORT', 'SUPPORT', 'REFUTE', 'REFUTE', 'REFUTE', 'REFUTE', 'REFUTE', 'REFUTE']]
```
The pretrained model successfully distinguishes between SUPPORT and REFUTE on the toy dataset but misclassifies NEI as REFUTE. This can be improved with fine-tuning.
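A minimal sketch of that fine-tuning step, reusing the training loop from the toy example above (the epoch count here is an arbitrary choice, not a tuned setting):

```python
import pytorch_lightning as pl
from pyvers.data import ToyDataModule
from pyvers.model import PyversClassifier

# Fine-tune the pretrained NLI checkpoint on the toy dataset so it
# learns the NEI class (max_epochs is an arbitrary choice)
dm = ToyDataModule("MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")
model = PyversClassifier(dm.model_name)
trainer = pl.Trainer(enable_checkpointing=False, max_epochs=5)
trainer.fit(model, datamodule=dm)

# Check predictions again after fine-tuning
print(trainer.predict(model, datamodule=dm))
```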
When using a pretrained model for zero-shot classification, check the mapping between labels and IDs:
```python
from transformers import AutoConfig

model_name = "answerdotai/ModernBERT-base"
config = AutoConfig.from_pretrained(model_name, num_labels=3)
print(config.to_dict()["id2label"])
# {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
config = AutoConfig.from_pretrained(model_name, num_labels=3)
print(config.to_dict()["id2label"])
# {0: 'entailment', 1: 'neutral', 2: 'contradiction'}
```
Because its labels are consistent with the NLI categories listed below, we would choose the pretrained DeBERTa model rather than ModernBERT for zero-shot classification. However, fine-tuning either model for text classification should work (see this page for information on fine-tuning ModernBERT).
| ID | pyvers | Fever* | MultiNLI, ANLI |
|----|--------|--------|----------------|
| 0 | SUPPORT | SUPPORTS | entailment |
| 1 | NEI | NOT ENOUGH INFO | neutral |
| 2 | REFUTE | REFUTES | contradiction |

\* Text labels only
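In code, the pyvers column of this table corresponds to a mapping like the following (an illustrative sketch; the variable names are not necessarily those used in the package):

```python
# Label-ID mapping used by pyvers, per the table above
# (variable names here are illustrative, not the package's own)
id2label = {0: "SUPPORT", 1: "NEI", 2: "REFUTE"}
label2id = {v: k for k, v in id2label.items()}
```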