This is the implementation of Knowledge InteGrated BERT (KIG-BERT), proposed by Aisha and Xiangru for the CS 848 (Knowledge Graph) course project.
Our paper, with training and evaluation details, can be found at link-to-paper
Abstract: Recent developments in large language modeling have greatly improved the performance of NLP applications. Yet these models remain largely dependent on their training data and are thus prone to being factually inaccurate and socially biased. Correcting the models after the fact is difficult because their large size demands substantial compute and large amounts of supervised training data. This paper proposes a minimal-compute, no-pretraining framework for improving language model factual accuracy by incorporating knowledge graph information. Unlike human-written text, facts in knowledge graphs such as Wikidata are accurate and free from bias. Comparison with baselines shows that our methods have promise in making language models more factually accurate while retaining language understanding. We also build a facts dataset, constructed from template sentences and Wikidata entities, to further evaluate the proposed system.
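The framework itself is described in the paper; as a purely illustrative sketch of the general idea, the snippet below shows one way to expose knowledge-graph facts to a pretrained BERT masked language model by appending a verbalized Wikidata triple to the input. The model name, triple verbalization, and example fact are assumptions for illustration, not the KIG-BERT pipeline itself.

```python
# Hedged sketch: giving a masked-LM access to a verbalized KG triple as context.
# The triple text, model name, and masking target are illustrative assumptions,
# not the exact KIG-BERT architecture.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# A fact sentence with a masked object, plus a verbalized Wikidata triple as context.
sentence = "The capital of Canada is [MASK]."
kg_context = "Canada capital Ottawa ."  # verbalized (subject, relation, object) triple

inputs = tokenizer(sentence, kg_context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Report the top prediction for the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))  # ideally "ottawa" when the KG context helps
```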
- Linked Wikitext-2: A dataset that connects spans of text to Wikidata entities.
- Facts Dataset: A dataset consisting of fact-sentences generated from templates and Wikidata entities collected with SPARQL queries (see the sketch below).
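The exact queries and templates used for the Facts Dataset are documented in the notebook; the following is only a hedged sketch of how fact-sentences could be generated from the public Wikidata SPARQL endpoint with a simple template. The query, the chosen properties (P31, P36), and the template text are illustrative assumptions.

```python
# Hedged sketch of building fact-sentences from Wikidata, assuming the standard
# public SPARQL endpoint and a simple "capital of" template; the actual queries
# and templates used for the Facts Dataset may differ.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?countryLabel ?capitalLabel WHERE {
  ?country wdt:P31 wd:Q6256 ;        # instance of: country
           wdt:P36 ?capital .        # capital
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "KIG-BERT facts-dataset sketch"},
)
rows = response.json()["results"]["bindings"]

# Fill a template sentence with each (country, capital) pair.
TEMPLATE = "The capital of {country} is {capital}."
fact_sentences = [
    TEMPLATE.format(country=r["countryLabel"]["value"],
                    capital=r["capitalLabel"]["value"])
    for r in rows
]
print(fact_sentences[:3])
```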
All experimental results can be reproduced with the Jupyter notebook KIG-Bert.ipynb. Detailed documentation and instructions are provided in the notebook.
- A GPU is needed for training and evaluation (a quick environment check is sketched after this list).
- The Git Large File Storage (LFS) package is needed. Instructions for installing it can be found at installing-git-large-file-storage
- Python version: 3.10
- Install pip packages with `pip3 install -r requirements.txt`
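Before running the notebook, a quick check along these lines can confirm the Python version and GPU availability; it assumes PyTorch is installed via requirements.txt.

```python
# Hedged sketch: verify the environment before running KIG-Bert.ipynb.
# Assumes PyTorch is among the pinned requirements; package names may differ.
import sys
import torch

assert sys.version_info[:2] == (3, 10), "Python 3.10 is expected"
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; training and evaluation will be impractically slow.")
```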