BerryBERT

BERT text classification for Finnish OCR texts, used to study the commodification of wild lingonberries. For more details, refer to our paper:

Matti La Mela and Ekta Vats, Automatic classification of historical texts using a BERT model: News about wild berries, 1860–1910, Digital History in Sweden Conference (DH Benelux), Belgium, 1-4, 2023.

This implementation uses Simple Transformers - an NLP library based on the Transformers library by HuggingFace.

Dataset:

Berry corpus.

Classification of OCR-ed texts into 2 categories (binary classification):
Category 0: DESCRIPTIVE (i.e. descriptive articles)
Category 1: ECONOMIC (i.e. economic-industrial articles)

The binary division separates articles where berries or berry picking are mentioned only for some contextual or descriptive reason (category 0) from articles that treat berries as a commodity (category 1).
For example:
A snake bites a berry-picking child => 0
Articles regarding selling berries, exports, industrial production, etc. => 1
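The label scheme above can be sketched in a few lines of Python. This is only an illustration: the column names `text` and `labels` follow the convention Simple Transformers expects for classification data, but the actual header of `berries_class_binary.csv` is an assumption here, and the two sample rows are invented for the example.

```python
import csv
import io

# Label ids as described above.
LABELS = {0: "DESCRIPTIVE", 1: "ECONOMIC"}

# Invented two-row sample; the real column names and contents of
# berries_class_binary.csv may differ (assumption).
SAMPLE_CSV = """text,labels
"A snake bites a berry-picking child",0
"Berries are sold for export",1
"""


def load_rows(handle):
    """Read (text, label_id) pairs from a CSV file object.

    Expects 'text' and 'labels' columns and rejects unknown label ids.
    """
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["labels"])
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label}")
        rows.append((record["text"], label))
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
for text, label in rows:
    print(f"{LABELS[label]:<12} {text}")
```

To try the same check on the real file, pass an open file handle instead of the `io.StringIO` sample, after adjusting the column names if they differ.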

Prerequisite:

Install Transformers

Note: This program runs on a CPU by default. To process on a GPU, remove "use_cuda=False" from the ClassificationModel instance and install PyTorch with CUDA support.
Install (the version specifier is quoted so the shell does not treat ">" as a redirect):
conda install "pytorch>=1.6" cudatoolkit=11.0 -c pytorch
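As a rough sketch of the CPU/GPU switch described above, the helper below builds the keyword arguments for a Simple Transformers `ClassificationModel`. The function names `build_model_kwargs` and `make_model` are hypothetical, and the notebook may construct the model differently. The model name is left as a parameter; `TurkuNLP/bert-base-finnish-cased-v1` is one Finnish BERT checkpoint on the HuggingFace hub, mentioned only as an example and not necessarily the one used in this project.

```python
def build_model_kwargs(use_gpu=False):
    """Keyword arguments for ClassificationModel (binary classification).

    Passing use_cuda=False forces CPU; omitting it leaves the library
    default, which uses CUDA.
    """
    kwargs = {"num_labels": 2}
    if not use_gpu:
        kwargs["use_cuda"] = False
    return kwargs


def make_model(model_name, use_gpu=False):
    """Create a BERT classifier; requires simpletransformers to be installed."""
    # Imported lazily so build_model_kwargs works without the library.
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel("bert", model_name, **build_model_kwargs(use_gpu))


print(build_model_kwargs())              # CPU configuration
print(build_model_kwargs(use_gpu=True))  # GPU configuration
```

Calling `make_model("TurkuNLP/bert-base-finnish-cased-v1")` would download the checkpoint on first use, so it is left out of the runnable part of the sketch.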

BERT models:

We are using Finnish BERT models, and more models can be explored here.
Use the search function to explore!

Contact:

Ekta Vats
[email protected]
