BERT text classification for Finnish OCR texts to study commodification of wild lingon berries

ektavats/BerryBERT

BerryBERT

BERT text classification for Finnish OCR texts to study the commodification of wild lingonberries. For more details, refer to our paper:

Matti La Mela and Ekta Vats, Automatic classification of historical texts using a BERT model: News about wild berries, 1860–1910, Digital History in Sweden Conference (DH Benelux), Belgium, 1-4, 2023.

This implementation uses Simple Transformers - an NLP library based on the Transformers library by HuggingFace.

Dataset:

Berry corpus.

Classification of OCR-ed texts into 2 categories (binary classification):
Category 0: DESCRIPTIVE (i.e. descriptive articles)
Category 1: ECONOMIC (i.e. economic-industrial articles)

The binary division is between articles where berries or berry picking are mentioned for some contextual or descriptive reason (category 0) and articles that deal with berries as an economic or industrial commodity (category 1).
For example:
A snake bites a berry-picking child => 0
Articles about selling berries, exports, industrial production, etc. => 1
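
The labelling scheme above can be sketched in code. This is a minimal illustration, not the repository's actual preprocessing: the helper name `to_rows` and the second example sentence are hypothetical. The `[text, label]` row layout is chosen because Simple Transformers accepts training data as a table with a text column and an integer label column.

```python
# Category ids as described above.
LABELS = {0: "DESCRIPTIVE", 1: "ECONOMIC"}
LABEL_IDS = {name: idx for idx, name in LABELS.items()}


def to_rows(examples):
    """Turn (text, category_name) pairs into [text, label_id] rows.

    Hypothetical helper for illustration; not part of the repository.
    """
    rows = []
    for text, category in examples:
        rows.append([text.strip(), LABEL_IDS[category.upper()]])
    return rows


# The first example comes from the text above; the second is invented.
examples = [
    ("A snake bites a berry-picking child", "descriptive"),
    ("Lingonberry exports rose this autumn", "economic"),
]
rows = to_rows(examples)
```

Rows in this shape can be loaded into a pandas DataFrame with the columns `text` and `labels` before being passed to a Simple Transformers model for training.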

Prerequisite:

Install Transformers

Note: This program runs on the CPU by default. To process on a GPU, remove "use_cuda=False" from the ClassificationModel instance and install PyTorch with CUDA support:
conda install "pytorch>=1.6" cudatoolkit=11.0 -c pytorch
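
As a rough sketch of the CPU/GPU switch described in the note, the model can be built so that `use_cuda` follows whatever hardware is available. The helper names `model_kwargs` and `build_model` are hypothetical, and the model name in the usage comment is an assumption (one publicly available Finnish BERT checkpoint), not necessarily the one used in this repository. The heavy imports are placed inside `build_model` so the configuration logic can be read and checked without PyTorch installed.

```python
def model_kwargs(gpu_available):
    """Keyword arguments for ClassificationModel (binary classification).

    Hypothetical helper: use_cuda is True only when a GPU is available.
    """
    return {"num_labels": 2, "use_cuda": bool(gpu_available)}


def build_model(model_name):
    """Create a BERT classifier, using the GPU when PyTorch can see one."""
    # Imported here so the rest of the sketch works without these packages.
    import torch
    from simpletransformers.classification import ClassificationModel

    kwargs = model_kwargs(torch.cuda.is_available())
    return ClassificationModel("bert", model_name, **kwargs)


# Usage (assumed model name; downloads weights on first call):
# model = build_model("TurkuNLP/bert-base-finnish-cased-v1")
```

Simple Transformers enables CUDA by default, which is why removing `use_cuda=False` is enough to switch to the GPU; passing the flag explicitly, as above, avoids an error on machines without one.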

BERT models:

We are using Finnish BERT models, and more models can be explored here.
Use the search function to explore!

Contact:

Ekta Vats
[email protected]
