BERT text classification of Finnish OCR-ed texts to study the commodification of wild lingonberries. For more details, refer to our paper here.
Matti La Mela and Ekta Vats, Automatic classification of historical texts using a BERT model: News about wild berries, 1860–1910, Digital History in Sweden Conference (DH Benelux), Belgium, 1-4, 2023.
This implementation uses Simple Transformers - an NLP library based on the Transformers library by HuggingFace.
Berry corpus.
Classification of OCR-ed texts into 2 categories (binary classification):
Category 0: DESCRIPTIVE (i.e. descriptive articles)
Category 1: ECONOMIC (i.e. economic-industrial articles)
The binary division is between articles where berries or berry picking are mentioned for some contextual or descriptive reason (category 0) and articles that treat berries as an economic or industrial commodity (category 1).
For example:
A snake bites a berry-picking child => 0
Articles regarding selling berries, exports, industrial production, etc. => 1
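The labelling scheme above can be sketched in code as follows. This is only an illustration: the helper name `make_rows` is hypothetical, and the two example sentences are invented placeholders rather than texts from the berry corpus.

```python
# Category ids and names as listed in this README.
LABELS = {0: "DESCRIPTIVE", 1: "ECONOMIC"}


def make_rows(examples):
    """Turn (text, category_name) pairs into [text, label_id] rows.

    Hypothetical helper for illustration; not part of this repository.
    """
    name_to_id = {name: idx for idx, name in LABELS.items()}
    return [[text, name_to_id[name]] for text, name in examples]


# Invented placeholder snippets, not taken from the berry corpus.
rows = make_rows([
    ("A snake bit a child who was picking berries.", "DESCRIPTIVE"),
    ("Berries were sold at the market for export.", "ECONOMIC"),
])
print(rows)
```

Simple Transformers expects training data as a pandas DataFrame with a text column followed by a labels column, so rows in this shape could be wrapped with `pandas.DataFrame(rows, columns=["text", "labels"])` before training.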
Note: This program runs on a CPU by default. To process on a GPU, add CUDA support as follows.
Remove "use_cuda=False" from the ClassificationModel instance (Simple Transformers uses CUDA by default when the argument is omitted).
Install:
conda install "pytorch>=1.6" cudatoolkit=11.0 -c pytorch
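As a rough sketch of the CPU/GPU switch described above, the model could be created so that it uses the GPU only when one is visible to PyTorch. The helper names `model_kwargs` and `build_model` are hypothetical, and the model name `TurkuNLP/bert-base-finnish-cased-v1` is an assumption: this README does not say which Finnish BERT checkpoint the program actually loads.

```python
def model_kwargs(use_gpu):
    """Keyword arguments for a two-class ClassificationModel."""
    return {"num_labels": 2, "use_cuda": bool(use_gpu)}


def build_model(model_name="TurkuNLP/bert-base-finnish-cased-v1"):
    """Create a binary classifier, using CUDA only if it is available.

    The default model name is an assumption, not taken from this repository.
    """
    # Imported lazily so model_kwargs() works without these packages installed.
    import torch
    from simpletransformers.classification import ClassificationModel

    use_gpu = torch.cuda.is_available()
    return ClassificationModel("bert", model_name, **model_kwargs(use_gpu))


# CPU settings, equivalent to passing use_cuda=False explicitly.
print(model_kwargs(False))
```

Calling `build_model()` downloads the checkpoint on first use, so it is left out of the snippet's top level. Detecting the GPU at runtime avoids the error Simple Transformers raises when `use_cuda=True` is requested on a machine without CUDA.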
We are using Finnish BERT models, and more models can be explored here.
Use the search function to explore!
Ekta Vats
[email protected]