SemEval-2025 Task 5 calls for using LLM capabilities to apply controlled subject labels to record descriptions in the multilingual library collection of the German National Library of Science and Technology (TIB).
The multilingual BERT ensemble system described herein produces GND subject labels for various record types, including articles, books, conference papers, reports, and theses. Input a title and abstract in German or English to generate GND subject labels.
The AutoTrain Advanced software package was used to train BERT models for GND classification based on examples from the TIB "All Subjects" dataset. A curated set of that data, split into training and validation sets, is available from Hugging Face.
The testing code was developed to determine which combination of models produced the highest scores, using 1,000 rows of held-out data as the gold standard.
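A model-subset comparison of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual testing code: the function names (`ensemble_labels`, `micro_f1`, `best_model_subset`), the choice of micro-averaged F1 as the score, the summing of confidences, and the exhaustive search over subsets are all hypothetical. It assumes each model's predictions for the held-out rows have already been computed and stored as label-to-score dictionaries.

```python
from collections import defaultdict
from itertools import combinations


def ensemble_labels(row_predictions, top_n):
    """Sum scores across models for one record and return the top_n labels.

    row_predictions: list of {label: score} dicts, one per model (assumed shape).
    """
    totals = defaultdict(float)
    for predictions in row_predictions:
        for label, score in predictions.items():
            totals[label] += score
    # Sort by descending total score; break ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda label: (-totals[label], label))
    return set(ranked[:top_n])


def micro_f1(predicted, gold):
    """Micro-averaged F1 over parallel lists of predicted and gold label sets."""
    true_positives = sum(len(p & g) for p, g in zip(predicted, gold))
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(len(p) for p in predicted)
    recall = true_positives / sum(len(g) for g in gold)
    return 2 * precision * recall / (precision + recall)


def best_model_subset(model_outputs, gold, top_n=5):
    """Try every non-empty subset of models and return (subset, score) of the best.

    model_outputs: {model_name: [{label: score}, ...]} with one dict per held-out row.
    gold: list of sets of gold labels, one set per held-out row.
    """
    best_subset, best_score = None, -1.0
    names = sorted(model_outputs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            predicted = [
                ensemble_labels([model_outputs[name][i] for name in subset], top_n)
                for i in range(len(gold))
            ]
            score = micro_f1(predicted, gold)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

Exhaustive search is only practical for a handful of models, since the number of subsets doubles with each model added.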
The inference code generates labels and aggregates their confidence scores across models, so the BERT models work as an ensemble.
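One way to aggregate label confidence scores across models is shown below. This is a minimal sketch, not the project's inference code: the function name `aggregate_predictions` is hypothetical, summing scores is an assumed aggregation rule (the description above does not say which rule is used), and the input shape mirrors the list of `{"label": ..., "score": ...}` dictionaries that a Hugging Face text-classification pipeline returns for one input.

```python
from collections import defaultdict


def aggregate_predictions(per_model_predictions, top_n=5):
    """Combine per-model label scores into one ranked list.

    per_model_predictions: one list per model, each containing dicts with
    "label" and "score" keys (assumed shape).
    Returns up to top_n (label, total_score) pairs, highest total first.
    """
    totals = defaultdict(float)
    for predictions in per_model_predictions:
        for prediction in predictions:
            # Assumption: confidences are summed, so a label proposed by
            # several models outranks one proposed by a single model.
            totals[prediction["label"]] += prediction["score"]
    # Sort by descending total; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Placeholder labels and scores for illustration only.
model_a = [{"label": "label_a", "score": 0.6}, {"label": "label_b", "score": 0.3}]
model_b = [{"label": "label_b", "score": 0.5}, {"label": "label_c", "score": 0.2}]
combined = aggregate_predictions([model_a, model_b], top_n=2)
```

Here `label_b` ranks first because both models propose it, even though neither gives it the highest individual score.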
Jim was assisted by GitHub Copilot in developing the inference and testing code.