GROBID-dictionaries is a machine-learning tool that automatically encodes lexical and encyclopedic-like resources in XML-TEI.
The steps to install GROBID-dictionaries, create new models and train already existing ones to process documents can be found here.
A general model was developed for automatically encoding manuscript sale catalogs. It can be downloaded from this repository. The training data are extracted from the following catalogs and periodical issues:
- Gabriel Charavay, Revue des Autographes, first series: 25, 35, 42, 50, 60, 70, 80, 87, 95, 116, 137.
- Gabriel Charavay, Revue des Autographes, second series: 24, 56.
- Auguste Laverdet, Catalogue de lettres autographes et manuscrits: 1, 22.
- Etienne Charavay, Catalogue d’une intéressante collection de lettres autographes… (14 décembre 1908).
This general model gives satisfactory results for all types of manuscript sale catalogs likely to be processed (fixed-price or auction catalogs). However, restricting the training data used at certain levels can yield more accurate results and reduce the number of errors that need to be corrected by hand.
The data set you should train GROBID-dictionaries with depends on the type and layout of the series of documents you want to process.
If you process fixed-price catalogs whose layout is the same as that of the Revue des autographes (see below), GROBID-dictionaries should be trained with the trainingData_RDA_LAD training set.
Revue des autographes, Gabriel Charavay (Première série, N°42, December 1874)
At every level, this training set contains only data extracted from issues of the Revue des Autographes (25, 35, 50 and 80 of the first series; 24 and 56 of the second series).
If you process fixed-price catalogs whose layout is less structured than that of the Revue des autographes (see below), GROBID-dictionaries should be trained with the trainingData_OTHER_FIXED_PRICES training set.
Catalogue de lettres autographes et manuscrits, Auguste Laverdet (N°1, April 1856)
At the dictionary body segmentation level, this training set contains only data extracted from Auguste Laverdet's fixed-price catalogs (issues 1 and 22). For the subsequent levels, it contains the same data as the general model.
If you process auction catalogs that give no indication of prices, GROBID-dictionaries should be trained with the trainingData_AUCTION training set.
Catalogue d’une intéressante collection de lettres autographes…, Noël Charavay (December 14th, 1908)
At the dictionary body segmentation level, this training set contains only data extracted from a catalog published by Etienne Charavay for an auction sale that took place on December 14th, 1908. For the subsequent levels, it contains the same data as the general model.
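The selection rules above can be summarised as a small decision function. This is only an illustrative sketch: the function name `choose_training_set` and its parameters are hypothetical and are not part of GROBID-dictionaries. Only the three training set names come from this README. Returning `None` for cases the README does not cover (for instance auction catalogs that do list prices) is an assumption, meaning "fall back on the general model".

```python
from typing import Optional


def choose_training_set(kind: str,
                        rda_layout: bool = False,
                        has_prices: bool = True) -> Optional[str]:
    """Suggest a specialised training set for a series of catalogs.

    kind       -- "fixed-price" or "auction" (hypothetical labels)
    rda_layout -- True if the layout matches the Revue des autographes
    has_prices -- True if the catalog indicates prices

    Returns the training set name, or None when no specialised set is
    described (assumption: use the general model in that case).
    """
    if kind == "fixed-price":
        # Fixed-price catalogs: the choice depends on the layout only.
        if rda_layout:
            return "trainingData_RDA_LAD"
        return "trainingData_OTHER_FIXED_PRICES"
    if kind == "auction" and not has_prices:
        # Auction catalogs with no indication of prices.
        return "trainingData_AUCTION"
    # Anything else is not covered by a specialised set.
    return None


if __name__ == "__main__":
    print(choose_training_set("fixed-price", rda_layout=True))
    print(choose_training_set("auction", has_prices=False))
```

Such a helper could be used to decide which training data folder to copy into place before retraining; the retraining itself follows the protocol described in the user guide.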
The protocol is described in detail in our user guide.
You can find some PDF files to test our models with in the _example_examples folder.
GROBID-dictionaries is developed by Mohamed Khemakhem (GitHub). More info on GROBID technologies can be found here.
Regarding GROBID-dictionaries, cf. here.
Regarding the corpus: extracted data is CC-BY.
A first version of this dataset has been presented at the TEI conference. If you use these data, please cite this paper:
@inproceedings{rondeaudunoyer:hal-02272962,
AUTHOR = {Rondeau Du Noyer, Lucie and Gabay, Simon and Khemakhem, Mohamed and Romary, Laurent},
TITLE = {Scaling up Automatic Structuring of Manuscript Sales Catalogues},
ADDRESS = {Graz, Austria},
MONTH = Sep,
YEAR = {2019},
BOOKTITLE = {TEI 2019: What is text, really? TEI and beyond},
KEYWORDS = {Machine learning ; Manuscript sales catalogues ; 19th c. France},
URL = {https://hal.inria.fr/hal-02272962},
PDF = {https://hal.inria.fr/hal-02272962/file/Grobid%20Catalogues%20TEI%202019.pdf},
HAL_ID = {hal-02272962},
HAL_VERSION = {v1},
}