katabase/GROBID_Dictionaries

Using GROBID-dictionaries to encode manuscript sale catalogs

GROBID-dictionaries is machine-learning software that automatically encodes lexical and encyclopedic-like resources in XML-TEI.

Instructions for installing GROBID-dictionaries, creating new models, and training existing ones to process documents can be found here.

A general model was developed for automatically encoding manuscript sale catalogs. It can be downloaded from this repository. The training data are extracted from the following catalogs and periodical issues:

Gabriel Charavay, Revue des Autographes, first series: 25, 35, 42, 50, 60, 70, 80, 87, 95, 116, 137.
Gabriel Charavay, Revue des Autographes, second series: 24, 56.
Auguste Laverdet, Catalogue de lettres autographes et manuscrits: 1, 22.
Etienne Charavay, Catalogue d’une intéressante collection de lettres autographes… (14 décembre 1908).

This general model gives satisfactory results for all types of manuscript sale catalogs likely to be processed (fixed-price or auction catalogs). However, restricting the training data used at certain levels can yield more accurate results and reduce the number of errors that must be corrected by hand.

Which training set you should train GROBID-dictionaries with depends on the type and layout of the series of documents you want to process.

When to choose the training set `trainingData_RDA_LAD`?

If you process fixed-price catalogs whose layout is the same as that of the Revue des Autographes (see below), GROBID-dictionaries should be trained with the `trainingData_RDA_LAD` training set.

Revue des Autographes, Gabriel Charavay (Première série, N°42, décembre 1874)

At every level, this training set contains data extracted only from issues of the Revue des Autographes (issues 25, 35, 50 and 80 of the first series; issues 24 and 56 of the second series).

When to choose the training set `trainingData_OTHER_FIXED_PRICES`?

If you process fixed-price catalogs whose layout is not as structured as that of the Revue des Autographes (see below), GROBID-dictionaries should be trained with the `trainingData_OTHER_FIXED_PRICES` training set.

Catalogue de lettres autographes et manuscrits, Auguste Laverdet (N°1, April 1856)

At the dictionary body segmentation level, this training set contains data extracted only from Auguste Laverdet's fixed-price catalogs (issues 1 and 22). For the following levels, it contains the same data as the general model.

When to choose the training set `trainingData_AUCTION`?

If you process auction catalogs that give no indication of prices, GROBID-dictionaries should be trained with the `trainingData_AUCTION` training set.

Catalogue d’une intéressante collection de lettres autographes…, Noël Charavay (December 14, 1908)

At the dictionary body segmentation level, this training set contains data extracted only from a catalogue published by Etienne Charavay for an auction that took place on December 14, 1908. For the following levels, it contains the same data as the general model.
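The selection rules from the three sections above can be summarised as a small decision function. This is only an illustrative sketch, not part of GROBID-dictionaries: the function name `choose_training_set`, its parameters, and the labels `"fixed_price"` and `"auction"` are hypothetical. Returning `None` to mean "fall back on the general model" is also an assumption, made here for auction catalogs that do indicate prices, a case the sections above do not cover.

```python
from typing import Optional

# Training-set folder names as they appear in this repository.
RDA_LAD = "trainingData_RDA_LAD"
OTHER_FIXED_PRICES = "trainingData_OTHER_FIXED_PRICES"
AUCTION = "trainingData_AUCTION"


def choose_training_set(catalog_type: str,
                        rda_layout: bool = False,
                        has_prices: bool = False) -> Optional[str]:
    """Suggest a training set for a series of catalogs (hypothetical helper).

    catalog_type: "fixed_price" or "auction" (labels invented for this sketch).
    rda_layout:   True if the layout matches the Revue des Autographes.
    has_prices:   True if an auction catalog indicates prices.
    Returns a folder name, or None to suggest the general model (assumption).
    """
    if catalog_type == "fixed_price":
        # Structured like the Revue des Autographes, or a looser layout.
        return RDA_LAD if rda_layout else OTHER_FIXED_PRICES
    if catalog_type == "auction":
        # The dedicated set targets auction catalogs without prices.
        return None if has_prices else AUCTION
    raise ValueError(f"unknown catalog type: {catalog_type!r}")
```

For example, `choose_training_set("fixed_price", rda_layout=True)` suggests the `trainingData_RDA_LAD` set, while `choose_training_set("auction")` suggests `trainingData_AUCTION`.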

User guide

The protocol is described in detail in our user guide.

Examples / Data to play with

You can find some PDF files for testing our models in the `_examples` folder.

Credits

GROBID-dictionaries is developed by Mohamed Khemakhem (GitHub). More info on GROBID technologies can be found here.

Licence

Regarding GROBID-dictionaries, cf. here.

Regarding the corpus: extracted data is CC-BY.

Cite this dataset

A first version of this dataset was presented at the TEI conference. If you use these data, please cite this paper:

@inproceedings{rondeaudunoyer:hal-02272962,
  AUTHOR = {Rondeau Du Noyer, Lucie and Gabay, Simon and Khemakhem, Mohamed and Romary, Laurent},
  TITLE = {Scaling up Automatic Structuring of Manuscript Sales Catalogues},
  ADDRESS = {Graz, Austria},
  MONTH = Sep,
  YEAR = {2019},
  BOOKTITLE = {TEI 2019: What is text, really? TEI and beyond},
  KEYWORDS = {Machine learning ; Manuscript sales catalogues ; 19th c. France},
  URL = {https://hal.inria.fr/hal-02272962},
  PDF = {https://hal.inria.fr/hal-02272962/file/Grobid%20Catalogues%20TEI%202019.pdf},
  HAL_ID = {hal-02272962},
  HAL_VERSION = {v1},
}
