
This repository contains everything we need for the data extraction.


katabase/Data_extraction


katabase/GROBID_Dictionaries


Using GROBID-dictionaries to encode manuscript sale catalogs

GROBID-dictionaries is machine-learning software that automatically encodes lexical and encyclopedic-like resources in XML-TEI.

The steps to install GROBID-dictionaries, create new models, and train existing ones to process documents can be found here.

A general model was developed for automatically encoding manuscript sale catalogs. It can be downloaded from this repository. The training data were extracted from the following catalogs and periodical issues:

  • Gabriel Charavay, Revue des Autographes, first series: 25, 35, 42, 50, 60, 70, 80, 87, 95, 116, 137.
  • Gabriel Charavay, Revue des Autographes, second series: 24, 56.
  • Auguste Laverdet, Catalogue de lettres autographes et manuscrits: 1, 22.
  • Etienne Charavay, Catalogue d’une intéressante collection de lettres autographes… (14 décembre 1908).

This general model gives satisfactory results for all types of manuscript sale catalogs likely to be processed (fixed-price or auction catalogs). However, restricting the training data used at certain levels can yield more accurate results and reduce the number of errors that must be corrected by hand.

Which training set to use depends on the type and layout of the series of documents you want to process.

When to choose the trainingData_RDA_LAD training set?

If you process fixed-price catalogs whose layout is the same as that of the Revue des autographes (see below), GROBID-dictionaries should be trained with the trainingData_RDA_LAD training set.

Revue des autographes, Gabriel Charavay (first series, N°42, December 1874)


At every level, this training set contains only data extracted from issues of the Revue des Autographes (25, 35, 50, 80 of the first series; 24, 56 of the second series).

When to choose the trainingData_OTHER_FIXED_PRICES training set?

If you process fixed-price catalogs whose layout is not as structured as that of the Revue des autographes (see below), GROBID-dictionaries should be trained with the trainingData_OTHER_FIXED_PRICES training set.

Catalogue de lettres autographes et manuscrits, Auguste Laverdet (N°1, April 1856)


At the dictionary body segmentation level, this training set contains only data extracted from Auguste Laverdet's fixed-price catalogs (issues 1 and 22). For the following levels, it contains the same data as the general model.

When to choose the trainingData_AUCTION training set?

If you process auction catalogs that give no indication of prices, GROBID-dictionaries should be trained with the trainingData_AUCTION training set.

Catalogue d’une intéressante collection de lettres autographes…, Noël Charavay (December 14th, 1908)


At the dictionary body segmentation level, this training set contains only data extracted from a catalog published by Etienne Charavay for an auction sale that took place on December 14th, 1908. For the following levels, it contains the same data as the general model.
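
The selection rules above can be summarized as a small decision function. This is only an illustrative sketch, not part of GROBID-dictionaries or of this repository: the function name, the enum, and the parameters are hypothetical, and only the three training set names come from this README. Returning None here stands for "keep the general model", which is an assumption about how to treat cases the rules do not cover.

```python
from enum import Enum
from typing import Optional


class CatalogType(Enum):
    """Hypothetical labels for the two catalog types discussed in this README."""
    FIXED_PRICE = "fixed-price"
    AUCTION = "auction"


def choose_training_set(catalog_type: CatalogType,
                        rda_like_layout: bool = False,
                        has_prices: bool = True) -> Optional[str]:
    """Return the training set name suggested by the rules above.

    Hypothetical helper: only the returned names come from this README.
    None means no specialized set applies, so the general model is kept.
    """
    if catalog_type is CatalogType.FIXED_PRICE:
        # Same layout as the Revue des autographes: use the RDA set.
        if rda_like_layout:
            return "trainingData_RDA_LAD"
        # Less structured fixed-price catalogs (e.g. Laverdet).
        return "trainingData_OTHER_FIXED_PRICES"
    if catalog_type is CatalogType.AUCTION and not has_prices:
        # Auction catalogs with no indication of prices.
        return "trainingData_AUCTION"
    # Anything else: fall back to the general model (assumption).
    return None


if __name__ == "__main__":
    print(choose_training_set(CatalogType.FIXED_PRICE, rda_like_layout=True))
    print(choose_training_set(CatalogType.AUCTION, has_prices=False))
```

For example, a loosely structured fixed-price catalog would map to trainingData_OTHER_FIXED_PRICES, while an auction catalog that does print prices falls outside the three specialized sets.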

User guide

The protocol is described in detail in our user guide.

Examples / Data to play with

You can find some PDFs to test our models in the _example_examples folder.

Credits

GROBID-dictionaries is developed by Mohamed Khemakhem (GitHub). More info on GROBID technologies can be found here.

Licence

Regarding GROBID-dictionaries, cf. here.

Regarding the corpus: extracted data is CC-BY.


Cite this dataset

A first version of this dataset was presented at the TEI conference. If you use these data, please cite this paper:

@inproceedings{rondeaudunoyer:hal-02272962,
  AUTHOR = {Rondeau Du Noyer, Lucie and Gabay, Simon and Khemakhem, Mohamed and Romary, Laurent},
  TITLE = {Scaling up Automatic Structuring of Manuscript Sales Catalogues},
  ADDRESS = {Graz, Austria},
  MONTH = Sep,
  YEAR = {2019},
  BOOKTITLE = {TEI 2019: What is text, really? TEI and beyond},
  KEYWORDS = {Machine learning ; Manuscript sales catalogues ; 19th c. France},
  URL = {https://hal.inria.fr/hal-02272962},
  PDF = {https://hal.inria.fr/hal-02272962/file/Grobid%20Catalogues%20TEI%202019.pdf},
  HAL_ID = {hal-02272962},
  HAL_VERSION = {v1},
}
