This repository contains digitised manuscript sale catalogues encoded in XML-TEI at level 1.
The data have not been cleaned (level 2) or post-processed (level 3).
Basic bibliographic information for each catalogue is available here.
You can find the ODD that validates the encoding in the Data_extraction repository (folder _schemas).
The creation process is described in detail in the following repo.
Catalogue entries look like the following:
<item n="80" xml:id="CAT_000146_e80">
<num>80</num>
<name type="author">Cherubini (L.),</name>
<trait>
<p>l'illustre compositeur</p>
</trait>
<desc>L. a s.; 1836, 1 p 1 /2 in8.</desc>
<measure commodity="currency" unit="FRF" quantity="12">12</measure>
</item>
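As a rough illustration of how such an entry can be read programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the `<item>` is parsed on its own, without the TEI namespace that a complete file would normally declare; the function name `read_entry` and the keys of the returned dictionary are hypothetical and are not part of this repository.

```python
import xml.etree.ElementTree as ET

# The sample entry shown above, parsed as a standalone fragment (assumption:
# no TEI default namespace; in a full file, tag names would be namespaced).
ENTRY = """<item n="80" xml:id="CAT_000146_e80">
<num>80</num>
<name type="author">Cherubini (L.),</name>
<trait>
<p>l'illustre compositeur</p>
</trait>
<desc>L. a s.; 1836, 1 p 1 /2 in8.</desc>
<measure commodity="currency" unit="FRF" quantity="12">12</measure>
</item>"""

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_entry(xml_text):
    """Return the main fields of one catalogue entry as a dictionary."""
    item = ET.fromstring(xml_text)
    measure = item.find("measure")
    return {
        "id": item.get(XML_ID),
        "author": item.findtext("name"),
        "desc": item.findtext("desc"),
        "price": float(measure.get("quantity")) if measure is not None else None,
        "currency": measure.get("unit") if measure is not None else None,
    }


entry = read_entry(ENTRY)
print(entry["id"], entry["price"], entry["currency"])
```

The `desc` value is returned untouched, typos included, since the data at this level have not been cleaned.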
Most of the reconciliation process uses data from the <desc> element of our XML files. We therefore need to correct typos to ease further post-processing, e.g.
* L. a s. -> L. a. s.
* in8 -> in-8
* 1 /2 -> 1/2
* 1 p -> 1 p.
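The four corrections listed above can be sketched with a few regular expressions. This is only an illustration of the idea, not the code of clean_xml.py: the rule list `RULES` and the function `clean_desc` are hypothetical names, and the patterns cover only the four examples given here.

```python
import re

# Each rule is (compiled pattern, replacement). These patterns are assumptions
# written for the four examples above; real catalogues will need more rules.
RULES = [
    (re.compile(r"\bL\. a s\."), "L. a. s."),        # L. a s. -> L. a. s.
    (re.compile(r"(\d)\s*/\s*(\d)"), r"\1/\2"),      # 1 /2    -> 1/2
    (re.compile(r"\bin(\d+)\b"), r"in-\1"),          # in8     -> in-8
    (re.compile(r"(\d+) p(?![\w.])"), r"\1 p."),     # 1 p     -> 1 p.
]


def clean_desc(text):
    """Apply every correction rule, in order, to the text of a <desc> element."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


cleaned = clean_desc("L. a s.; 1836, 1 p 1 /2 in8.")
print(cleaned)  # L. a. s.; 1836, 1 p. 1/2 in-8.
```

The negative lookahead in the last rule keeps it from adding a second full stop to a page count that is already abbreviated correctly, so running `clean_desc` twice gives the same result as running it once.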
The clean_xml.py script available here tackles this problem.
* git clone https://github.com/katabase/1_OutputData.git
* cd 1_OutputData
* python3 -m venv my_env
* source my_env/bin/activate
* pip install -r requirements.txt
* python script/clean_xml.py -f FILENAME (processes a single file)
OR
* python script/clean_xml.py -d DIRECTORY (processes all the files contained in a directory)
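For readers curious how the two mutually exclusive options might be handled, here is a minimal sketch with `argparse`. It is not taken from clean_xml.py: only the `-f` and `-d` flags come from the commands above, while `build_parser`, `files_to_process` and the choice to look for `*.xml` files in the directory are assumptions made for the example.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that requires exactly one of -f FILENAME or -d DIRECTORY."""
    parser = argparse.ArgumentParser(description="Clean catalogue XML files.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-f", dest="file", metavar="FILENAME", type=Path,
                       help="process a single file")
    group.add_argument("-d", dest="directory", metavar="DIRECTORY", type=Path,
                       help="process all the files contained in a directory")
    return parser


def files_to_process(args):
    """Return the list of paths selected by the parsed arguments."""
    if args.file is not None:
        return [args.file]
    # Assumption: only files ending in .xml at the top of the directory count.
    return sorted(args.directory.glob("*.xml"))


args = build_parser().parse_args(["-f", "CAT_000146.xml"])
print(files_to_process(args))
```

Using a mutually exclusive group means that passing both flags, or neither, makes the parser exit with a usage message rather than guess which input was meant.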
- The ODD was created by Lucie Rondeau du Noyer.
- clean_xml.py was created by Simon Gabay.
- The catalogues were encoded by Lucie Rondeau du Noyer, Simon Gabay, Matthias Gille Levenson, Ljudmila Petkovic and Alexandre Bartz.
Alexandre Bartz, Simon Gabay, Matthias Gille Levenson, Ljudmila Petkovic and Lucie Rondeau du Noyer, Manuscript sale catalogues, Neuchâtel: Université de Neuchâtel, 2019, https://github.com/katabase/1_OutputData.
The catalogues are licensed under the Creative Commons Attribution 4.0 International Licence and the code is licensed under GNU GPL-3.0.