Cleaned Data - level 2

This repository contains digitised manuscript sale catalogs encoded in XML-TEI at level 2.

The data have been cleaned (level 2) but not post-processed (level 3) yet.

Schema

You can find the ODD that validates the encoding in the repository Data_extraction (folder _schemas).

Workflow

Once the data have been cleaned, we can start to extract information from the desc element of each item.

extractor_xml.py extracts this information and writes it back into the same XML file as markup (level 3).

The script transforms this

<item n="80" xml:id="CAT_000146_e80">
   <num>80</num>
   <name type="author">Cherubini (L.),</name>
   <trait>
      <p>l'illustre compositeur</p>
   </trait>
   <desc>L. a. s.; 1836, 1 p. in-8.</desc>
   <measure commodity="currency" unit="FRF" quantity="12">12</measure>
</item>

into

<item n="80" xml:id="CAT_000146_e80">
   <num>80</num>
   <name type="author">Cherubini (L.),</name>
   <trait>
      <p>l'illustre compositeur</p>
   </trait>
   <desc>
      <term>L. a. s.</term>;<date>1836</date>,
      <measure type="length" unit="p" n="1">1 p.</measure>
      <measure unit="f" type="format" n="8">in-8</measure>.
      <measure commodity="currency" unit="FRF" quantity="12">12</measure>
   </desc>
</item>

To carry out this task, we use extractor_xml.py [available here].
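
The transformation shown above can be illustrated with a minimal sketch. This is not the code of extractor_xml.py: the regular expression, the function name annotate_desc and the decision to handle only the four pieces seen in the example (term, date, number of pages, format) are assumptions made for illustration. It also assumes that desc contains plain text only and that the price is a direct child of item.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: it only covers the pieces seen in the example above.
PATTERN = re.compile(
    r"(?P<term>L\. a\. s\.)"
    r"|(?P<date>\b1[0-9]{3}\b)"
    r"|(?P<length>(?P<pages>\d+) p\.)"
    r"|(?P<format>in-(?P<size>\d+))"
)


def annotate_desc(item):
    """Wrap the recognised pieces of the <desc> text of one <item> in elements."""
    desc = item.find("desc")
    text = desc.text or ""
    desc.text = ""
    last = None  # most recently added child; unmatched text becomes its tail
    pos = 0
    for m in PATTERN.finditer(text):
        gap = text[pos:m.start()]
        if last is None:
            desc.text = gap
        else:
            last.tail = gap
        if m.group("term"):
            last = ET.SubElement(desc, "term")
        elif m.group("date"):
            last = ET.SubElement(desc, "date")
        elif m.group("length"):
            last = ET.SubElement(desc, "measure", type="length", unit="p",
                                 n=m.group("pages"))
        else:
            last = ET.SubElement(desc, "measure", type="format", unit="f",
                                 n=m.group("size"))
        last.text = m.group(0)
        pos = m.end()
    # Whatever follows the last match stays as trailing text.
    if last is None:
        desc.text = text[pos:]
    else:
        last.tail = text[pos:]
    # Move the price <measure> inside <desc>, as in the example output.
    price = item.find("measure[@commodity='currency']")
    if price is not None:
        item.remove(price)
        desc.append(price)
    return item


sample = ET.fromstring(
    '<item n="80" xml:id="CAT_000146_e80">'
    "<desc>L. a. s.; 1836, 1 p. in-8.</desc>"
    '<measure commodity="currency" unit="FRF" quantity="12">12</measure>'
    "</item>"
)
print(ET.tostring(annotate_desc(sample), encoding="unicode"))
```

Building the children with text and tail, rather than substituting tags into a string, keeps the punctuation between the pieces in place and avoids re-matching text that has already been tagged.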

Installation and use

* git clone https://github.com/katabase/2_CleanedData.git
* cd 2_CleanedData
* python3 -m venv my_env
* source my_env/bin/activate
* pip install -r requirements.txt
* cd script 
* python3 extractor_xml.py directory_to_process

Note that you have to be in the folder script to execute extractor_xml.py and that the script only works with filenames ending with _clean.xml (files must have been cleaned beforehand).

The output files will be in the folder output.
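
Before running the script, it can help to check which files in a directory satisfy the _clean.xml naming constraint mentioned above. The following is a minimal sketch and not part of the repository: the function name eligible_files and the sample filenames are hypothetical, and it assumes the files sit directly in the given directory rather than in subfolders.

```python
import tempfile
from pathlib import Path


def eligible_files(directory):
    """Return the sorted files in `directory` whose names end with _clean.xml."""
    return sorted(Path(directory).glob("*_clean.xml"))


# Demonstration on a throwaway directory with hypothetical filenames.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("CAT_000146_clean.xml", "CAT_000147.xml", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in eligible_files(tmp)]
    print(found)
```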

Credits

Scripts were created by Matthias Gille Levenson and improved by Alexandre Bartz with the help of Simon Gabay.
The catalogs were encoded by Lucie Rondeau du Noyer, Simon Gabay, Matthias Gille Levenson, Ljudmila Petkovic and Alexandre Bartz.

Cite this repository

Alexandre Bartz, Simon Gabay, Matthias Gille Levenson, Ljudmila Petkovic and Lucie Rondeau du Noyer, Manuscript sale catalogues, Neuchâtel: Université de Neuchâtel, 2020, https://github.com/katabase/2_CleanedData.

Licence

The catalogues are licensed under the Creative Commons Attribution 4.0 International Licence and the code is licensed under the GNU GPL-3.0.
