`autopipeline.sh`: a bash script to run the whole pipeline. The script checks whether the directories used as inputs and outputs exist; if output directories exist, a message warns the user that data might be overwritten and offers to delete those directories (this check is sketched after the how-to below). Be careful not to delete anything important.
- it is not up to date: it predates the creation of `3_WikidataEnrichment`
- how to use it:
```bash
mkdir katabase                   # create a folder to contain all repositories
cd katabase                      # move into the proper directory
# clone all necessary repositories
git clone https://github.com/katabase/utils
git clone https://github.com/katabase/1_OutputData.git
git clone https://github.com/katabase/2_CleanedData.git
git clone https://github.com/katabase/3_TaggedData.git
git clone https://github.com/katabase/Application.git
python3 -m venv env              # create a python virtual environment
source env/bin/activate          # activate the virtualenv
cd utils                         # move into the utils directory
pip install -r req_full.txt      # install the necessary libraries
bash autopipeline.sh             # launch the script
```
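For reference, the overwrite check boils down to something like this minimal sketch (the directory path is illustrative, not necessarily one the script actually tests):

```bash
# minimal sketch of the overwrite check; "2_CleanedData/output" is an
# illustrative path, not necessarily one autopipeline.sh actually tests
if [ -d "2_CleanedData/output" ]; then
    echo "2_CleanedData/output already exists: its data might be overwritten."
    read -p "delete it? [y/n] " answer
    if [ "$answer" = "y" ]; then
        rm -r "2_CleanedData/output"
    fi
fi
```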
`reorder.sh`: a bash script to move XML catalogues (`CAT_*.xml`) into the proper directories (`1-100`, `101-200`...) based on their id. The script is supposed to be usable at all steps of the Katabase pipeline.
- example: `CAT_000176.xml` will be moved to a directory named `101-200`, and so on (a sketch of this mapping follows the how-to below)
- the script checks if the destination directory exists; if not, it creates it before moving the file there
- it also checks the location of all `CAT_*.xml` files and moves them to the proper directory if needed
- how to:
```bash
cp utils/reorder.sh 1_OutputData  # copy the script to the directory you want to use it in (1_OutputData, 2_CleanedData, 3_TaggedData)
cd 1_OutputData                   # move into the directory you'll be using the script in
bash reorder.sh                   # launch the script
```
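The id-to-directory mapping mentioned above can be sketched as follows; the arithmetic is reconstructed from the `CAT_000176.xml` example, not copied from `reorder.sh`:

```bash
# reconstructed sketch of the mapping: CAT_000176.xml -> id 176 -> 101-200
f="CAT_000176.xml"
n=$(echo "$f" | sed 's/^CAT_0*\([0-9][0-9]*\)\.xml$/\1/')  # numeric id: 176
low=$(( (n - 1) / 100 * 100 + 1 ))                         # lower bound: 101
high=$(( low + 99 ))                                       # upper bound: 200
mkdir -p "${low}-${high}"                                  # create the directory if needed
mv "$f" "${low}-${high}/"                                  # move the file into it
```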
`rename_escriptorium.sh`: a bash script to rename `xml` and `png` files downloaded from eScriptorium.
- how it works: the files downloaded from eScriptorium all follow this structure: `filename_of_file_uploaded_to_escriptorium_page_N.xml`. The files are renamed by replacing the input filename with an identifier chosen by the user and changing the way the page number is written (a sketch of this renaming follows the how-to below)
- example: `CAT_000432.pdf_page_1.xml` becomes `1890_01_16_CHA_001.xml`
- how to:
```bash
# be in a directory with all the eScriptorium files and this script
bash rename_escriptorium.sh
```
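The renaming itself can be approximated like this; the identifier and the zero-padding of the page number are assumptions based on the example above (the real script prompts the user for the identifier):

```bash
# hypothetical approximation of the renaming; "1890_01_16_CHA" stands in
# for the identifier the user would choose
id="1890_01_16_CHA"
for f in *_page_*.xml; do
    page=$(echo "$f" | sed 's/^.*_page_\([0-9][0-9]*\)\.xml$/\1/')  # page number: 1
    mv "$f" "$(printf '%s_%03d.xml' "$id" "$page")"                 # -> 1890_01_16_CHA_001.xml
done
```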
`validator.py`: a python command line interface to validate and correct the XML files in `New_OutputData`. Those files are not clean; some of them do not follow the specifications of the ODD and are thus not valid. Two commands exist:
- `errlogger` checks the validity of the files against the ODD specification in RNG format (`_schemas/odd_katabase.rng`); an equivalent hand-run check is sketched after the how-to below
- `corrector` prompts the user to give the missing information; if the files are "problematic" (they can't easily be corrected from the CLI), they are moved to `out_a_corriger`; if the files are valid from the start or corrected by the user, they are moved to `out_clean`
- before using this script, several enhancements are necessary:
  - allow a `tei:item` to have more than one `tei:desc`: currently, if an item has more than one `tei:desc`, it is moved to `out_a_corriger`, despite this being a valid situation. Instead, if a `tei:item//tei:name` has no `@type` attribute, all the `tei:desc`s should be printed before the user is prompted to give an `@type` attribute (faulty line: `if len(name) != len(context):`)
  - if `tei:bibl//tei:date` is empty, prompt the user to add a date using the `@when` or the `@from` and `@to` of this element (a missing date causes an error when launching the website)
- how to:
```bash
cp utils/validator.py New_OutputData  # copy the script to the proper directory
cd New_OutputData                     # move to that directory
python validator.py errlogger         # to check the files' validity
python validator.py corrector         # to correct the files
```
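The check performed by `errlogger` can also be reproduced by hand with `xmllint`; this is an equivalent validation, not the script's actual implementation:

```bash
# validate one file against the RNG schema; prints the invalid elements
# if the file does not match the ODD (the file name is illustrative)
xmllint --noout --relaxng _schemas/odd_katabase.rng CAT_000101.xml
```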
`jsontocsv.py`: a python script to transform `export.json` (the json file obtained at the end of step `3_TaggedData`) into CSV format. `export.json` needs to be in the same folder as this script for it to work (a hypothetical `jq` equivalent is sketched after the how-to below).
- how to:
```bash
# have `jsontocsv.py` and `export.json` in the same directory
python jsontocsv.py
```
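For comparison, a similar transformation can be sketched with `jq`; this assumes `export.json` is a flat array of objects that all share the same scalar-valued keys, which may not match its actual structure:

```bash
# hypothetical jq equivalent, assuming export.json is a flat array of
# objects that all share the same scalar-valued keys
jq -r '(.[0] | keys_unsorted) as $keys
       | ($keys | @csv), (.[] | [.[$keys[]]] | @csv)' export.json > export.csv
```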
`nametable.py`: a python script to build a tsv of names in the corpus in order to align names with a wikidata id (a rough extraction sketch follows the how-to below).
- tsv structure:
  - `xml id`: the `@xml:id` of the `tei:item` in which the name is found
  - `wikidata id`: the wikidata identifier of a person / subject
  - `name`: the `tei:name` in catalogue entries; the `tei:name` can be the name of a person, but also a historical period, a subject...
  - `trait`: the `tei:trait` element, used to describe the information in `tei:name`
- the same names can and will be found several times in the different catalogues
- how to:
- expected file structure:
```
root_directory/
|_utils/
| |_nametable.py
|_1_OutputData/
  |_*0*  # the catalogue folders: 1-100, 101-200...
```
- run the script:
```bash
cd utils
python nametable.py
```
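For a rough idea of the extraction step, the `tei:name` elements of a single catalogue can be listed by hand as below; this is only an approximation (the script also records each item's `@xml:id` and `tei:trait`, and the file path is just an example):

```bash
# rough approximation: print every tei:name of one catalogue
# (the file path is illustrative)
xmllint --xpath "//*[local-name()='name']/text()" 1_OutputData/101-200/CAT_000176.xml
```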
`rm_suffix.sh`: delete the suffixes from all `xml` catalogues in a directory (`CAT_000101_wd.xml` => `CAT_000101.xml`).
- how to: run the following once you are in a directory containing xml catalogue files (the core of the renaming is sketched below):
```bash
bash rm_suffix.sh
```
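The core of the operation boils down to the following, assuming the suffix to strip is `_wd` as in the example above:

```bash
# strip the _wd suffix: CAT_000101_wd.xml -> CAT_000101.xml
for f in *_wd.xml; do
    mv "$f" "${f%_wd.xml}.xml"
done
```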
`full_requirements.txt`: a list of python packages needed to work on the whole pipeline (by creating a single python virtualenv for the first 4 steps, the web application, and all scripts in the utils and visualisation repositories).