Conversion scripts between CHILDES CHAT format and UD CoNLL-U format.
- repetitive feature strings in Italian Corpus
- header organisation needs to be revised
Python Version >= 3.6
This project uses Poetry for dependency management. Follow the official Poetry installation instructions to install it.
After cloning this repo, change directory to the chatconllu folder (the one containing the Makefile):
cd chatconllu
make install
make installchatconllu
chatconllu <CHILDES databases dir> <database name(s)>
Example
If your Brown Corpus is stored in ./tests/eng/Brown and you wish to convert only the .cha files in Adam and Eve but not Sarah, you should use the following command:
chatconllu ./tests/eng/Brown Adam Eve
The output .conllu files will be in the same folder.
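To preview which files a command like the one above would touch, the sketch below lists the .cha files under the chosen database folders. It assumes each database name is a direct subfolder of the CHILDES directory and that nested folders are searched too; `list_cha_files` is a hypothetical helper, not part of chatconllu.

```python
from pathlib import Path


def list_cha_files(corpus_dir, names):
    """Return the .cha files found under each named subfolder."""
    root = Path(corpus_dir)
    found = []
    for name in names:
        # Assumption: every database name is a direct subfolder of corpus_dir.
        found.extend(sorted((root / name).rglob("*.cha")))
    return found
```

Calling `list_cha_files("./tests/eng/Brown", ["Adam", "Eve"])` would then return the Adam and Eve transcripts and leave Sarah out.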
Use -f or --format to specify the input format. It defaults to cha and accepts cha and conllu.
chatconllu <CHILDES databases dir> -f conllu <database name(s)>
Example
If you wish to convert only the .conllu files back, use:
chatconllu ./tests/eng/Brown -f conllu Adam Eve
The output .cha files will NOT be in the same folder; they will appear in out/.
If you'd like to disregard the %mor (--no-mor) or %gra (--no-gra) tiers (or both) and mute the MISC field (--no-misc), try:
chatconllu <CHILDES databases dir> <database name(s)> --no-mor --no-gra --no-misc
If you, for some reason, would like to generate a new (and empty) %mor (--new-mor) or %gra (--new-gra) tier (or both), try:
chatconllu <CHILDES databases dir> <database name(s)> --new-mor --new-gra
Empty values are represented by _.
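As an illustration of the underscore convention, here is a minimal sketch that builds one CoNLL-U token line and writes _ for every field that has no value. The ten column names come from the CoNLL-U format; `token_line` is a hypothetical helper, and this is not necessarily how chatconllu writes its output.

```python
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def token_line(token):
    """Build one tab-separated CoNLL-U token line, writing "_" for empty fields."""
    values = []
    for field in CONLLU_FIELDS:
        value = token.get(field)
        # None and the empty string both count as "no value".
        values.append(str(value) if value not in (None, "") else "_")
    return "\t".join(values)


# A token with only an index and a word form: the other eight columns become "_".
print(token_line({"ID": 1, "FORM": "cookie"}))
```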
However, if you pass .conllu files through UDPipe and want to generate dependent tiers based on the augmented information, you could use:
chatconllu <CHILDES databases dir> <database name> -f conllu -fn <processed conllu file> --cnl --pos
- -fn: specifies a filename (without extension)
- --cnl: generates a %cnl tier, which handles syntax (dependency relations); it is similar to %gra
- --pos: generates a %pos tier, which handles morphology (without features); it is similar to %mor
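To make the idea of these two tiers concrete, the sketch below derives them from a UDPipe-style CoNLL-U sentence. The tier layouts are assumptions made for illustration only: %cnl is written as index|head|relation triples in the style of %gra, and %pos as UPOS|lemma pairs without features. The exact strings chatconllu emits may differ, and all function names here are hypothetical.

```python
def parse_tokens(conllu_text):
    """Return the token rows of a CoNLL-U sentence as lists of ten fields."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        if fields[0].isdigit():  # ignore multiword ranges (1-2) and empty nodes (1.1)
            rows.append(fields)
    return rows


def cnl_tier(rows):
    # Assumed layout, borrowed from %gra: index|head|relation
    return "%cnl:\t" + " ".join(f"{r[0]}|{r[6]}|{r[7]}" for r in rows)


def pos_tier(rows):
    # Assumed layout: UPOS|lemma, with morphological features left out
    return "%pos:\t" + " ".join(f"{r[3]}|{r[2]}" for r in rows)


SAMPLE = (
    "# text = more cookie\n"
    "1\tmore\tmore\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tcookie\tcookie\tNOUN\t_\t_\t0\troot\t_\t_\n"
)

rows = parse_tokens(SAMPLE)
print(cnl_tier(rows))
print(pos_tier(rows))
```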
- install CLAN
- open .cha file with CLAN
- run CHECK
Prerequisites
- at least Java 8
- download chatter.jar and follow the accompanying instructions.
- command line:
java -cp <path to chatter.jar> org.talkbank.chatter.App -inputFormat cha -outputFormat xml -tree <cha files dir> -outputDir <output dir>
Prerequisites
- clone UniversalDependencies/tools/ or download UniversalDependencies/tools/data/.
- use validate.py from UniversalDependencies/tools/validate.py.
python <path to validate.py> --lang <2-letter language code> --level <level from 1 to 5> <path to conllu file>
In its current state, chatconllu output passes validation up to level 2 (tested on English with the Brown Corpus).
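To validate a whole folder of converted files rather than one at a time, the command above can be wrapped in a small loop. This is a sketch, not part of chatconllu: the helper names are hypothetical, only the --lang and --level options shown above are used, and it assumes validate.py signals a failed validation through a nonzero exit status.

```python
import subprocess
import sys
from pathlib import Path


def validate_command(validate_py, lang, level, conllu_file):
    """Build the argument list for one validate.py run."""
    return [sys.executable, str(validate_py),
            "--lang", lang, "--level", str(level), str(conllu_file)]


def validate_folder(validate_py, folder, lang="en", level=2):
    """Run validate.py on every .conllu file in folder; return the files that failed."""
    failed = []
    for path in sorted(Path(folder).glob("*.conllu")):
        result = subprocess.run(validate_command(validate_py, lang, level, path))
        # Assumption: a nonzero exit status means the file did not validate.
        if result.returncode != 0:
            failed.append(path)
    return failed
```

For example, `validate_folder("<path to validate.py>", "./tests/eng/Brown/Adam")` would check each converted Adam file at level 2 and return the ones that need attention.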