refactorization and speedup: metadata extraction #896

Open
1 of 3 tasks
matyaskopp opened this issue Feb 5, 2025 · 2 comments

@matyaskopp
Collaborator

matyaskopp commented Feb 5, 2025

Currently, the extraction of metadata to TSV appears multiple times in the code:

print STDERR "INFO: Making metadata files\n";
opendir(CORPUSDIR, $inDir);
@rootFile = grep {/ParlaMint-[A-Z]{2}(?:-[A-Z0-9]{1,3})?(?:-[a-z]{2,3})?(\.ana)?\.xml$/} readdir(CORPUSDIR);
closedir(CORPUSDIR);
#For MTed corpora output only en metadata, for native, both xx and en
if ($MT) {@outLangs = ('en')} else {@outLangs = ('xx', 'en')}
# For orig corpora make ParlaMint-XX-meta.tsv in corpus language and ParlaMint-XX-meta-en.tsv in English
# For MTed corpora we produce ParlaMint-XX-en-meta.tsv in English
foreach my $outLang (@outLangs) {
    my $outSuffix;
    if ($MT and $outLang eq 'xx') {}
    elsif ($MT and $outLang eq 'en') {$outSuffix = "-meta.tsv"}
    elsif ($outLang eq 'xx') {$outSuffix = "-meta.tsv"}
    elsif ($outLang eq 'en') {$outSuffix = "-meta-en.tsv"}
    if ($outSuffix) {
        $command = "$Saxon" .
            " meta=" . File::Spec->catfile($inDir,$rootFile[0]) .
            " out-lang=$outLang" .
            " -xsl:$scriptMeta {} > $outDir/{/.}$outSuffix";
        `cat $fileFile | $Para '$command'`;
        # The rm following looks like a bug, as no TSV files are left if we are processing only .ana!
        #`rm -f $outDir/*.ana-meta.tsv`;
    }
}
`rename 's/\.ana//' $outDir/*-meta*.tsv`;

#For MTed corpora output only en metadata, for native, both xx and en
if ($MT) {@outLangs = ('en')} else {@outLangs = ('xx', 'en')}
# For orig corpora make ParlaMint-XX-meta.tsv in corpus language and ParlaMint-XX-meta-en.tsv in English
# For MTed corpora we produce ParlaMint-XX-en-meta.tsv in English
foreach my $outLang (@outLangs) {
    my $outSuffix;
    if ($MT and $outLang eq 'xx') {}
    elsif ($MT and $outLang eq 'en') {$outSuffix = "-meta.tsv"}
    elsif ($outLang eq 'xx') {$outSuffix = "-meta.tsv"}
    elsif ($outLang eq 'en') {$outSuffix = "-meta-en.tsv"}
    if ($outSuffix) {
        $command = "$Saxon meta=$rootAnaFile" .
            " out-lang=$outLang" .
            " -xsl:$scriptMeta {} > $outDir/{/.}$outSuffix";
        `cat $fileFile | $Para '$command'`;
    }
}
`rename 's/\.ana//' $outDir/*-meta*.tsv`;

Before I try to speed up the process, I need to

  • factorize it out

In parlamint2distro.pl, I want to call this metadata extraction separately, because it needs a different setup (fewer jobs in parallel). For ParlaMint-IL in particular, it is not possible to run it in 60 jobs, because it runs out of memory: all 45k files need to open the taxonomies and particDesc files, which are extremely large.
@TomazErjavec, should it be backwards compatible? I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.
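
As a rough illustration of what the factored-out part could look like, the language and suffix selection that is repeated in both snippets above can live in two small subroutines. This is only a sketch: the names `meta_out_langs` and `meta_suffix` are hypothetical, and the Saxon and parallel invocation is left out on purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: output languages for a corpus.
# MTed corpora get only English metadata, native corpora get both.
sub meta_out_langs {
    my ($is_mt) = @_;
    return $is_mt ? ('en') : ('xx', 'en');
}

# Hypothetical helper: TSV suffix for one output language,
# or undef when no file should be produced (MT corpus, 'xx').
sub meta_suffix {
    my ($is_mt, $out_lang) = @_;
    return '-meta.tsv'    if $out_lang eq 'xx' and not $is_mt;
    return '-meta.tsv'    if $out_lang eq 'en' and $is_mt;
    return '-meta-en.tsv' if $out_lang eq 'en';
    return;
}

# Show which suffix each combination would produce.
foreach my $is_mt (0, 1) {
    foreach my $lang (meta_out_langs($is_mt)) {
        my $suffix = meta_suffix($is_mt, $lang) // '(none)';
        print "MT=$is_mt lang=$lang suffix=$suffix\n";
    }
}
```

Both call sites would then differ only in how they locate the root file (`$rootFile[0]` versus `$rootAnaFile`).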

I can then try to speed up the process:

  • print all translations in one run
  • process multiple component files in one run (chunking), so header files will be parsed fewer times
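
The chunking idea can be sketched as follows: split the list of component files into fixed-size batches, so that each batch could be handed to a single run that parses the header files once. This is a minimal sketch under assumptions: `chunk_files` is a hypothetical helper, the file names are placeholders, and how a batch would actually be passed to the stylesheet is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: split a file list into chunks of at most $size files.
# Returns a list of array references, one per chunk.
sub chunk_files {
    my ($size, @files) = @_;
    die "chunk size must be positive\n" if $size < 1;
    my @chunks;
    while (@files) {
        # splice removes up to $size files from the front of the list
        push @chunks, [ splice(@files, 0, $size) ];
    }
    return @chunks;
}

# Placeholder component file names, for illustration only.
my @components = map { "ParlaMint-XX_$_.xml" } 1 .. 7;
my @chunks = chunk_files(3, @components);

foreach my $i (0 .. $#chunks) {
    printf "chunk %d: %s\n", $i + 1, join(' ', @{ $chunks[$i] });
}
```

With larger chunks the header files are parsed fewer times, at the cost of each job holding more component files; the right size would need to be measured on a corpus such as ParlaMint-IL.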

@TomazErjavec, please let me know what you think about this. I will then implement it in #894.

@matyaskopp matyaskopp added the enhancement New feature or request label Feb 5, 2025
@matyaskopp matyaskopp self-assigned this Feb 5, 2025
@TomazErjavec
Collaborator

Before I try to speed up the process, I need to factorize it out

If I understand correctly, you will make a new script, say parlamintp-tei2meta.pl, put this code there, and then call the original script + parlamintp-tei2meta.pl from parlamint2distro.pl. That is certainly a good idea, better than having the same code twice in different scripts.

@TomazErjavec, should it be backwards compatible?

I don't quite understand what you mean by this, but my intuitive answer is "no". If we can run the new scripts and get the same result as with the old ones, that is quite ok.

I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.

Hm, I don't see the need for that. It might even be dangerous, as it might (although probably won't) happen that the metadata is present, but from some previous version.

@TomazErjavec, please let me know what you think about this

I think it is a good idea.

@matyaskopp
Collaborator Author

The metadata extraction is now in Scripts/parlamintp-tei2meta.pl.
It uses the -inRoot parameter instead of the input directory, so in the samples I placed dirification before running the script:

&dirify($outSmpDir);
`$scriptMetas -jobs $procThreads -inRoot $outTeiSmpRoot -out $outSmpDir`;

&dirify($outSmpDir);
`$scriptMetas -jobs $procThreads -inRoot $outAnaSmpRoot -out $outSmpDir` unless $outTeiRoot;
