In parlamint2distro.pl, I want to call this metadata extraction separately, because it needs a different setup (fewer jobs in parallel). For ParlaMint-IL in particular, it cannot be run in 60 jobs because it runs out of memory: all 45k files need to open the taxonomies and particDesc files, which are extremely large. @TomazErjavec, should it be backwards compatible? I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.
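The idea of giving each stage its own cap on parallel jobs can be sketched roughly as follows. This is only an illustration in Python, not the Perl code in parlamint2distro.pl: `run_stage`, the two stand-in workers, the file names, and the job count of 8 are all hypothetical; only the 60-job figure comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(files, worker, max_jobs):
    """Apply worker to every file, with at most max_jobs running at once."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(worker, files))


# Hypothetical stand-ins for the two conversion steps.
def to_text(path):
    return path.replace(".xml", ".txt")


def to_meta(path):
    return path.replace(".xml", "-meta.tsv")


files = [f"ParlaMint-XX_{i:03d}.xml" for i in range(5)]

# Light step: many jobs are fine.
text_out = run_stage(files, to_text, max_jobs=60)
# Memory-heavy step: fewer jobs (8 is an assumed value, not a measured one).
meta_out = run_stage(files, to_meta, max_jobs=8)
```

Keeping the job limit as a per-stage argument means the memory-heavy step can be tuned per corpus without touching the other conversions.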
I can then try to speed up the process:
print all translations in one run
process multiple component files in one run (chunking), so header files will be parsed fewer times
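The chunking idea might look roughly like this. It is a minimal Python sketch under stated assumptions, not the planned Perl implementation: `chunks`, `parse_headers`, `extract_meta`, the file names, and the chunk size are hypothetical, and the header parsing is replaced by a counter so the saving is visible.

```python
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


header_parses = 0


def parse_headers():
    """Hypothetical stand-in for loading the taxonomies and particDesc files."""
    global header_parses
    header_parses += 1
    return {"taxonomies": {}, "persons": {}}


def extract_meta(component_files, chunk_size):
    """Process files chunk by chunk, parsing the headers once per chunk."""
    rows = []
    for batch in chunks(component_files, chunk_size):
        headers = parse_headers()  # once per chunk, not once per file
        for path in batch:
            # Placeholder for the real metadata row of this component file.
            rows.append((path, sorted(headers)))
    return rows


files = [f"component_{i}.xml" for i in range(10)]
rows = extract_meta(files, chunk_size=4)
```

With 10 files and a chunk size of 4, the headers are parsed 3 times instead of 10; larger chunks trade fewer header parses against less parallelism.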
@TomazErjavec, please let me know what you think about this; I will then implement it in #894.
Before I try to speed up the process, I need to factor it out
If I understand correctly, you will make a new script, say parlamintp-tei2meta.pl, put this code there, and then call the original script plus parlamintp-tei2meta.pl from parlamint2distro.pl. That is certainly a good idea, better than having the same code twice in different scripts.
I don't quite understand what you mean by this, but my intuitive answer is "no". If we can run the new scripts and get the same result as with the old ones, that is quite OK.
I can add an option -no-meta to the script, so that extracting metadata will not be called when it is present.
Hm, I don't see the need for that. It might even be dangerous, as it might (although probably won't) happen that the metadata is present, but from some previous version.
@TomazErjavec, please let me know, what you think about this
The metadata extraction is now in Scripts/parlamintp-tei2meta.pl
It uses the -inRoot parameter instead of the input directory, so I placed dirification before running the script in samples:
Currently, the extraction of metadata to TSV appears multiple times in the code:
ParlaMint/Scripts/parlamintp-tei2text.pl
Lines 51 to 75 in 2de4c7c
ParlaMint/Scripts/parlamintp2conllu.pl
Lines 107 to 124 in 2de4c7c