arpa2fst.py outputs an empty G_3_gram.fst.txt without any error messages #1877

Open
MeeElves opened this issue Feb 15, 2025 · 5 comments

@MeeElves

MeeElves commented Feb 15, 2025

Hi,

I use prepare_lm.sh to build HLG. However, the output file G_3_gram.fst.txt is empty.
I can convert vword.3gram.th1e-7.arpa to G.fst using arpa2fst in Kaldi, but a text-format G.fst is needed here.

How can I debug arpa2fst in k2?

The relevant recipe in prepare_lm.sh:
...
mkdir -p data/lm
if [ ! -f data/lm/G_3_gram.fst.txt ]; then
# It is used in building HLG
python3 -m kaldilm \
  --read-symbol-table="data/lang_phone/words.txt" \
  --disambig-symbol='#0' \
  --max-order=3 \
  $lm_word_dir/vword.3gram.th1e-7.arpa > data/lm/G_3_gram.fst.txt
...
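
For reference, a minimal sketch of a sanity check on the file this recipe produces. The path is the one used above; it assumes k2 is installed and loads the text FST with k2.Fsa.from_openfst, the same call compile_hlg.py uses later:

    # Sanity check on the arpa2fst output (a sketch, not part of icefall).
    # A text-format FST has one "src dst ilabel olabel [weight]" line per arc
    # plus final-state lines, so an empty file means no arcs were produced.
    import k2  # assumed to be installed, since it is needed for HLG anyway

    path = "data/lm/G_3_gram.fst.txt"
    with open(path) as f:
        s = f.read()

    print(f"{path}: {len(s.splitlines())} lines")

    if s.strip():
        G = k2.Fsa.from_openfst(s, acceptor=False)
        print("num states:", G.shape[0], "num arcs:", G.num_arcs)
    else:
        print("the file is empty: arpa2fst produced no arcs")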

vword.3gram.th1e-7.arpa looks like:
\data\
ngram 1= 45342
ngram 2= 1110560
ngram 3= 342977

\1-grams:
-2.34093
-99 -1.42299
-6.19571 GAdigAlik -0.129765
-4.53936 GAlbA -0.289333
-5.16337 GAlbigA -0.218938
-5.03704 GAlbilik -0.226182
-3.77986 GAlibA -0.498246
-5.97037 GAlibAN -0.200701
-4.34576 GAlibilik -0.278693
-4.56942 GAlibini -0.501565
-5.00196 GAlibiseri -0.446753
-4.57569 GAlibisi -0.203198
-5.27412 GAlibisigA -0.173731
-4.6907 GAlibisini -0.279294
-3.78802 GAlitA -0.327788
-5.52259 GAlitirAk -0.155028
-4.53936 GAllA -0.434895
-5.24367 GAlwA -0.170739
-5.21522 GAlwir -0.132563
-6.06158 GAlwirdA -0.129765
-4.85956 GAlyan -0.365357
-3.91926 GAm -0.475768
-6.34759 GAmHoluqiGa
-6.34759 GAmHorloq
-4.46482 GAmHorluq -0.438952
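
In addition, a sketch of a basic check on the ARPA file itself: compare the counts declared in the \data\ header with the number of entries actually present in each \N-grams: section (the file name is the one from this issue):

    # Basic ARPA sanity check (a sketch): header counts vs. actual entries.
    import re

    def check_arpa(path="vword.3gram.th1e-7.arpa"):
        declared = {}  # order -> count from the \data\ header
        counted = {}   # order -> number of entries actually found
        order = None
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                m = re.match(r"ngram (\d+)\s*=\s*(\d+)", line)
                if m:
                    declared[int(m.group(1))] = int(m.group(2))
                    continue
                m = re.match(r"\\(\d+)-grams:", line)
                if m:
                    order = int(m.group(1))
                    counted[order] = 0
                    continue
                if line.startswith("\\"):  # \data\ or \end\
                    order = None
                    continue
                if order is not None and line:
                    counted[order] += 1
        for n in sorted(declared):
            print(f"{n}-grams: declared {declared[n]}, found {counted.get(n, 0)}")

    check_arpa()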

@csukuangfj
Collaborator

Does your words.txt match your arpa file, i.e., are words in the arpa file present in the words.txt?

@MeeElves
Author

MeeElves commented Feb 16, 2025

Is there any tool that can check whether the arpa file and words.txt are compatible?

I took a quick look and found that the items in the arpa file can basically all be found in words.txt.
However, the following items appear in words.txt but not in the arpa file:
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
...
#0 45345
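
A minimal sketch of the kind of compatibility check asked about above: collect the vocabulary from the \1-grams: section of the ARPA file and compare it to words.txt (file names are the ones used in this issue):

    # Compare the ARPA unigram vocabulary with the words.txt symbol table.
    def arpa_vocab(arpa_path):
        words = set()
        in_unigrams = False
        with open(arpa_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line.startswith("\\1-grams:"):
                    in_unigrams = True
                    continue
                if in_unigrams:
                    if line.startswith("\\"):  # start of \2-grams: etc.
                        break
                    fields = line.split()
                    if len(fields) >= 2:  # logprob, word, [backoff]
                        words.add(fields[1])
        return words

    def symtab_vocab(words_txt):
        with open(words_txt, encoding="utf-8") as f:
            return {line.split()[0] for line in f if line.strip()}

    arpa = arpa_vocab("vword.3gram.th1e-7.arpa")
    symtab = symtab_vocab("data/lang_phone/words.txt")
    print("in ARPA but not in words.txt:", sorted(arpa - symtab)[:20])
    print("in words.txt but not in ARPA:", sorted(symtab - arpa)[:20])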

@MeeElves
Author

Reopening this so it won't be closed.

MeeElves reopened this Feb 16, 2025
@MeeElves
Author

I asked ChatGPT to write a script to check.
It confirms that the words in the arpa file also appear in words.txt.

@MeeElves
Author

MeeElves commented Feb 17, 2025

Current situation:

The words that appear in the ARPA file also appear in the words.txt file. This has been tested with a script.

I have also tried the following:

  1. Using fstprint to convert the FST generated by Kaldi to fst.txt, then generating HLG results in an OOM error, with memory usage exceeding 400GB:

    ./local/compile_hlg.py --lang-dir data/lang_phone
    
  2. Switched to:

    ./local/compile_hlg_using_openfst.py --lang-dir data/lang_phone
    

    This results in the following error:

    kaldifst.determinize_star(LG)
    RuntimeError: /project/kaldifst/csrc/determinize-star-inl.h:void fst::DeterminizerStar::ProcessTransition(fst::DeterminizerStar::OutputStateId, fst::DeterminizerStar::Label, std::vector::Element>*) [with F = fst::VectorFst > >; fst::DeterminizerStar::OutputStateId = int; fst::DeterminizerStar::Label = int]:971
    [E] FST was not functional -> not determinizable
    

Since I have no experience building language models, I would appreciate some advice. Thank you.

  1. The dataset and LM model files run without issues in Kaldi. I originally thought the migration to k2 would be a simple re-run, but there are still many problems in the migration process. It is possible that the ARPA file has issues, but _kaldilm.arpa2fst (from _kaldilm.cpython-312-x86_64-linux-gnu.so) only outputs an empty fst.txt file, with no other error messages. Where is the corresponding source code, and is it possible to debug it? (See the sketch after this list.)

  2. If I plan to rebuild the language model from the corpus text, which approach would be best?
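
Regarding point 1: a sketch (not an authoritative answer) that re-runs the same kaldilm invocation from prepare_lm.sh in isolation, so that its return code, stderr, and output size can be inspected directly. Paths are the ones used earlier in this issue:

    # Reproduce the kaldilm invocation and inspect its stderr / output size.
    import subprocess

    cmd = [
        "python3", "-m", "kaldilm",
        "--read-symbol-table=data/lang_phone/words.txt",
        "--disambig-symbol=#0",
        "--max-order=3",
        "vword.3gram.th1e-7.arpa",  # adjust to the actual $lm_word_dir path
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print("return code:", proc.returncode)
    print("stderr:")
    print(proc.stderr)
    print("stdout lines (arcs + final states):", len(proc.stdout.splitlines()))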
