arpa2fst.py outputs an empty G_3_gram.fst.txt without any error messages #1877

Open
MeeElves opened this issue Feb 15, 2025 · 5 comments

@MeeElves

MeeElves commented Feb 15, 2025

Hi,

I use prepare_lm.sh to build HLG. However, the output file G_3_gram.fst.txt is empty.
I can convert vword.3gram.th1e-7.arpa to G.fst using arpa2fst in Kaldi, but a text-format G.fst is needed here.

How can I debug arpa2fst in k2?

The relevant recipe in prepare_lm.sh:
...
mkdir -p data/lm
if [ ! -f data/lm/G_3_gram.fst.txt ]; then
# It is used in building HLG
python3 -m kaldilm \
  --read-symbol-table="data/lang_phone/words.txt" \
  --disambig-symbol='#0' \
  --max-order=3 \
  $lm_word_dir/vword.3gram.th1e-7.arpa > data/lm/G_3_gram.fst.txt
...
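
For reference, a minimal sketch of a sanity check on the file this recipe produces. The path is the one used above; it assumes k2 is installed and loads the text FST with k2.Fsa.from_openfst, the same call compile_hlg.py uses later:

    # Sanity check on the arpa2fst output (a sketch, not part of icefall).
    # A text-format FST has one "src dst ilabel olabel [weight]" line per arc
    # plus final-state lines, so an empty file means no arcs were produced.
    import k2  # assumed to be installed, since it is needed for HLG anyway

    path = "data/lm/G_3_gram.fst.txt"
    with open(path) as f:
        s = f.read()

    print(f"{path}: {len(s.splitlines())} lines")

    if s.strip():
        G = k2.Fsa.from_openfst(s, acceptor=False)
        print("num states:", G.shape[0], "num arcs:", G.num_arcs)
    else:
        print("the file is empty: arpa2fst produced no arcs")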

vword.3gram.th1e-7.arpa looks like:
\data\
ngram 1= 45342
ngram 2= 1110560
ngram 3= 342977

\1-grams:
-2.34093
-99 -1.42299
-6.19571 GAdigAlik -0.129765
-4.53936 GAlbA -0.289333
-5.16337 GAlbigA -0.218938
-5.03704 GAlbilik -0.226182
-3.77986 GAlibA -0.498246
-5.97037 GAlibAN -0.200701
-4.34576 GAlibilik -0.278693
-4.56942 GAlibini -0.501565
-5.00196 GAlibiseri -0.446753
-4.57569 GAlibisi -0.203198
-5.27412 GAlibisigA -0.173731
-4.6907 GAlibisini -0.279294
-3.78802 GAlitA -0.327788
-5.52259 GAlitirAk -0.155028
-4.53936 GAllA -0.434895
-5.24367 GAlwA -0.170739
-5.21522 GAlwir -0.132563
-6.06158 GAlwirdA -0.129765
-4.85956 GAlyan -0.365357
-3.91926 GAm -0.475768
-6.34759 GAmHoluqiGa
-6.34759 GAmHorloq
-4.46482 GAmHorluq -0.438952
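
In addition, a sketch of a basic check on the ARPA file itself: compare the counts declared in the \data\ header with the number of entries actually present in each \N-grams: section (the file name is the one from this issue):

    # Basic ARPA sanity check (a sketch): header counts vs. actual entries.
    import re

    def check_arpa(path="vword.3gram.th1e-7.arpa"):
        declared = {}  # order -> count from the \data\ header
        counted = {}   # order -> number of entries actually found
        order = None
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                m = re.match(r"ngram (\d+)\s*=\s*(\d+)", line)
                if m:
                    declared[int(m.group(1))] = int(m.group(2))
                    continue
                m = re.match(r"\\(\d+)-grams:", line)
                if m:
                    order = int(m.group(1))
                    counted[order] = 0
                    continue
                if line.startswith("\\"):  # \data\ or \end\
                    order = None
                    continue
                if order is not None and line:
                    counted[order] += 1
        for n in sorted(declared):
            print(f"{n}-grams: declared {declared[n]}, found {counted.get(n, 0)}")

    check_arpa()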

@csukuangfj
Collaborator

Does your words.txt match your arpa file, i.e., are words in the arpa file present in the words.txt?

@MeeElves
Author

MeeElves commented Feb 16, 2025

Is there any tool that can check whether the arpa file and words.txt are compatible?

I took a quick look and found that the items in the arpa file can basically all be found in words.txt.
However, the following items appear in words.txt but not in the arpa file:
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
...
#0 45345
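
A minimal sketch of the kind of compatibility check asked about above: collect the vocabulary from the \1-grams: section of the ARPA file and compare it to words.txt (file names are the ones used in this issue):

    # Compare the ARPA unigram vocabulary with the words.txt symbol table.
    def arpa_vocab(arpa_path):
        words = set()
        in_unigrams = False
        with open(arpa_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line.startswith("\\1-grams:"):
                    in_unigrams = True
                    continue
                if in_unigrams:
                    if line.startswith("\\"):  # start of \2-grams: etc.
                        break
                    fields = line.split()
                    if len(fields) >= 2:  # logprob, word, [backoff]
                        words.add(fields[1])
        return words

    def symtab_vocab(words_txt):
        with open(words_txt, encoding="utf-8") as f:
            return {line.split()[0] for line in f if line.strip()}

    arpa = arpa_vocab("vword.3gram.th1e-7.arpa")
    symtab = symtab_vocab("data/lang_phone/words.txt")
    print("in ARPA but not in words.txt:", sorted(arpa - symtab)[:20])
    print("in words.txt but not in ARPA:", sorted(symtab - arpa)[:20])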

@MeeElves
Author

Reopening this so it won't be closed.

MeeElves reopened this Feb 16, 2025
@MeeElves
Author

I asked ChatGPT to write a script to check.
It confirms that the words in the arpa file also appear in words.txt.

@MeeElves
Author

MeeElves commented Feb 17, 2025

Current situation:

The words that appear in the ARPA file also appear in the words.txt file. This has been tested with a script.

I have also tried the following:

  1. Using fstprint to convert the FST generated by Kaldi to fst.txt, then generating HLG results in an OOM error, with memory usage exceeding 400GB:

    ./local/compile_hlg.py --lang-dir data/lang_phone
    
  2. Switched to:

    ./local/compile_hlg_using_openfst.py --lang-dir data/lang_phone
    

    This results in the following error:

    kaldifst.determinize_star(LG)
    RuntimeError: /project/kaldifst/csrc/determinize-star-inl.h:void fst::DeterminizerStar::ProcessTransition(fst::DeterminizerStar::OutputStateId, fst::DeterminizerStar::Label, std::vector::Element>*) [with F = fst::VectorFst > >; fst::DeterminizerStar::OutputStateId = int; fst::DeterminizerStar::Label = int]:971
    [E] FST was not functional -> not determinizable
    

Since I have no experience building language models, I would appreciate some advice. Thank you.

  1. The dataset and LM model files run without issues in Kaldi. I originally thought the migration to k2 would be a simple re-run, but there are still many problems in the migration process. It is possible that the ARPA file has issues, but _kaldilm.arpa2fst (from _kaldilm.cpython-312-x86_64-linux-gnu.so) only outputs an empty fst.txt file, with no other error messages. Where is the corresponding source code, and is it possible to debug it? (See the sketch after this list.)

  2. If I plan to rebuild the language model from the corpus text, which approach would be best?
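
Regarding point 1: a sketch (not an authoritative answer) that re-runs the same kaldilm invocation from prepare_lm.sh in isolation, so that its return code, stderr, and output size can be inspected directly. Paths are the ones used earlier in this issue:

    # Reproduce the kaldilm invocation and inspect its stderr / output size.
    import subprocess

    cmd = [
        "python3", "-m", "kaldilm",
        "--read-symbol-table=data/lang_phone/words.txt",
        "--disambig-symbol=#0",
        "--max-order=3",
        "vword.3gram.th1e-7.arpa",  # adjust to the actual $lm_word_dir path
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print("return code:", proc.returncode)
    print("stderr:")
    print(proc.stderr)
    print("stdout lines (arcs + final states):", len(proc.stdout.splitlines()))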
