Update doc with `--discount_fallback` option for LM
The command used to build the LM can fail if the text corpus is too small. In that case, the `--discount_fallback` option should be added (see here).
We need to add this information to DAN's documentation.
```
bin/lmplz --order 3 --text ../../dan/data/madcat/language_model/corpus_words.txt --arpa ../../dan/data/madcat/model_words.arpa
=== 1/5 Counting and sorting n-grams ===
Reading corpus_words.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 1356499 types 36791
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:441492 2:4435973632 3:8317450752
/home/mboillet/Desktop/Git/kenlm/lm/builder/adjust_counts.cc:49 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 4 because we didn't observe any 1-grams with adjusted count 3; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback
Aborted (core dumped)
```
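
For the documentation, the corrected command could look like the sketch below (same paths as in the failing run above, which are only an example from this setup):

```
bin/lmplz --order 3 --discount_fallback \
  --text ../../dan/data/madcat/language_model/corpus_words.txt \
  --arpa ../../dan/data/madcat/model_words.arpa
```

With `--discount_fallback`, lmplz falls back to default discount values when the Kneser-Ney discounts cannot be estimated from the corpus, instead of aborting.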