# Training an explicit language model

DAN supports lattice rescoring using a statistical language model. This documentation explains how to build a language model with [kenlm](https://kheafield.com/code/kenlm/). Note that you can also use [SRILM](http://www.speech.sri.com/projects/srilm/).

## Install kenlm

To build the language model, you first need to install and compile [kenlm](https://github.com/kpu/kenlm) by following the instructions detailed in the [README](https://github.com/kpu/kenlm#compiling).

## Build the language model

The `teklia-dan dataset extract` command automatically generates the files required to train a language model at character, subword, or word level in `my_dataset/language_model/`.

Note that line breaks are replaced by spaces in the language model corpus files.

### Character-level

At character-level, we recommend building a 6-gram model. Use the following command:

```sh
bin/lmplz --order 6 \
    --text my_dataset/language_model/corpus_characters.txt \
    --arpa my_dataset/language_model/model_characters.arpa \
    --discount_fallback
```

Note that the `--discount_fallback` option can be removed if your corpus is very large.

The following message should be displayed if the language model was built successfully:

```sh
=== 1/5 Counting and sorting n-grams ===
Reading language_model/corpus.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 111629 types 109
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1308 2:784852864 3:1471599104 4:2354558464 5:3433731328 6:4709116928
Statistics:
1 109 D1=0.586207 D2=0.534483 D3+=1.5931
2 1734 D1=0.538462 D2=1.09853 D3+=1.381
3 7957 D1=0.641102 D2=1.02894 D3+=1.37957
4 17189 D1=0.747894 D2=1.20483 D3+=1.41084
5 25640 D1=0.812458 D2=1.2726 D3+=1.57601
6 32153 D1=0.727411 D2=1.13511 D3+=1.42722
Memory estimate for binary LM:
type      kB
probing 1798 assuming -p 1.5
probing 2107 assuming -r models -p 1.5
trie     696 without quantization
trie     313 assuming -q 8 -b 8 quantization
trie     648 assuming -a 22 array pointer compression
trie     266 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:12643224 kB	VmRSS:6344 kB	RSSMax:1969316 kB	user:0.196445	sys:0.514686	CPU:0.711161	real:0.682693
```
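If you want a quick sanity check of the resulting model, kenlm ships a `query` binary that scores whitespace-separated tokens read from standard input. The snippet below is only a minimal sketch: the example sentence is made up, and at character level the tokens must be single space-separated characters, formatted exactly like the lines of `corpus_characters.txt`.

```sh
# Score a space-separated character sequence with the freshly built ARPA model.
# The sentence below is only an illustration; use a line from your own corpus.
echo "e x a m p l e" | bin/query my_dataset/language_model/model_characters.arpa
```

The tool prints a log-probability for each token and the overall perplexity, which is a convenient way to confirm that the model was trained on the corpus you expected.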
### Subword-level

At subword-level, we recommend building a 6-gram model. Use the following command:

```sh
bin/lmplz --order 6 \
    --text my_dataset/language_model/corpus_subwords.txt \
    --arpa my_dataset/language_model/model_subwords.arpa \
    --discount_fallback
```

Note that the `--discount_fallback` option can be removed if your corpus is very large.

### Word-level

At word-level, we recommend building a 3-gram model. Use the following command:

```sh
bin/lmplz --order 3 \
    --text my_dataset/language_model/corpus_words.txt \
    --arpa my_dataset/language_model/model_words.arpa \
    --discount_fallback
```

Note that the `--discount_fallback` option can be removed if your corpus is very large.

## Predict with a language model

See the [dedicated example](../predict/index.md#predict-with-an-external-n-gram-language-model).
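As an optional kenlm-side optimization, an ARPA file can be converted to kenlm's binary format with the `build_binary` tool, which loads faster and uses less memory than the plain-text ARPA file. This is only a sketch of that conversion; check the prediction documentation linked above to see whether your DAN version expects the `.arpa` file or also accepts the binary format.

```sh
# Convert the word-level ARPA model to kenlm's binary format (optional).
bin/build_binary my_dataset/language_model/model_words.arpa \
    my_dataset/language_model/model_words.binary
```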