# Training an explicit language model

DAN supports lattice rescoring using a statistical language model.
This documentation gives instructions to build a language model with [kenlm](https://kheafield.com/code/kenlm/). Note that you can also use [SRILM](http://www.speech.sri.com/projects/srilm/).

## Install kenlm

To build the language model, you first need to install and compile [kenlm](https://github.com/kpu/kenlm) by following the instructions detailed in the [README](https://github.com/kpu/kenlm#compiling).
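As a reference, a minimal build following the kenlm README (assuming `cmake` and a C++ compiler are available) looks like this:

```sh
# Clone and compile kenlm; the binaries end up in kenlm/build/bin/
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake ..
make -j 4
```

The `bin/lmplz` commands below assume that you run them from this `build/` directory, or that you adjust the path to wherever the binaries were compiled.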

## Build the language model

The `teklia-dan dataset extract` command automatically generates the files required to train a language model at character, subword, or word level in `my_dataset/language_model/`.
Note that line breaks are replaced by spaces in the language model corpora.
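
For reference, the extraction should leave one corpus file per tokenization level (the file names below are the ones used in the training commands later on):

```sh
ls my_dataset/language_model/
# corpus_characters.txt  corpus_subwords.txt  corpus_words.txt
```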

### Character-level

At character-level, we recommend building a 6-gram model. Use the following command:

```sh
bin/lmplz --order 6 \
    --text my_dataset/language_model/corpus_characters.txt \
    --arpa my_dataset/language_model/model_characters.arpa \
    --discount_fallback
```

Note that the `--discount_fallback` option can be removed if your corpus is very large.

The following message should be displayed if the language model was built successfully:

```sh
=== 1/5 Counting and sorting n-grams ===
Reading language_model/corpus.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 111629 types 109
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1308 2:784852864 3:1471599104 4:2354558464 5:3433731328 6:4709116928
Statistics:
1 109 D1=0.586207 D2=0.534483 D3+=1.5931
2 1734 D1=0.538462 D2=1.09853 D3+=1.381
3 7957 D1=0.641102 D2=1.02894 D3+=1.37957
4 17189 D1=0.747894 D2=1.20483 D3+=1.41084
5 25640 D1=0.812458 D2=1.2726 D3+=1.57601
6 32153 D1=0.727411 D2=1.13511 D3+=1.42722
Memory estimate for binary LM:
type      kB
probing 1798 assuming -p 1.5
probing 2107 assuming -r models -p 1.5
trie     696 without quantization
trie     313 assuming -q 8 -b 8 quantization
trie     648 assuming -a 22 array pointer compression
trie     266 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:12643224 kB	VmRSS:6344 kB	RSSMax:1969316 kB	user:0.196445	sys:0.514686	CPU:0.711161	real:0.682693
```
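
Optionally, you can sanity-check the resulting ARPA file with kenlm's `query` binary, which scores text read from standard input. As a rough smoke test (assuming the paths used above), you can score a few lines of the training corpus:

```sh
# Each line is scored with the 6-gram character model; low perplexity is expected
# since these lines were part of the training data.
head -n 5 my_dataset/language_model/corpus_characters.txt \
    | bin/query my_dataset/language_model/model_characters.arpa
```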

### Subword-level

At subword-level, we recommend building a 6-gram model. Use the following command:

```sh
bin/lmplz --order 6 \
    --text my_dataset/language_model/corpus_subwords.txt \
    --arpa my_dataset/language_model/model_subwords.arpa \
    --discount_fallback
```

Note that the `--discount_fallback` option can be removed if your corpus is very large.

### Word-level

At word-level, we recommend building a 3-gram model. Use the following command:

```sh
bin/lmplz --order 3 \
    --text my_dataset/language_model/corpus_words.txt \
    --arpa my_dataset/language_model/model_words.arpa \
    --discount_fallback
```

Note that the `--discount_fallback` option can be removed if your corpus is very large.
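
ARPA files can get large, especially at word level. If your decoding setup accepts kenlm binary models, they can be converted with kenlm's `build_binary` tool. This is only a sketch using the default `probing` format; compare with the memory estimates printed by `lmplz` before choosing a format:

```sh
# Convert the ARPA model to kenlm's binary format (faster to load, smaller on disk)
bin/build_binary my_dataset/language_model/model_words.arpa \
    my_dataset/language_model/model_words.binary
```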

## Predict with a language model

See the [dedicated example](../predict/index.md#predict-with-an-external-n-gram-language-model).