diff --git a/docs/get_started/training.md b/docs/get_started/training.md
index 1cb5a0cffb55d6aaf9c9e2fee9fc65e433add1af..e28b6d314e73731d20d743dabd4d00857437b235 100644
--- a/docs/get_started/training.md
+++ b/docs/get_started/training.md
@@ -4,13 +4,15 @@ There are a several steps to follow when training a DAN model.
 
 ## 1. Extract data
 
-The data must be extracted and formatted for training. To extract the data, DAN uses an Arkindex export database in SQLite format. You will need to:
+To extract the data, DAN uses an Arkindex export database in SQLite format. You will need to:
 
 1. Structure the data into folders (`train` / `val` / `test`) in [Arkindex](https://demo.arkindex.org/).
 1. [Export the project](https://doc.arkindex.org/howto/export/) in SQLite format.
 1. Extract the data with the [extract command](../usage/datasets/extract.md).
 
-At the end, you should have a tree structure like this:
+This command will extract and format the images and labels needed to train DAN. It will also tokenize the training corpus at character, subword, and word levels, allowing you to combine DAN with an explicit statistical language model to improve performance.
+
+At the end, you should get the following tree structure:
 
 ```
 output/
@@ -21,8 +23,14 @@ output/
 │   ├── val
 │   └── test
 ├── language_model
-│   ├── corpus.txt
-│   ├── lexicon.txt
+│   ├── corpus_characters.txt
+│   ├── lexicon_characters.txt
+│   ├── corpus_subwords.txt
+│   ├── lexicon_subwords.txt
+│   ├── corpus_words.txt
+│   ├── lexicon_words.txt
+│   ├── subword_tokenizer.model
+│   ├── subword_tokenizer.vocab
 │   └── tokens.txt
 ```
 
diff --git a/docs/usage/datasets/extract.md b/docs/usage/datasets/extract.md
index 116155b30da2ba4aa4913852c11ebab48f706a7d..f0f7113db4965f4ae67d6f9af8129dfb03f765b6 100644
--- a/docs/usage/datasets/extract.md
+++ b/docs/usage/datasets/extract.md
@@ -7,7 +7,7 @@ Use the `teklia-dan dataset extract` command to extract a dataset from an Arkind
 - Generate the images of each element (in the `images/` folder),
 - Create the mapping of the images (identified by its path) to the ground-truth transcription (with NER tokens if needed) (in the `labels.json` file),
 - Store the set of characters encountered in the dataset (in the `charset.pkl` file),
-- Generate the resources needed to build a N-gram language model with [kenlm](https://github.com/kpu/kenlm) (in the `language_model/` folder).
+- Generate the resources needed to build an n-gram language model at character, subword, or word level with [kenlm](https://github.com/kpu/kenlm) (in the `language_model/` folder).
 
 If an image download fails for whatever reason, it won't appear in the transcriptions file. The reason will be printed to stdout at the end of the process. Before trying to download the image, it checks that it wasn't downloaded previously. It is thus safe to run this command twice if a few images failed.
 
@@ -30,6 +30,7 @@ If an image download fails for whatever reason, it won't appear in the transcrip
 | `--keep-spaces`                  | Transcriptions are trimmed by default. Use this flag to disable this behaviour.                                                                                                                                                      | `bool`          | False                                              |
 | `--image-format`                 | Images will be saved under this format.                                                                                                                                                                                              | `str`           | `.jpg`                                             |
 | `--allow-empty`                  | Elements with no transcriptions are skipped by default. This flag disables this behaviour.                                                                                                                                           | `bool`          | False                                              |
+| `--subword-vocab-size`           | Size of the vocabulary used to train the sentencepiece subword tokenizer for the optional language model.                                                                                                                            | `int`           | `1000`                                             |
 
 The `--tokens` argument expects a YAML-formatted file with a specific format. A list of entries with each entry describing a NER entity. The label of the entity is the key to a dict mapping the starting and ending tokens respectively.
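+
+For instance, a hypothetical tokens file describing two entities could look like the following sketch (the entity labels and token characters are purely illustrative; use the ones defined in your own project):
+
+```yaml
+# Illustrative sketch only: each entity label maps to its starting and ending tokens.
+PERSON:
+  start: Ⓟ
+  end: Ⓠ
+LOCATION:
+  start: Ⓛ
+  end: Ⓜ
+```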
 
diff --git a/docs/usage/predict/examples.md b/docs/usage/predict/examples.md
index 7f7c0ed520bd6f2f525673d0d99b8eda19f77415..b7bc086c910af28cfcca51cd8e6397ec53bd91fd 100644
--- a/docs/usage/predict/examples.md
+++ b/docs/usage/predict/examples.md
@@ -147,38 +147,123 @@ It will create the following JSON file named `dan_humu_page/predict/example.json
 
 This example assumes that you have already [trained a language model](training_lm.md).
 
-First, update the `inference_parameters.yml` file obtained during DAN training. The `weight` parameter defines how much weight to give to the language model. It should be set carefully (usually between 0.5 and 2.0) as it will affect the quality of the predictions.
+Note that:
+
+- The `weight` parameter defines how much weight to give to the language model. It should be set carefully (usually between 0.5 and 2.0) as it will affect the quality of the predictions.
+- Linebreaks are treated as spaces by language models; as a result, predictions will not include linebreaks.
+
+#### Language model at character level
+
+First, update the `inference_parameters.yml` file obtained during DAN training.
 
 ```yaml
 parameters:
   ...
   language_model:
-    model: my_dataset/language_model/model.arpa
-    lexicon: my_dataset/language_model/lexicon.txt
+    model: my_dataset/language_model/model_characters.arpa
+    lexicon: my_dataset/language_model/lexicon_characters.txt
     tokens: my_dataset/language_model/tokens.txt
     weight: 0.5
 ```
 
 Then, run this command:
 
 ```shell
 teklia-dan predict \
-    --image dan_humu_page/example.jpg \
+    --image dan_humu_page/6e830f23-e70d-4399-8b94-f36ed3198575.jpg \
     --model dan_humu_page/model.pt \
-    --parameters dan_humu_page/parameters.yml \
+    --parameters dan_humu_page/inference_parameters_char_lm.yml \
     --charset dan_humu_page/charset.pkl \
     --use-language-model \
-    --output dan_humu_page/predict/
+    --output dan_humu_page/predict_char_lm/
 ```
 
-It will create the following JSON file named `dan_humu_page/predict/example.json`
+It will create the following JSON file named `dan_humu_page/predict_char_lm/6e830f23-e70d-4399-8b94-f36ed3198575.json`
+
+```json
+{
+  "text": "etc., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f8rterne\nbestad i at jeg kj\u00f8bte\net Forkl\u00e6de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f8rter og et Forkl\u00e6de,\nsamt Fragt paa det Gods\n(N\u00f8i) some man sendte\nmig ubet\u00e6lt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
+  "language_model": {
+    "text": "eet., some jeg netop idag holder Vask paa. Leien af Skj\u00f8rterne bestad i at jeg kj\u00f8bte et Forkl\u00e6de til hver af de to Piger, some havde laant os dem. Resten var Vask af Hardan- gerskj\u00f8rter og et Forkl\u00e6de, samt Fragt paa det Gods (T\u00f8i) some man sendte mig ubet\u00e6lt. Idag fik jeg hyggeligt Frimarkebrev fra Fosvold Med Hilsen Deres Hulda Garborg",
+    "confidence": 0.9
+  }
+}
+```
+
+#### Language model at subword level
+
+Update the `inference_parameters.yml` file obtained during DAN training.
+
+```yaml
+parameters:
+  ...
+  language_model:
+    model: my_dataset/language_model/model_subwords.arpa
+    lexicon: my_dataset/language_model/lexicon_subwords.txt
+    tokens: my_dataset/language_model/tokens.txt
+    weight: 0.5
+```
+
+Then, run this command:
+
+```shell
+teklia-dan predict \
+    --image dan_humu_page/6e830f23-e70d-4399-8b94-f36ed3198575.jpg \
+    --model dan_humu_page/model.pt \
+    --parameters dan_humu_page/inference_parameters_subword_lm.yml \
+    --charset dan_humu_page/charset.pkl \
+    --use-language-model \
+    --output dan_humu_page/predict_subword_lm/
+```
+
+It will create the following JSON file named `dan_humu_page/predict_subword_lm/6e830f23-e70d-4399-8b94-f36ed3198575.json`
+
+```json
+{
+  "text": "etc., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f8rterne\nbestad i at jeg kj\u00f8bte\net Forkl\u00e6de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f8rter og et Forkl\u00e6de,\nsamt Fragt paa det Gods\n(N\u00f8i) some man sendte\nmig ubet\u00e6lt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
+  "language_model": {
+    "text": "eet., some jeg netop idag holder Vask paa. Leien af Skj\u00f8rterne bestad i at jeg kj\u00f8bte et Forkl\u00e6de til hver af de to Piger, some havde laant os dem. Resten var Vask af Hardan- gerskj\u00f8rter og et Forkl\u00e6de, samt Fragt paa det Gods (T\u00f8i) some man sendte mig ubet\u00e6lt. Idag fik jeg hyggeligt Frim\u00e6rkebrev fra Fosvold Med Hilsen Deres Hulda Garborg",
+    "confidence": 0.84
+  }
+}
+```
+
+#### Language model at word level
+
+Update the `inference_parameters.yml` file obtained during DAN training.
+
+```yaml
+parameters:
+  ...
+  language_model:
+    model: my_dataset/language_model/model_words.arpa
+    lexicon: my_dataset/language_model/lexicon_words.txt
+    tokens: my_dataset/language_model/tokens.txt
+    weight: 0.5
+```
+
+Then, run this command:
+
+```shell
+teklia-dan predict \
+    --image dan_humu_page/6e830f23-e70d-4399-8b94-f36ed3198575.jpg \
+    --model dan_humu_page/model.pt \
+    --parameters dan_humu_page/inference_parameters_word_lm.yml \
+    --charset dan_humu_page/charset.pkl \
+    --use-language-model \
+    --output dan_humu_page/predict_word_lm/
+```
+
+It will create the following JSON file named `dan_humu_page/predict_word_lm/6e830f23-e70d-4399-8b94-f36ed3198575.json`
 
 ```json
 {
   "text": "etc., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f8rterne\nbestad i at jeg kj\u00f8bte\net Forkl\u00e6de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f8rter og et Forkl\u00e6de,\nsamt Fragt paa det Gods\n(N\u00f8i) some man sendte\nmig ubet\u00e6lt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
   "language_model": {
-    "text": "eet., some jeg netop idag\nholder Vask paa.\nLeien af Skj\u00f9rterne\nbestad i at jeg kj\u00f9bte\net Forkl\u00e7de til hver\naf de to Piger, some\nhavde laant os dem.\nResten var Vask af Hardan-\ngerskj\u00f9rter og et Forkl\u00e7de,\nsamt Fragt paa det Gods\n(N\u00f9i) some man sendte\nmig ubetalt.\nIdag fik jeg hyggeligt\nFrimarkebrev fra Fosvold\nMed Hilsen\nDeres\nHulda Garborg",
-    "confidence": 0.87
+    "text": "etc., some jeg netop idag holder Vask paa. Leien af Skj\u00f8rterne bestad i at jeg kj\u00f8bte et Forkl\u00e6de til hver af de to Piger, some havde laant os dem. Resten var Vask af Hardan- gerskj\u00f8rter og et Forkl\u00e6de, samt Fragt paa det Gods (T\u00f8i) some man sendte mig ubetalt. Idag fik jeg hyggeligt Frim\u00e6rkebrev fra Fosvold Med Hilsen Deres Hulda Garborg",
+    "confidence": 0.77
   }
 }
 ```
diff --git a/docs/usage/predict/training_lm.md b/docs/usage/predict/training_lm.md
index f44a4838283f0503ccaaa570174dc1df14caf82e..a7f6013b86a9bf34e959008574cc4698ade406ad 100644
--- a/docs/usage/predict/training_lm.md
+++ b/docs/usage/predict/training_lm.md
@@ -9,14 +9,18 @@ To build the language model, you first need to install and compile [kenlm](https
 
 ## Build the language model
 
-The `teklia-dan dataset extract` automatically generate the files required to train the language model in `my_dataset/language_model/`.
+The `teklia-dan dataset extract` command automatically generates the files required to train a language model at character, subword, or word level in `my_dataset/language_model/`.
 
-Use the following command to build a 6-gram language model:
+Note that linebreaks are replaced by spaces in the language model.
+
+### Character-level
+
+At the character level, we recommend building a 6-gram model. Use the following command:
 
 ```sh
 bin/lmplz --order 6 \
-    --text my_dataset/language_model/corpus.txt \
-    --arpa my_dataset/language_model/model.arpa
+    --text my_dataset/language_model/corpus_characters.txt \
+    --arpa my_dataset/language_model/model_characters.arpa
 ```
 
 The following message should be displayed if the language model was built successfully.
@@ -58,6 +62,26 @@ Chain sizes: 1:1308 2:27744 3:159140 4:412536 5:717920 6:1028896
 Name:lmplz	VmPeak:12643224 kB	VmRSS:6344 kB	RSSMax:1969316 kB	user:0.196445	sys:0.514686	CPU:0.711161	real:0.682693
 ```
 
+### Subword-level
+
+At the subword level, we recommend building a 6-gram model. Use the following command:
+
+```sh
+bin/lmplz --order 6 \
+    --text my_dataset/language_model/corpus_subwords.txt \
+    --arpa my_dataset/language_model/model_subwords.arpa
+```
+
+### Word-level
+
+At the word level, we recommend building a 3-gram model. Use the following command:
+
+```sh
+bin/lmplz --order 3 \
+    --text my_dataset/language_model/corpus_words.txt \
+    --arpa my_dataset/language_model/model_words.arpa
+```
+
 ## Predict with a language model
 
 See the [dedicated example](examples.md#predict-with-an-external-n-gram-language-model).