Commit d26e8f41 authored by Solene Tarride, committed by Mélodie Boillet

Add documentation for DAN training

Merge request !65: Add documentation for DAN training
@@ -6,7 +6,7 @@ When `teklia-dan` is installed in your environment, you may use the following commands:
 : To preprocess datasets from Arkindex for training. More details in [the dedicated section](./datasets/index.md).

 `teklia-dan train`
-: To train a new DAN model. More details in [the dedicated section](./train.md).
+: To train a new DAN model. More details in [the dedicated section](./train/index.md).

 `teklia-dan generate`
 : To generate synthetic data to train DAN models. More details in [the dedicated section](./generate.md).
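For orientation, these commands are invoked directly from a shell. A minimal sketch (assuming the CLI exposes a standard `--help` flag; the linked sections describe the required configuration):

```sh
# list the available subcommands and their options
teklia-dan --help

# train a document-level model (see the training section for configuration)
teklia-dan train document
```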
# Train
Use the `teklia-dan train` command to train a new DAN model.
Two subcommands are available depending on your dataset:
`line`
: Train a DAN model at line level.
`document`
: Train a DAN model at document level.
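Each subcommand is run directly from the shell, for example:

```sh
# line-level training
teklia-dan train line

# document-level training
teklia-dan train document
```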
## Remarks (for pre-training and training)
All hyperparameters are specified and editable in the training scripts; their meanings are documented in comments.
Evaluation is performed just after training ends (training stops when the maximum elapsed time is reached or after the maximum number of epochs specified in the training script).
The output files are split into two subfolders:
`checkpoints`
: Contains model weights for the last trained epoch and for the epoch giving the best valid CER.
`results`
: Contains the tensorboard log for loss and metrics, as well as a text file with the hyperparameters used and the evaluation results.
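Concretely, after a run you would find something like the following (a sketch; the `output` folder name and exact file names depend on the training script configuration):

```sh
# weights for the last trained epoch and for the best-valid-CER epoch
ls output/checkpoints

# tensorboard logs, hyperparameters and evaluation results
ls output/results
```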
# Train
Use the `teklia-dan train` command to train a new DAN model.
Two subcommands are available depending on your dataset:
`line`
: Train a DAN model at line level and evaluate it.
`document`
: Train a DAN model at document level and evaluate it.
## Examples
### Document
To train DAN on documents:
1. Set your training configuration in `dan/ocr/document/train.py`. Refer to the [dedicated section](parameters.md) for a description of parameters.
2. Run `teklia-dan train document`.
3. Look at the evaluation results in the `output` folder:
* `checkpoints` contains model weights for the last trained epoch and for the epoch giving the best valid CER.
* `results` contains the tensorboard log file, the parameters file, and the evaluation results for the best epoch.
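To browse the tensorboard logs mentioned above, a minimal sketch (assuming tensorboard is installed in your environment):

```sh
# visualize the loss and metric curves logged during training
tensorboard --logdir output/results
```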
### Line
To train DAN on lines:
1. Set your training configuration in `dan/ocr/line/train.py`. Refer to the [dedicated section](parameters.md) for a description of parameters.
2. Run `teklia-dan train line`.
3. Look at the evaluation results in the `output` folder:
* `checkpoints` contains model weights for the last trained epoch and for the epoch giving the best valid CER.
* `results` contains the tensorboard log file, the parameters file, and the evaluation results for the best epoch.
Note that it is possible to run `teklia-dan train document` to train DAN on text lines. However, the configuration must be updated when training on synthetic documents.
## Additional page
* [Jean Zay tutorial](jeanzay.md)
# Training on Jean Zay
See the [wiki](https://redmine.teklia.com/projects/research/wiki/Jean_Zay) for more details.
## Run a training job
Warning: there is no HTTP connection during a job.
You can debug using an interactive job. The following command will get you a new terminal with 1 GPU for 1 hour: `srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i`.
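The same command, reformatted for readability, followed by the steps you would typically run once inside the interactive shell (a sketch based on the batch script below; the conda environment path is an example to adapt):

```sh
# request an interactive shell with 1 GPU for 1 hour on the dev QoS
srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 \
    --qos=qos_gpu-dev --pty bash -i

# inside the interactive shell: set up the environment and launch training
module purge
module load anaconda-py3
conda activate /path/to/your/envs/dan/   # adapt to your own environment
teklia-dan train document
```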
You should run the actual training using a passive/batch job:
* Run `sbatch train_dan.sh`.
* The `train_dan.sh` file should look like the example below.
```sh
#!/bin/bash
#SBATCH --constraint=v100-32g
#SBATCH --qos=qos_gpu-t4 # partition
#SBATCH --job-name=dan_training # name of the job
#SBATCH --gres=gpu:1 # number of GPUs per node
#SBATCH --cpus-per-task=10 # number of cores per task
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --distribution=block:block # we pin the tasks on contiguous cores
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MPI tasks per node
#SBATCH --time=99:00:00 # max exec time
#SBATCH --output=dan_train_hugin_munin_page_%j.out # output log file
#SBATCH --error=dan_train_hugin_munin_page_%j.err # error log file
module purge # purging modules inherited by default
module load anaconda-py3
conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/
# print started commands
set -x
# execution
teklia-dan train document
```
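Once the script is ready, submit it and follow its logs. The log file names follow the `--output`/`--error` patterns declared above, with `%j` replaced by the job id:

```sh
# submit the batch job; sbatch prints the assigned job id
sbatch train_dan.sh

# follow the training log (replace <jobid> with the id printed by sbatch)
tail -f dan_train_hugin_munin_page_<jobid>.out
```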
## Supervise a job
* Use `squeue -u $USER`. This command should give an output similar to the one presented below.
```
(base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1762916 gpu_p13 pylaia_t ubz97wr R 23:07:54 1 r7i6n1
1762954 gpu_p13 pylaia_t ubz97wr R 22:35:57 1 r7i3n1
```
## Delete a job
* Use `scancel $JOBID` to cancel a specific job.
* Use `scancel -u $USER` to cancel all your jobs.
@@ -59,7 +59,10 @@ nav:
       - usage/datasets/index.md
       - Dataset extraction: usage/datasets/extract.md
       - Dataset formatting: usage/datasets/format.md
-  - Training: usage/train.md
+  - Training:
+      - usage/train/index.md
+      - Parameters: usage/train/parameters.md
+      - Jean Zay tutorial: usage/train/jeanzay.md
   - Generate: usage/generate.md
   - Predict: usage/predict.md
   - Documentation development: dev/build_docs.md
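To check that the new navigation renders as intended, the documentation can be previewed locally. A minimal sketch, assuming a standard MkDocs setup (the project's own build instructions live in `dev/build_docs.md`):

```sh
# serve the documentation with live reload, then open http://127.0.0.1:8000
mkdocs serve
```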