Commit d26e8f41 authored by Solene Tarride, committed by Mélodie Boillet

Add documentation for DAN training

Merge request !65: Add documentation for DAN training
@@ -6,7 +6,7 @@ When `teklia-dan` is installed in your environment, you may use the following commands:
 : To preprocess datasets from Arkindex for training. More details in [the dedicated section](./datasets/index.md).

 `teklia-dan train`
-: To train a new DAN model. More details in [the dedicated section](./train.md).
+: To train a new DAN model. More details in [the dedicated section](./train/index.md).

 `teklia-dan generate`
 : To generate synthetic data to train DAN models. More details in [the dedicated section](./generate.md).
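For orientation, these commands are invoked directly from a shell. A minimal sketch (assuming the CLI exposes a standard `--help` flag; the linked sections describe the required configuration):

```sh
# list the available subcommands and their options
teklia-dan --help

# train a document-level model (see the training section for configuration)
teklia-dan train document
```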
# Train
Use the `teklia-dan train` command to train a new DAN model.
Two subcommands are available depending on your dataset:
`line`
: Train a DAN model at line level.
`document`
: Train a DAN model at document level.
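Each subcommand is run directly from the shell, for example:

```sh
# line-level training
teklia-dan train line

# document-level training
teklia-dan train document
```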
## Remarks (for pre-training and training)
All hyperparameters are specified and editable in the training scripts; their meanings are documented in comments.
Evaluation is performed just after training ends (training stops when the maximum elapsed time is reached or after the maximum number of epochs specified in the training script).
The output files are split into two subfolders:
`checkpoints`
: Contains model weights for the last trained epoch and for the epoch giving the best valid CER.
`results`
: Contains the tensorboard log for loss and metrics, as well as a text file with the hyperparameters used and the evaluation results.
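Concretely, after a run you would find something like the following (a sketch; the `output` folder name and exact file names depend on the training script configuration):

```sh
# weights for the last trained epoch and for the best-valid-CER epoch
ls output/checkpoints

# tensorboard logs, hyperparameters and evaluation results
ls output/results
```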
# Train
Use the `teklia-dan train` command to train a new DAN model.
Two subcommands are available depending on your dataset:
`line`
: Train a DAN model at line level and evaluate it.
`document`
: Train a DAN model at document level and evaluate it.
## Examples
### Document
To train DAN on documents:
1. Set your training configuration in `dan/ocr/document/train.py`. Refer to the [dedicated section](parameters.md) for a description of parameters.
2. Run `teklia-dan train document`.
3. Look at the evaluation results in the `output` folder:
* `checkpoints` contains model weights for the last trained epoch and for the epoch giving the best valid CER.
* `results` contains the tensorboard log file, the parameters file, and the evaluation results for the best epoch.
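To browse the tensorboard logs mentioned above, a minimal sketch (assuming tensorboard is installed in your environment):

```sh
# visualize the loss and metric curves logged during training
tensorboard --logdir output/results
```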
### Line
To train DAN on lines:
1. Set your training configuration in `dan/ocr/line/train.py`. Refer to the [dedicated section](parameters.md) for a description of parameters.
2. Run `teklia-dan train line`.
3. Look at the evaluation results in the `output` folder:
* `checkpoints` contains model weights for the last trained epoch and for the epoch giving the best valid CER.
* `results` contains the tensorboard log file, the parameters file, and the evaluation results for the best epoch.
Note that it is possible to run `teklia-dan train document` to train DAN on text lines. However, the configuration must be updated when training on synthetic documents.
## Additional page
* [Jean Zay tutorial](jeanzay.md)
# Training on Jean Zay
See the [wiki](https://redmine.teklia.com/projects/research/wiki/Jean_Zay) for more details.
## Run a training job
Warning: there is no HTTP connection during a job.
You can debug using an interactive job. The following command will get you a new terminal with 1 GPU for 1 hour: `srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i`.
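The same command, reformatted for readability, followed by the steps you would typically run once inside the interactive shell (a sketch based on the batch script below; the conda environment path is an example to adapt):

```sh
# request an interactive shell with 1 GPU for 1 hour on the dev QoS
srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 \
    --qos=qos_gpu-dev --pty bash -i

# inside the interactive shell: set up the environment and launch training
module purge
module load anaconda-py3
conda activate /path/to/your/envs/dan/   # adapt to your own environment
teklia-dan train document
```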
You should run the actual training using a passive/batch job:
* Run `sbatch train_dan.sh`.
* The `train_dan.sh` file should look like the example below.
```sh
#!/bin/bash
#SBATCH --constraint=v100-32g
#SBATCH --qos=qos_gpu-t4 # partition
#SBATCH --job-name=dan_training # name of the job
#SBATCH --gres=gpu:1 # number of GPUs per node
#SBATCH --cpus-per-task=10 # number of cores per task
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --distribution=block:block # we pin the tasks on contiguous cores
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MPI tasks per node
#SBATCH --time=99:00:00 # max exec time
#SBATCH --output=dan_train_hugin_munin_page_%j.out # output log file
#SBATCH --error=dan_train_hugin_munin_page_%j.err # error log file
module purge # purging modules inherited by default
module load anaconda-py3
conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/
# print started commands
set -x
# execution
teklia-dan train document
```
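Once the script is ready, submit it and follow its logs. The log file names follow the `--output`/`--error` patterns declared above, with `%j` replaced by the job id:

```sh
# submit the batch job; sbatch prints the assigned job id
sbatch train_dan.sh

# follow the training log (replace <jobid> with the id printed by sbatch)
tail -f dan_train_hugin_munin_page_<jobid>.out
```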
## Supervise a job
* Use `squeue -u $USER`. This command should give an output similar to the one presented below.
```
(base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1762916 gpu_p13 pylaia_t ubz97wr R 23:07:54 1 r7i6n1
1762954 gpu_p13 pylaia_t ubz97wr R 22:35:57 1 r7i3n1
```
## Delete a job
* Use `scancel $JOBID` to cancel a specific job.
* Use `scancel -u $USER` to cancel all your jobs.
@@ -59,7 +59,10 @@ nav:
       - usage/datasets/index.md
       - Dataset extraction: usage/datasets/extract.md
       - Dataset formatting: usage/datasets/format.md
-  - Training: usage/train.md
+  - Training:
+      - usage/train/index.md
+      - Parameters: usage/train/parameters.md
+      - Jean Zay tutorial: usage/train/jeanzay.md
   - Generate: usage/generate.md
   - Predict: usage/predict.md
   - Documentation development: dev/build_docs.md
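To check that the new navigation renders as intended, the documentation can be previewed locally. A minimal sketch, assuming a standard MkDocs setup (the project's own build instructions live in `dev/build_docs.md`):

```sh
# serve the documentation with live reload, then open http://127.0.0.1:8000
mkdocs serve
```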