# Training on Jean Zay

See the [wiki](https://redmine.teklia.com/projects/research/wiki/Jean_Zay) for more details.

## Run a training job
Warning: there is no HTTP connection during a job.

You can debug using an interactive job. The following command will get you a new terminal with 1 gpu for 1 hour: `srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i`.

You should run the actual training using a passive/batch job:
* Run `sbatch train_dan.sh`.
* The `train_dan.sh` file should look like the example below.

```sh
#!/bin/bash
#SBATCH --constraint=v100-32g
#SBATCH --qos=qos_gpu-t4                # partition
#SBATCH --job-name=dan_training         # name of the job
#SBATCH --gres=gpu:1                    # number of GPUs per node
#SBATCH --cpus-per-task=10              # number of cores per tasks
#SBATCH --hint=nomultithread            # we get physical cores not logical
#SBATCH --distribution=block:block      # we pin the tasks on contiguous cores
#SBATCH --nodes=1                       # number of nodes
#SBATCH --ntasks-per-node=1             # number of MPI tasks per node
#SBATCH --time=99:00:00                 # max exec time
#SBATCH --output=dan_train_hugin_munin_page_%j.out         # output log file
#SBATCH --error=dan_train_hugin_munin_page_%j.err          # error log file

module purge                            # purging modules inherited by default
module load anaconda-py3

conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/

# print started commands
set -x

# execution
teklia-dan train document
```

## Supervise a job
* Use `squeue -u $USER`. This command should give an output similar to the one presented below.
```
(base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1762916   gpu_p13 pylaia_t  ubz97wr  R   23:07:54      1 r7i6n1
           1762954   gpu_p13 pylaia_t  ubz97wr  R   22:35:57      1 r7i3n1
```

## Delete a job
* Use `scancel $JOBID` to cancel a specific job.
* Use `scancel -u $USER` to cancel all your jobs.