# Training on Jean Zay
See the wiki for more details.
## Run a training job
**Warning:** there is no HTTP connection during a job.

You can debug using an interactive job. The following command will open a new terminal with 1 GPU for 1 hour:

```shell
srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i
```
You should run the actual training using a passive/batch job:

- Run `sbatch train_dan.sh`.
- The `train_dan.sh` file should look like the example below.
```shell
#!/bin/bash
#SBATCH --constraint=v100-32g
#SBATCH --qos=qos_gpu-t4                # partition
#SBATCH --job-name=dan_training         # name of the job
#SBATCH --gres=gpu:1                    # number of GPUs per node
#SBATCH --cpus-per-task=10              # number of cores per task
#SBATCH --hint=nomultithread            # we get physical cores, not logical ones
#SBATCH --distribution=block:block      # we pin the tasks on contiguous cores
#SBATCH --nodes=1                       # number of nodes
#SBATCH --ntasks-per-node=1             # number of MPI tasks per node
#SBATCH --time=99:00:00                 # max execution time
#SBATCH --output=dan_train_hugin_munin_page_%j.out  # output log file
#SBATCH --error=dan_train_hugin_munin_page_%j.err   # error log file

module purge                            # purge modules inherited by default
module load anaconda-py3
conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/

# print executed commands
set -x

# execution
teklia-dan train document
```
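Once submitted, `sbatch` prints the job ID, and Slurm substitutes it for the `%j` placeholder in the `--output` and `--error` directives. A minimal sketch of recovering the log file name from the submission message (the job ID `1762916` below is a made-up example, not a real submission):

```shell
# sbatch prints a line such as "Submitted batch job <id>"; capture the ID
# so the log files named with %j in the script can be located.
# The message below is a hypothetical example, not live sbatch output.
SUBMIT_MSG="Submitted batch job 1762916"
JOBID="${SUBMIT_MSG##* }"               # keep the last whitespace-separated word
LOGFILE="dan_train_hugin_munin_page_${JOBID}.out"
echo "$LOGFILE"
```

You can then follow training progress with `tail -f "$LOGFILE"`.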
## Supervise a job
- Use `squeue -u $USER`. This command should give an output similar to the one presented below.
```shell
(base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1762916 gpu_p13 pylaia_t ubz97wr R 23:07:54 1 r7i6n1
1762954 gpu_p13 pylaia_t ubz97wr R 22:35:57 1 r7i3n1
```
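The fifth column (`ST`) holds the job state (`R` for running). As a sketch, the output can be filtered with standard tools to list only running job IDs; the text captured below reuses the example output above rather than a live query (in practice you would pipe `squeue -u $USER -h` into `awk`, where `-h` suppresses the header):

```shell
# Example squeue output copied from the listing above (no live query).
SQUEUE_OUT='1762916 gpu_p13 pylaia_t ubz97wr R 23:07:54 1 r7i6n1
1762954 gpu_p13 pylaia_t ubz97wr R 22:35:57 1 r7i3n1'

# Print the JOBID (column 1) of every job whose state (column 5) is "R".
RUNNING_IDS="$(echo "$SQUEUE_OUT" | awk '$5 == "R" {print $1}')"
echo "$RUNNING_IDS"
```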
## Delete a job
- Use `scancel $JOBID` to cancel a specific job.
- Use `scancel -u $USER` to cancel all your jobs.