# Training on Jean Zay See the [wiki](https://redmine.teklia.com/projects/research/wiki/Jean_Zay) for more details. ## Run a training job Warning: there is no HTTP connection during a job. You can debug using an interactive job. The following command will get you a new terminal with 1 gpu for 1 hour: `srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i`. You should run the actual training using a passive/batch job: * Run `sbatch train_dan.sh`. * The `train_dan.sh` file should look like the example below. ```sh #!/bin/bash #SBATCH --constraint=v100-32g #SBATCH --qos=qos_gpu-t4 # partition #SBATCH --job-name=dan_training # name of the job #SBATCH --gres=gpu:1 # number of GPUs per node #SBATCH --cpus-per-task=10 # number of cores per tasks #SBATCH --hint=nomultithread # we get physical cores not logical #SBATCH --distribution=block:block # we pin the tasks on contiguous cores #SBATCH --nodes=1 # number of nodes #SBATCH --ntasks-per-node=1 # number of MPI tasks per node #SBATCH --time=99:00:00 # max exec time #SBATCH --output=dan_train_hugin_munin_page_%j.out # output log file #SBATCH --error=dan_train_hugin_munin_page_%j.err # error log file module purge # purging modules inherited by default module load anaconda-py3 conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/ # print started commands set -x # execution teklia-dan train document ``` ## Supervise a job * Use `squeue -u $USER`. This command should give an output similar to the one presented below. ``` (base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 1762916 gpu_p13 pylaia_t ubz97wr R 23:07:54 1 r7i6n1 1762954 gpu_p13 pylaia_t ubz97wr R 22:35:57 1 r7i3n1 ``` ## Delete a job * Use `scancel $JOBID` to cancel a specific job. * Use `scancel -u $USER` to cancel all your jobs.