Investigate OOM error
DAN training always fails (at least on Jean Zay) due to an OOM error:
```
EPOCH 2201/3000: 54%|█████▍ | 1116/2057 [00:42<00:31, 29.91it/s, values={'loss_ce': 0.0759, 'cer': 0.0245, 'wer': 0.1367, 'wer_no_punct'
EPOCH 2201/3000: 55%|█████▍ | 1128/2057 [00:42<00:29, 31.50it/s, values={'loss_ce': 0.0759, 'cer': 0.0245, 'wer': 0.1367, 'wer_no_punct': 0.1367}]
/var/spool/slurmd/job1128823/slurm_script: line 45: 1020178 Killed
slurmstepd: error: Detected 18 oom-kill event(s) in StepId=1128823.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
I have already found some issues that need to be fixed; the fixes then need to be tested to confirm that training no longer gets killed:
- Image features are duplicated in the `train_batch` function;
- Images and labels are also duplicated multiple times during batch creation (see the first sketch below);
- Images are duplicated in the handcrafted transformations `ErosionDilation` and `DPIAdjusting` (see the second sketch below).
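For the batch-creation issue, the fix amounts to writing each image exactly once into a pre-allocated padded tensor instead of accumulating several intermediate copies while building the batch. A minimal sketch of that idea, assuming `(C, H, W)` tensors (the `pad_images` helper is hypothetical, not the actual DAN code):

```python
import torch


def pad_images(images, padding_value=0.0):
    """Collate a list of (C, H, W) tensors into a single padded batch.

    The destination tensor is allocated once and each source image is
    written into it exactly once, so no intermediate copies are kept.
    """
    channels = images[0].shape[0]
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)

    batch = torch.full(
        (len(images), channels, max_h, max_w),
        padding_value,
        dtype=images[0].dtype,
    )
    for i, img in enumerate(images):
        batch[i, :, : img.shape[1], : img.shape[2]] = img
    return batch
```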
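Similarly for the handcrafted transformations: OpenCV operations such as `cv2.erode`, `cv2.dilate` and `cv2.resize` already return a new array, so the input image can be passed to them directly rather than being copied first. A sketch of an erosion/dilation transform written that way (illustrative only; the class name and constructor parameters are assumptions, not the real `ErosionDilation` code):

```python
import cv2
import numpy as np


class RandomErosionDilation:
    """Randomly erode or dilate an image without copying the input.

    cv2.erode/cv2.dilate allocate their own output array, so the input
    image is used as-is instead of being duplicated beforehand.
    """

    def __init__(self, kernel_size=3, iterations=1):
        self.kernel = np.ones((kernel_size, kernel_size), dtype=np.uint8)
        self.iterations = iterations

    def __call__(self, image: np.ndarray) -> np.ndarray:
        if np.random.rand() < 0.5:
            return cv2.erode(image, self.kernel, iterations=self.iterations)
        return cv2.dilate(image, self.kernel, iterations=self.iterations)
```

The same reasoning should apply to the DPI adjustment: `cv2.resize` also returns a new array, so no prior copy of the input is needed there either.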