Use the path to the dataset instead of the path to each split folder

Clean up the per-split path handling in https://gitlab.teklia.com/atr/dan/-/blob/d5a223de1b94068bf05f32fb3063ddf6007c9dd1/dan/ocr/document/train.py#L83-L103. The dataset already ships a `labels.json` that lists the paths for each split, so we should load the dataset from it, given only the dataset path.
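A rough sketch of what this could look like, not the actual implementation. The helper name `load_split_paths`, the split names `train`/`val`/`test`, and the `labels.json` layout (one mapping of image path to transcription per split, with paths relative to the dataset folder) are all assumptions made for illustration and would need checking against the real file.

```python
import json
from pathlib import Path


def load_split_paths(dataset_path, splits=("train", "val", "test")):
    """Return {split: sorted list of image paths} read from <dataset_path>/labels.json.

    Hypothetical helper. Assumed labels.json layout:
        {"train": {"images/a.png": "some text", ...}, "val": {...}, "test": {...}}
    """
    dataset_path = Path(dataset_path)
    labels = json.loads((dataset_path / "labels.json").read_text())

    # Fail early if a split we expect is not described in labels.json.
    missing = [split for split in splits if split not in labels]
    if missing:
        raise KeyError(f"labels.json has no entry for split(s): {missing}")

    # Assumption: image paths are stored relative to the dataset folder.
    return {
        split: sorted(str(dataset_path / image) for image in labels[split])
        for split in splits
    }
```

With something like this, the training config would only need a single dataset path, and the per-split folders would no longer have to be spelled out one by one.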