New command to create train/test/val element splits
Refs https://gitlab.com/teklia/requests/-/issues/809
A common task for ML Scientists is to create a new folder on Arkindex with 3 sub-folders (train / test / validation). These folders are then used by training workers to build a new model (this is the poor man's dataset management).
We need a script that takes as input:
- either a corpus ID or a folder ID that will serve as the source for the elements
- optional multiple `--element-type`, defaults to `["page"]`
- ratios to pick elements and assign them to folders:
  - `--train-ratio`, defaults to 0.4
  - `--test-ratio`, defaults to 0.3
  - `--validation-ratio`, defaults to 0.3
- optional `--folder-name`, defaults to "Training dataset"
- optional integer `--nb-elements`, defaults to None
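
The inputs above could be declared with `argparse` roughly as follows. This is only a sketch: the `--corpus` and `--folder` option names are placeholders (the issue only asks for "a corpus ID or a folder ID"), and `build_parser` / `parse_args` are hypothetical helper names, not part of any existing CLI.

```python
import argparse
import math


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Create train/test/validation element splits."
    )
    # Exactly one source must be given. The option names --corpus and
    # --folder are assumptions made for this sketch.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--corpus", help="ID of the source corpus")
    source.add_argument("--folder", help="ID of the source folder")
    # default=None on purpose: with action="append", argparse appends to a
    # list default instead of replacing it, so the fallback is applied later.
    parser.add_argument("--element-type", action="append", default=None)
    parser.add_argument("--train-ratio", type=float, default=0.4)
    parser.add_argument("--test-ratio", type=float, default=0.3)
    parser.add_argument("--validation-ratio", type=float, default=0.3)
    parser.add_argument("--folder-name", default="Training dataset")
    parser.add_argument("--nb-elements", type=int, default=None)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.element_type is None:
        args.element_type = ["page"]
    # Compare with a tolerance: float ratios rarely sum to exactly 1.
    total = args.train_ratio + args.test_ratio + args.validation_ratio
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        parser.error(f"the sum of ratios must be equal to 1, got {total}")
    if args.nb_elements is not None and args.nb_elements <= 0:
        parser.error("--nb-elements must be a positive integer")
    return args
```

Calling `parse_args(["--corpus", "<corpus-id>", "--element-type", "page", "--element-type", "text_line"])` would then yield both element types and the default ratios.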
The command workflow is like this:
- randomly list the elements that will be used:
  - from the corpus when set
  - from the folder ID when set
  - by using the element types available
  - until you reach the `nb_elements` mentioned:
    - when a positive integer is provided
    - no limit when it's set to None (list all elements)
- once you have all the element IDs, split them among the 3 destinations by applying the provided ratios
  - ⚠ the sum of the ratios must be equal to 1!
- create the training folder at the root of the corpus using the mentioned name
- create the three sub-folders (Train, Test, Validation)
- finally, link all the elements in their respective folders
- display a full link towards the Training folder
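
The selection and splitting steps of this workflow can be sketched in plain Python. The function name `split_elements` and its `seed` parameter are hypothetical, and the sketch assumes the element IDs have already been listed into memory: it shuffles the full list and truncates it to `nb_elements`, which is one possible reading of "list randomly until you reach `nb_elements`". Folder creation and element linking through the Arkindex API are deliberately left out.

```python
import math
import random


def split_elements(
    element_ids,
    train_ratio=0.4,
    test_ratio=0.3,
    validation_ratio=0.3,
    nb_elements=None,
    seed=None,
):
    """Shuffle element IDs and split them into Train / Test / Validation."""
    ratios = (train_ratio, test_ratio, validation_ratio)
    if any(r < 0 or r > 1 for r in ratios):
        raise ValueError("each ratio must be between 0 and 1")
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError("the sum of ratios must be equal to 1")
    if nb_elements is not None and nb_elements <= 0:
        raise ValueError("nb_elements must be a positive integer or None")

    ids = list(element_ids)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(ids)
    if nb_elements is not None:
        ids = ids[:nb_elements]

    total = len(ids)
    train_end = round(total * train_ratio)
    test_end = min(train_end + round(total * test_ratio), total)
    # Validation takes the remainder, so every selected ID lands in one folder.
    return {
        "Train": ids[:train_end],
        "Test": ids[train_end:test_end],
        "Validation": ids[test_end:],
    }
```

Giving the remainder to Validation avoids losing or duplicating elements when the ratios do not divide the element count evenly; the exact rounding policy is a design choice that the final command may settle differently.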