Implement Dataset models
Refs https://redmine.teklia.com/issues/3653
We need new models in the arkindex.training Django app to represent training datasets, and manage elements used in different training stages.
classDiagram
Element --> Corpus
Dataset --> Corpus
DatasetElement --> Dataset
DatasetElement --> Element
class DatasetElement {
+UUID PK
+Element element
+Dataset dataset
+String set
+Date created
+Date updated
}
class Dataset{
+UUID PK
+Corpus corpus
+String name
+String description
+User creator
+DatasetMode mode
+Task task
+String[] sets
+Date created
+Date updated
}
With the following constraints:
- Unique together on DatasetElement: element + dataset + set
  - no constraint on element + dataset alone
- Unique together on Dataset: corpus + name
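To make the intent of these constraints concrete, here is a minimal sketch using an in-memory SQLite database rather than the real Django models. The table names, the column names (including `set_name`, used instead of `set` to avoid the SQL keyword) and the short string identifiers are assumptions made for illustration only; the actual models would use UUIDs and Django's own constraint declarations.

```python
import sqlite3

# In-memory database standing in for the real schema (illustrative names only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    id TEXT PRIMARY KEY,
    corpus_id TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (corpus_id, name)
);
CREATE TABLE dataset_element (
    id TEXT PRIMARY KEY,
    element_id TEXT NOT NULL,
    dataset_id TEXT NOT NULL,
    set_name TEXT NOT NULL,
    UNIQUE (element_id, dataset_id, set_name)
);
""")

ADD_DATASET = "INSERT INTO dataset (id, corpus_id, name) VALUES (?, ?, ?)"
ADD_ELEMENT = (
    "INSERT INTO dataset_element (id, element_id, dataset_id, set_name) "
    "VALUES (?, ?, ?, ?)"
)


def try_insert(sql, params):
    """Return True if the row was inserted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "first_dataset": try_insert(ADD_DATASET, ("d1", "c1", "My dataset")),
    # Same name in the same corpus: rejected.
    "same_name_same_corpus": try_insert(ADD_DATASET, ("d2", "c1", "My dataset")),
    # Same name in another corpus: accepted.
    "same_name_other_corpus": try_insert(ADD_DATASET, ("d3", "c2", "My dataset")),
    "element_in_train": try_insert(ADD_ELEMENT, ("de1", "e1", "d1", "train")),
    # Same element, same dataset, different set: accepted.
    "element_in_validation": try_insert(ADD_ELEMENT, ("de2", "e1", "d1", "validation")),
    # Same element, same dataset, same set: rejected.
    "element_in_train_again": try_insert(ADD_ELEMENT, ("de3", "e1", "d1", "train")),
}
```

The point of the three-column constraint is that one element may appear in several sets of the same dataset, while an exact duplicate of an (element, dataset, set) triple is refused.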
Details about fields:
- DatasetMode has the following values:
  - Open (default) when elements can be freely assigned to the dataset,
  - Building when the dataset is being built in a process, and elements cannot be added anymore,
  - Complete when the process has successfully completed,
  - Error when the building process has failed.
- DatasetElement.set is really a string without constraints (95% of the time it will be "train", "test" or "validation", but we want to support unusual datasets, with multiple validation sets for example)
- Dataset.name and Dataset.description cannot be null (we need that description)
- Dataset.task is nullable (and null by default; it will be populated once we have training artifacts available and are able to start a training process with datasets)
- Dataset.sets is an array of strings, which defaults to train, test, validation
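The field rules above can be sketched in plain Python as follows. This is not the Django model itself: the enum string values, the `task_id` and `corpus_id` attribute names, the `creator` type and the `accepts_new_elements` helper are all assumptions made for illustration. In particular, the issue only states that elements can be added in Open mode and cannot in Building mode; treating Complete and Error as closed too is a guess.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class DatasetMode(Enum):
    # The string values are assumptions; only the four names come from the issue.
    OPEN = "open"
    BUILDING = "building"
    COMPLETE = "complete"
    ERROR = "error"


def default_sets() -> List[str]:
    # A fresh list per dataset, so that instances never share the default.
    return ["train", "test", "validation"]


def now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Dataset:
    corpus_id: uuid.UUID
    name: str
    description: str
    creator: str
    mode: DatasetMode = DatasetMode.OPEN
    task_id: Optional[uuid.UUID] = None  # nullable, null by default
    sets: List[str] = field(default_factory=default_sets)
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created: datetime = field(default_factory=now)
    updated: datetime = field(default_factory=now)

    def __post_init__(self):
        # name and description cannot be null
        if self.name is None or self.description is None:
            raise ValueError("name and description cannot be null")

    def accepts_new_elements(self) -> bool:
        # Assumption: only Open datasets accept new elements.
        return self.mode is DatasetMode.OPEN


@dataclass
class DatasetElement:
    element_id: uuid.UUID
    dataset_id: uuid.UUID
    set: str  # free-form string, deliberately not validated against Dataset.sets
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created: datetime = field(default_factory=now)
    updated: datetime = field(default_factory=now)
```

A dataset created with only the required fields starts in Open mode, with no task and the three default sets; switching `mode` to `DatasetMode.BUILDING` makes `accepts_new_elements()` return `False`.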
Finally, a super-admin must be able to manage datasets through the admin. Do not expose DatasetElement as an inline, as the list would be far too long to be usable.
We'll need to modify the Process model to support datasets later on, but this is out of scope for this part.
There will be no ACL directly on Datasets; we'll simply use the ones from the linked corpus.
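As a rough illustration of delegating access checks to the corpus, the sketch below resolves a user's rights on a dataset purely from its corpus. Everything here is hypothetical: the `rights` mapping, the numeric levels and the helper names are stand-ins and do not describe the existing Arkindex ACL implementation.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical access levels, for illustration only.
READ, WRITE, ADMIN = 10, 50, 100


@dataclass
class Corpus:
    name: str
    # Hypothetical stand-in for the corpus ACL: user name -> access level.
    rights: Dict[str, int] = field(default_factory=dict)


@dataclass
class Dataset:
    name: str
    corpus: Corpus
    # Note: no rights stored on the dataset itself.


def dataset_access_level(user: str, dataset: Dataset) -> int:
    """Look up the user's level on the dataset's corpus (0 when absent)."""
    return dataset.corpus.rights.get(user, 0)


def can_read_dataset(user: str, dataset: Dataset) -> bool:
    return dataset_access_level(user, dataset) >= READ


def can_edit_dataset(user: str, dataset: Dataset) -> bool:
    return dataset_access_level(user, dataset) >= WRITE


corpus = Corpus(name="Demo corpus", rights={"alice": WRITE, "bob": READ})
dataset = Dataset(name="Demo dataset", corpus=corpus)
```

With this shape, changing a user's rights on the corpus immediately changes what they can do with every dataset in it, and there is nothing to keep in sync on the dataset side.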
Edited by Bastien Abadie