Implement Dataset models
Refs https://redmine.teklia.com/issues/3653
We need new models in the `arkindex.training` Django app to represent training datasets and manage the elements used in different training stages.
classDiagram
Element --> Corpus
Dataset --> Corpus
DatasetElement --> Dataset
DatasetElement --> Element
class DatasetElement {
+UUID PK
+Element element
+Dataset dataset
+String set
+Date created
+Date updated
}
class Dataset{
+UUID PK
+Corpus corpus
+String name
+String description
+User creator
+DatasetMode mode
+Task task
+String[] sets
+Date created
+Date updated
}
With the following constraints:
- Unique together on `DatasetElement`: `element + dataset + set`
- No constraint on `element + dataset`
- Unique together on `Dataset`: `corpus + name`
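As a rough illustration of what these constraints allow and reject, here is a minimal sketch using an in-memory SQLite database from the Python standard library. The table and column names (`dataset`, `dataset_element`, `set_name`, ...) and the helpers `build_db` and `try_insert` are hypothetical; the real schema would come from the Django models and their migrations.

```python
import sqlite3

# Hypothetical schema, reduced to the columns involved in the constraints.
SCHEMA = """
CREATE TABLE dataset (
    id TEXT PRIMARY KEY,
    corpus_id TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (corpus_id, name)
);
CREATE TABLE dataset_element (
    id TEXT PRIMARY KEY,
    element_id TEXT NOT NULL,
    dataset_id TEXT NOT NULL REFERENCES dataset (id),
    set_name TEXT NOT NULL,
    -- Deliberately no UNIQUE (element_id, dataset_id)
    UNIQUE (element_id, dataset_id, set_name)
);
"""

ADD_DATASET = "INSERT INTO dataset (id, corpus_id, name) VALUES (?, ?, ?)"
ADD_ELEMENT = (
    "INSERT INTO dataset_element (id, element_id, dataset_id, set_name) "
    "VALUES (?, ?, ?, ?)"
)


def build_db():
    """Create an in-memory database holding the two sketch tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row was inserted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this sketch, the same element can be inserted into both the `train` and `validation` sets of one dataset, while a second `train` row for that element is rejected. Two datasets may share a name only if they belong to different corpora.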
Details about fields:
- `DatasetMode` has the following values:
  - `Open` (default) when elements can be freely assigned to the dataset,
  - `Building` when the dataset is being built in a process, and elements cannot be added anymore,
  - `Complete` when the process has successfully completed,
  - `Error` when the building process has failed
- `DatasetElement.set` is really a string without constraints (it will be "train", "test" or "validation" 95% of the time, but we want to support unusual datasets, for example with multiple validation sets)
- `Dataset.name` and `Dataset.description` cannot be null (we need that description)
- `Dataset.task` is nullable (and null by default; it will be populated once we have training artifacts available and are able to start a training process with datasets)
- `Dataset.sets` is an array of strings, which defaults to `train`, `test`, `validation`
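The field defaults described above could be sketched in plain Python as follows. This is not the Django model itself: it uses a standard-library dataclass only to make the defaults and nullability explicit. The stored enum values (`"open"`, `"building"`, ...), the `*_id` attribute names and the `accepts_new_elements` helper are assumptions. Treating `Complete` and `Error` as closed to new elements is also an assumption, since the description only states this for `Building`.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DatasetMode(Enum):
    # The string values are assumptions; only the member names come from the issue.
    Open = "open"
    Building = "building"
    Complete = "complete"
    Error = "error"


def default_sets() -> List[str]:
    """Return a fresh list so that datasets never share one mutable default."""
    return ["train", "test", "validation"]


@dataclass
class Dataset:
    corpus_id: uuid.UUID
    name: str
    description: str
    creator_id: uuid.UUID
    mode: DatasetMode = DatasetMode.Open
    task_id: Optional[uuid.UUID] = None  # nullable, null by default
    sets: List[str] = field(default_factory=default_sets)
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def __post_init__(self):
        # name and description cannot be null
        if self.name is None or self.description is None:
            raise ValueError("name and description are required")

    def accepts_new_elements(self) -> bool:
        # Assumption: only Open datasets accept new elements.
        return self.mode is DatasetMode.Open
```

In a Django model, a callable default such as `default_sets` plays the same role for an array field: each new row gets its own list rather than a shared mutable object.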
Finally, a super-admin must be able to manage datasets through the admin. Do not expose `DatasetElement` as an inline, as it would be a very long, unusable list.
We'll need to modify the `Process` model to support datasets later on, but this is out of scope for this part.
There will be no ACL directly on datasets; we'll simply use the ones from the linked corpus.
Edited by Bastien Abadie