Implement Dataset models

Refs https://redmine.teklia.com/issues/3653

We need new models in the arkindex.training Django app to represent training datasets, and manage elements used in different training stages.

classDiagram

    Element --> Corpus 
    Dataset --> Corpus  
    DatasetElement --> Dataset 
    DatasetElement --> Element

    class DatasetElement {
      +UUID PK
      +Element element
      +Dataset dataset
      +String set
      +Date created
      +Date updated
    }
    class Dataset{
      +UUID PK
      +Corpus corpus
      +String name
      +String description
      +User creator
      +DatasetMode mode
      +Task task
      +String[] sets
      +Date created
      +Date updated
    }

With the following constraints:

  • Unique together on DatasetElement:
    • element + dataset + set
    • no constraint on element + dataset alone, as an element can appear in several sets of the same dataset
  • Unique together on Dataset
    • corpus + name
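
To make the effect of these constraints concrete, here is a minimal sketch using Python's standard `sqlite3` module rather than the Django ORM. The table and column names are assumptions made for illustration; the real schema would come from the Django migrations, and only the columns involved in the constraints are shown.

```python
import sqlite3

# Hypothetical table and column names, reduced to the constrained columns.
SCHEMA = """
CREATE TABLE dataset (
    id TEXT PRIMARY KEY,
    corpus_id TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (corpus_id, name)
);
CREATE TABLE dataset_element (
    id TEXT PRIMARY KEY,
    element_id TEXT NOT NULL,
    dataset_id TEXT NOT NULL REFERENCES dataset (id),
    "set" TEXT NOT NULL,
    UNIQUE (element_id, dataset_id, "set")
);
"""


def is_accepted(conn, sql, params):
    """Run an INSERT and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

ADD_DATASET = "INSERT INTO dataset (id, corpus_id, name) VALUES (?, ?, ?)"
ADD_ELEMENT = (
    'INSERT INTO dataset_element (id, element_id, dataset_id, "set") '
    "VALUES (?, ?, ?, ?)"
)

# corpus + name must be unique, but the same name may be reused in another corpus.
first_dataset = is_accepted(conn, ADD_DATASET, ("d1", "c1", "My dataset"))
same_name_same_corpus = is_accepted(conn, ADD_DATASET, ("d2", "c1", "My dataset"))
same_name_other_corpus = is_accepted(conn, ADD_DATASET, ("d3", "c2", "My dataset"))

# element + dataset + set must be unique, but element + dataset alone may repeat.
in_train = is_accepted(conn, ADD_ELEMENT, ("de1", "e1", "d1", "train"))
in_test_too = is_accepted(conn, ADD_ELEMENT, ("de2", "e1", "d1", "test"))
in_train_twice = is_accepted(conn, ADD_ELEMENT, ("de3", "e1", "d1", "train"))
```

In this sketch only `same_name_same_corpus` and `in_train_twice` are rejected. In the Django models, the same rules would presumably be declared through the model `Meta` options rather than raw SQL.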

Details about fields:

  • DatasetMode has the following values:
    • Open (default) when elements can be freely assigned to the dataset,
    • Building when the dataset is being built in a process, and elements cannot be added anymore,
    • Complete when the process has successfully completed
    • Error when the building process has failed
  • DatasetElement.set is really a string without constraints (it will 95% of the time be "train" or "test" or "validation", but we want to support weird datasets with multiple validation sets for example)
  • Dataset.name and Dataset.description cannot be null (we need that description)
  • Dataset.task is nullable (and null by default, it will be populated once we have training artifacts available... and are able to start a training process with datasets)
  • Dataset.sets is an array of strings, which defaults to train, test, validation
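
The field defaults described above can be sketched in plain Python, with an `enum.Enum` and a dataclass standing in for the Django model. The string values of the enum members, the `default_sets` helper and the `Optional[str]` type used for `task` are assumptions made for illustration; only the member names and the default values come from this issue.

```python
import enum
from dataclasses import dataclass, field
from typing import List, Optional


class DatasetMode(enum.Enum):
    # Member names follow the issue; the stored string values are an assumption.
    Open = "open"
    Building = "building"
    Complete = "complete"
    Error = "error"


def default_sets() -> List[str]:
    # Return a fresh list each time so datasets never share one mutable default.
    return ["train", "test", "validation"]


@dataclass
class Dataset:
    # Plain-Python stand-in for the Django model, limited to the fields above.
    name: str
    description: str
    mode: DatasetMode = DatasetMode.Open
    task: Optional[str] = None  # placeholder type; would reference a Task
    sets: List[str] = field(default_factory=default_sets)

    def __post_init__(self):
        # name and description cannot be null.
        if self.name is None or self.description is None:
            raise ValueError("name and description are required")


dataset = Dataset(name="HTR lines", description="Line images for handwriting recognition")
other = Dataset(name="Other", description="Another dataset")
```

Here `dataset.mode` is `DatasetMode.Open`, `dataset.task` is `None`, and each instance gets its own copy of the three default set names. Django model fields likewise expect a callable for mutable defaults such as lists, which is why the default is written as a function here.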

Finally, a super-admin must be able to manage datasets through the Django admin. Do not expose DatasetElement as an inline, as the list would be extremely long and unusable.

We'll need to modify the Process model to support datasets later on, but this is out of scope for this part.

There will be no ACL directly on Datasets; we'll simply use the ones from the linked corpus.

Edited by Bastien Abadie