
Manage dataset sets through a new model

https://redmine.teklia.com/issues/6264

Dataset.sets is currently an array of strings, and DatasetElement.set is a string, which makes set creation, deletion, renaming, leak detection, and element counting quite complex. As new features are likely to keep arriving on datasets, we need to reduce this technical debt sooner rather than later.

Model updates

A new DatasetSet model should be introduced, with an id, a name and a dataset_id foreign key. Nothing is nullable, and the name cannot be empty. That foreign key has a reverse name of sets, so that dataset.sets.all() returns the DatasetSet instances for this dataset. There is a unique constraint on (name, dataset) to prevent duplicate set names within a dataset.

DatasetElement.dataset is replaced by a set foreign key to a DatasetSet. Having both dataset and set on the model would mean duplicating information, and would cause the same mess we had with a GitRef being on both a repository and a revision at the same time. This foreign key should use on_delete=DO_NOTHING, so that we never hit issues with Django's deletion handling.

The ProcessDataset model is replaced by ProcessDatasetSet, linking a process to a DatasetSet. Dataset processes will now require users to select sets directly, rather than select datasets and then play around with checkboxes.

```mermaid
classDiagram
    direction LR
    Dataset *-- DatasetSet : sets
    DatasetSet *-- DatasetElement : elements
    ProcessDatasetSet --> DatasetSet
    ProcessDatasetSet --> Process
    DatasetElement --> Element
    class DatasetSet {
        id
        name
        dataset_id
    }
    class DatasetElement {
        id
        set_id
        element_id
    }
    class ProcessDatasetSet {
        id
        process_id
        set_id
    }
```
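
A minimal Django sketch of these three models, not a final implementation: the app labels, max_length, UUID primary keys, and the on_delete behaviour of every foreign key other than DatasetElement.set are assumptions.

```python
import uuid

from django.db import models


class DatasetSet(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    # Blank names are rejected through validation; nothing here is nullable
    name = models.CharField(max_length=250)
    # dataset.sets.all() returns the DatasetSet instances for this dataset
    dataset = models.ForeignKey("training.Dataset", related_name="sets", on_delete=models.DO_NOTHING)

    class Meta:
        constraints = [
            # No duplicate set names within one dataset
            models.UniqueConstraint(fields=["name", "dataset"], name="unique_set_name"),
        ]


class DatasetElement(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    element = models.ForeignKey("documents.Element", related_name="dataset_elements", on_delete=models.DO_NOTHING)
    # Replaces the old dataset FK and set string; DO_NOTHING keeps deletions
    # out of Django's own deletion handling
    set = models.ForeignKey(DatasetSet, related_name="elements", on_delete=models.DO_NOTHING)


class ProcessDatasetSet(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    process = models.ForeignKey("process.Process", related_name="sets", on_delete=models.DO_NOTHING)
    set = models.ForeignKey(DatasetSet, related_name="process_sets", on_delete=models.DO_NOTHING)
```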

The Django admin for datasets should include inline forms to manage the sets. The Django admin for processes is left unchanged.
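
A minimal sketch of that inline setup, with a hypothetical import path:

```python
from django.contrib import admin

from training.models import Dataset, DatasetSet  # hypothetical import path


class DatasetSetInline(admin.TabularInline):
    model = DatasetSet
    extra = 0


@admin.register(Dataset)
class DatasetAdmin(admin.ModelAdmin):
    inlines = [DatasetSetInline]
```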

Data migration

The migration process should be as follows:

  1. Create the DatasetSet and ProcessDatasetSet models, and the DatasetElement.set_id FK, nullable for now.
  2. Create DatasetSet instances for all of the sets in the old Dataset.sets array. This can be done in one INSERT INTO … SELECT with UNNEST, as sketched after this list.
  3. Create DatasetSet instances for any set found in the DatasetElement model, with an INSERT INTO … SELECT … ON CONFLICT DO NOTHING. Since we had zero guarantees that DatasetElement.set is always part of Dataset.sets, we need to handle this edge case for the last time.
  4. Update DatasetElement to set the set_id to the ID of the new DatasetSet, which is also doable in one query.
  5. Create ProcessDatasetSet instances for all of the sets from ProcessDataset, using an UNNEST as for DatasetSet. You will need to join on DatasetSet to find the set_id, which should implicitly exclude any set names that were in ProcessDataset.sets but do not exist as actual sets.
  6. Drop ProcessDataset, DatasetElement.dataset_id, the Dataset.sets array, and make DatasetElement.set_id non-nullable.
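
A sketch of steps 2 and 4 as raw SQL migrations, assuming Django's default table names (training_datasetset, etc.) and PostgreSQL 13+ for gen_random_uuid(); the surrounding schema operations are omitted.

```python
from django.db import migrations

# Step 2: expand each dataset's sets array into one DatasetSet row per name
CREATE_SETS = """
INSERT INTO training_datasetset (id, name, dataset_id)
SELECT gen_random_uuid(), name, dataset.id
FROM training_dataset AS dataset, UNNEST(dataset.sets) AS name
"""

# Step 4: point every DatasetElement at its new DatasetSet in one query
ASSIGN_SET_IDS = """
UPDATE training_datasetelement AS element
SET set_id = datasetset.id
FROM training_datasetset AS datasetset
WHERE datasetset.dataset_id = element.dataset_id
  AND datasetset.name = element.set
"""


class Migration(migrations.Migration):
    dependencies = [("training", "0002_datasetset")]  # hypothetical

    operations = [
        migrations.RunSQL(CREATE_SETS, reverse_sql=migrations.RunSQL.noop),
        migrations.RunSQL(ASSIGN_SET_IDS, reverse_sql=migrations.RunSQL.noop),
    ]
```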

API updates

Datasets

The ListCorpusDatasets and RetrieveDataset endpoints could still return the same sets string array. This should be possible with DRF's SlugRelatedField(many=True); despite its name, it does not actually require the field to be a slug. But to make set management simpler for the frontend, they can return both the set ID and name.
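
A possible serializer shape showing both options; the import paths and serializer names are assumptions.

```python
from rest_framework import serializers

from training.models import Dataset, DatasetSet  # hypothetical import path


class DatasetSetSerializer(serializers.ModelSerializer):
    class Meta:
        model = DatasetSet
        fields = ("id", "name")


class DatasetSerializer(serializers.ModelSerializer):
    # Option 1: keep the plain string array; SlugRelatedField only requires
    # the field to be unique within the queryset, not to be an actual slug.
    # sets = serializers.SlugRelatedField(slug_field="name", many=True, read_only=True)
    # Option 2: expose both the set ID and name for the frontend.
    sets = DatasetSetSerializer(many=True, read_only=True)

    class Meta:
        model = Dataset
        fields = ("id", "name", "sets")
```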

The CreateDataset endpoint still accepts the sets string array, and must create DatasetSet instances for each specified set with a bulk_create.
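
A sketch of that creation step, reusing the DatasetSerializer above; the ListField declaration is an assumption.

```python
from rest_framework import serializers


class CreateDatasetSerializer(DatasetSerializer):
    # Accept the same plain string array as before
    sets = serializers.ListField(child=serializers.CharField(min_length=1))

    def create(self, validated_data):
        set_names = validated_data.pop("sets")
        dataset = super().create(validated_data)
        # One INSERT for all of the new dataset's sets
        DatasetSet.objects.bulk_create(
            DatasetSet(dataset=dataset, name=name) for name in set_names
        )
        return dataset
```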

UpdateDataset and PartialUpdateDataset no longer handle updating sets, removing the complex code that manages renames.

CloneDataset should clone the dataset's sets, and use them when cloning the dataset elements.

Dataset elements

The response of ListDatasetElements does not need to change, to avoid breaking changes for workers. CreateDatasetElement and DestroyDatasetElement could be changed, but that would require updates to the CLI; it shouldn't be necessary, as they can retrieve a DatasetSet by name instead.
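
A possible name-based lookup in CreateDatasetElement's serializer; the context key, import path and error wording are hypothetical.

```python
from rest_framework import serializers

from training.models import DatasetSet  # hypothetical import path


class DatasetElementSerializer(serializers.Serializer):
    set = serializers.CharField()

    def validate_set(self, name):
        # Resolve the legacy set name into a DatasetSet on this dataset,
        # keeping the API contract unchanged for existing CLI versions
        try:
            return DatasetSet.objects.get(dataset=self.context["dataset"], name=name)
        except DatasetSet.DoesNotExist:
            raise serializers.ValidationError("This dataset has no set with this name.")
```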

Dataset sets

New endpoints are introduced to manage a dataset's sets, CreateDatasetSet, [Partial]UpdateDatasetSet and DestroyDatasetSet, under /api/v1/dataset/<dataset_id>/sets/ and /api/v1/dataset/<dataset_id>/sets/<set_id>/. They take a name as their request body. They must return an HTTP 404 if the dataset is not in a readable corpus. They must return an HTTP 400 if the dataset is readable but the user does not have admin access to the corpus, or if the dataset is not open.
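
A sketch of those shared checks; the access helpers and state enum (has_read_access, has_admin_access, DatasetState) are hypothetical names.

```python
from rest_framework.exceptions import NotFound, ValidationError


def check_dataset(request, dataset):
    # HTTP 404 when the dataset's corpus is not readable at all
    if not has_read_access(request.user, dataset.corpus):  # hypothetical helper
        raise NotFound
    # HTTP 400 when the corpus is readable but not administrable
    if not has_admin_access(request.user, dataset.corpus):  # hypothetical helper
        raise ValidationError(["You do not have admin access to this corpus."])
    # HTTP 400 when the dataset is not open
    if dataset.state != DatasetState.Open:  # hypothetical state enum
        raise ValidationError(["This dataset is not in an open state."])
```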

DestroyDatasetSet should also return an HTTP 400 if this is the last set of the dataset, or if the set is selected by a dataset process. It should destroy all of the related DatasetElements before destroying the set.
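
Inside the destroy view, this could look like the following sketch, reusing the reverse names assumed in the model sketch above:

```python
from rest_framework.exceptions import ValidationError
from rest_framework.generics import DestroyAPIView


class DestroyDatasetSet(DestroyAPIView):
    def perform_destroy(self, instance):
        # HTTP 400 when this is the dataset's last set
        if not instance.dataset.sets.exclude(id=instance.id).exists():
            raise ValidationError(["A dataset cannot be left without any sets."])
        # HTTP 400 when a dataset process selects this set
        if instance.process_sets.exists():
            raise ValidationError(["This set is selected by a dataset process."])
        # The set FK uses DO_NOTHING, so related elements must be removed explicitly
        instance.elements.all().delete()
        instance.delete()
```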

Process set selection

ListProcessDatasets is used by base-worker, but we got the green light to mess with it, as MLProd seems to actually prefer working with sets rather than datasets. It should be renamed ListProcessSets, because ListProcessDatasetSets is quite a mouthful. Its response now includes the set ID, the set name, and the dataset.

CreateProcessDataset and DestroyProcessDataset are renamed CreateProcessSet and DestroyProcessSet. It might be easiest to have them use /api/v1/process/<process_id>/set/<set_id>/ directly, but there is more freedom here as these endpoints are only used by the frontend.

UpdateProcessDataset and PartialUpdateProcessDataset are removed, since they were only used to select sets within a dataset, which now happens by selecting the sets directly.

Other changes

The corpus deletion code will need to be updated to delete dataset elements and process dataset sets, then dataset sets, then datasets.
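
A sketch of that ordering, assuming the models from above; with DO_NOTHING foreign keys, rows must be removed children first to avoid integrity errors.

```python
def delete_corpus_datasets(corpus):
    DatasetElement.objects.filter(set__dataset__corpus=corpus).delete()
    ProcessDatasetSet.objects.filter(set__dataset__corpus=corpus).delete()
    DatasetSet.objects.filter(dataset__corpus=corpus).delete()
    Dataset.objects.filter(corpus=corpus).delete()
```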

The corpus export SQLite structure does not need to change; a string_agg should allow building the comma-separated list of sets for the dataset table, and dataset_element requires just one join. Changing the structure could affect a few workers or scripts.
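
Possible export queries under the new schema, using the same assumed table names as the migration sketch above:

```python
# The dataset table keeps its comma-separated sets column
EXPORT_DATASETS = """
SELECT dataset.id, dataset.name, STRING_AGG(datasetset.name, ',') AS sets
FROM training_dataset AS dataset
INNER JOIN training_datasetset AS datasetset ON datasetset.dataset_id = dataset.id
GROUP BY dataset.id
"""

# dataset_element recovers its dataset ID and set name with a single join
EXPORT_DATASET_ELEMENTS = """
SELECT element.id, element.element_id, datasetset.dataset_id, datasetset.name AS set_name
FROM training_datasetelement AS element
INNER JOIN training_datasetset AS datasetset ON datasetset.id = element.set_id
"""
```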
