Manage dataset sets through a new model
https://redmine.teklia.com/issues/6264
`Dataset.sets` is currently an array of strings, and `DatasetElement.set` is a string, making set creation, deletion, renames, leak detection, or element counts quite complex. As new features are likely to continue to arrive on datasets, we need to reduce the technical debt sooner rather than later.
Model updates
A new `DatasetSet` model should be introduced, with an `id`, a `name` and a `dataset_id` foreign key. Nothing is nullable, and the name cannot be empty. The foreign key has a reverse name of `sets`, so that `dataset.sets.all()` returns the `DatasetSet` instances for this dataset. There is a unique constraint on `(name, dataset)` to prevent duplicate set names within a dataset.
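A minimal sketch of the model, assuming the project's usual UUID primary keys and that the dataset models live in a `training` app; `max_length` and the `on_delete` behaviour are assumptions as well:

```python
# Hypothetical sketch; max_length, on_delete and the app label are assumptions.
import uuid

from django.db import models


class DatasetSet(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    name = models.CharField(max_length=100)
    dataset = models.ForeignKey(
        "training.Dataset",
        on_delete=models.CASCADE,
        related_name="sets",
    )

    class Meta:
        constraints = [
            # One set name per dataset, and no empty names.
            models.UniqueConstraint(fields=["dataset", "name"], name="unique_dataset_set_name"),
            models.CheckConstraint(check=~models.Q(name=""), name="dataset_set_name_not_empty"),
        ]
```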
`DatasetElement.dataset` is replaced by a `set` foreign key to a `DatasetSet`. Having both `dataset` and `set` on the model means duplicating information, and would cause the same mess we had with a `GitRef` being on a `repository` and a `revision` at the same time. This foreign key should use `on_delete=DO_NOTHING`, so that we never hit issues with Django's deletion.
The `ProcessDataset` model is replaced by `ProcessDatasetSet`, linking a process to a `DatasetSet`. Dataset processes will now require users to select sets directly, rather than select datasets and then play around with checkboxes.
```mermaid
classDiagram
    direction LR
    Dataset *-- DatasetSet : sets
    DatasetSet *-- DatasetElement : elements
    ProcessDatasetSet --> DatasetSet
    ProcessDatasetSet --> Process
    DatasetElement --> Element

    class DatasetSet {
        id
        name
        dataset_id
    }
    class DatasetElement {
        id
        set_id
        element_id
    }
    class ProcessDatasetSet {
        id
        process_id
        set_id
    }
```
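Following the diagram, the two other models might be sketched as follows; app labels and the `on_delete` values other than `DatasetElement.set`'s are assumptions:

```python
# Hypothetical sketch of the two other models from the diagram above.
from django.db import models


class DatasetElement(models.Model):
    # The old `dataset` FK is gone; `set` replaces it, and DO_NOTHING keeps
    # Django's deletion collector away from dataset elements.
    set = models.ForeignKey(
        "training.DatasetSet",
        on_delete=models.DO_NOTHING,
        related_name="elements",
    )
    element = models.ForeignKey("documents.Element", on_delete=models.DO_NOTHING)


class ProcessDatasetSet(models.Model):
    process = models.ForeignKey("process.Process", on_delete=models.CASCADE)
    set = models.ForeignKey("training.DatasetSet", on_delete=models.DO_NOTHING)
```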
The Django admin for datasets should include inline forms to manage the sets. The Django admin for processes is left unchanged.
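A standard inline should be enough; a minimal sketch:

```python
# Hypothetical admin sketch using a standard TabularInline.
from django.contrib import admin

from .models import Dataset, DatasetSet


class DatasetSetInline(admin.TabularInline):
    model = DatasetSet
    extra = 0


@admin.register(Dataset)
class DatasetAdmin(admin.ModelAdmin):
    inlines = [DatasetSetInline]
```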
Data migration
The migration process should be as follows (the main queries are sketched after the list):

- Create the `DatasetSet` and `ProcessDatasetSet` models, and the `DatasetElement.set_id` FK, for now nullable.
- Create `DatasetSet` instances for all of the sets in the old `Dataset.sets` array. This can be done in one `INSERT INTO … SELECT` with `UNNEST`.
- Create `DatasetSet` instances for any set found in the `DatasetElement` model, with an `INSERT INTO … SELECT … ON CONFLICT DO NOTHING`. Since we had zero guarantees that `DatasetElement.set` is always part of `Dataset.sets`, we need to handle this edge case for the last time.
- Update `DatasetElement` to set the `set_id` to the ID of the new `DatasetSet`, which is also doable in one query.
- Create `ProcessDatasetSet` instances for all of the sets from `ProcessDataset`, using an `UNNEST` like for `DatasetSet`. You will need to join on `DatasetSet` to find the `set_id`, and that should implicitly exclude any set names that were in `ProcessDataset.sets` and that do not exist.
- Drop `ProcessDataset`, `DatasetElement.dataset_id` and the `Dataset.sets` array, and make `DatasetElement.set_id` non-nullable.
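As an illustration, the `DatasetSet` creation and `set_id` assignment steps could be sketched as raw SQL in a data migration. Table names are derived from the model names and are assumptions, the old `set` column is quoted since `SET` is a reserved word, and `gen_random_uuid()` assumes PostgreSQL 13+:

```python
# Hypothetical data migration sketch; table and app names are assumptions.
from django.db import migrations

CREATE_SETS_FROM_ARRAYS = """
INSERT INTO training_datasetset (id, name, dataset_id)
SELECT gen_random_uuid(), unnest(sets), id
FROM training_dataset;
"""

CREATE_MISSING_SETS = """
INSERT INTO training_datasetset (id, name, dataset_id)
SELECT gen_random_uuid(), element."set", element.dataset_id
FROM training_datasetelement AS element
ON CONFLICT (dataset_id, name) DO NOTHING;
"""

ASSIGN_SET_IDS = """
UPDATE training_datasetelement AS element
SET set_id = datasetset.id
FROM training_datasetset AS datasetset
WHERE datasetset.dataset_id = element.dataset_id
  AND datasetset.name = element."set";
"""


class Migration(migrations.Migration):
    dependencies = [("training", "0042_datasetset")]  # Hypothetical dependency

    operations = [
        migrations.RunSQL(CREATE_SETS_FROM_ARRAYS, migrations.RunSQL.noop),
        migrations.RunSQL(CREATE_MISSING_SETS, migrations.RunSQL.noop),
        migrations.RunSQL(ASSIGN_SET_IDS, migrations.RunSQL.noop),
    ]
```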
API updates
Datasets
The `ListCorpusDatasets` and `RetrieveDataset` endpoints could still return the same `sets` string array. This should be possible with DRF's `SlugRelatedField(many=True)`; despite its name, it does not actually care about whether the field is a slug. But to make set management simpler for the frontend, they can return both the set ID and name.
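A sketch of the read side keeping the string array output; the serializer name is an assumption:

```python
# Hypothetical serializer sketch; keeps the existing `sets` string array.
from rest_framework import serializers

from .models import Dataset


class DatasetSerializer(serializers.ModelSerializer):
    # Despite the name, SlugRelatedField just reads the named attribute.
    sets = serializers.SlugRelatedField(slug_field="name", many=True, read_only=True)

    class Meta:
        model = Dataset
        fields = ["id", "name", "sets"]
```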
The `CreateDataset` endpoint still accepts the `sets` string array, and must create `DatasetSet` instances for each specified set with a `bulk_create`.
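The creation side could then turn the names into `DatasetSet` rows in a single `INSERT`; a sketch, with the serializer name assumed:

```python
# Hypothetical sketch of set creation with a single bulk INSERT.
from rest_framework import serializers

from .models import Dataset, DatasetSet


class DatasetCreateSerializer(serializers.ModelSerializer):
    sets = serializers.ListField(child=serializers.CharField(), write_only=True)

    class Meta:
        model = Dataset
        fields = ["id", "name", "sets"]

    def create(self, validated_data):
        set_names = validated_data.pop("sets")
        dataset = super().create(validated_data)
        # One query for all of the requested sets.
        DatasetSet.objects.bulk_create(
            DatasetSet(dataset=dataset, name=name) for name in set_names
        )
        return dataset
```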
`UpdateDataset` and `PartialUpdateDataset` no longer handle updating `sets`, removing the complex code that manages renames.
`CloneDataset` should clone the dataset's sets, and use them when cloning the dataset elements.
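A sketch of the cloning logic, with hypothetical helper names: copy the sets first, then reuse the old-to-new mapping when copying the elements.

```python
# Hypothetical cloning sketch; maps each old set to its copy.
from .models import DatasetElement, DatasetSet


def clone_dataset_sets(dataset, clone):
    set_map = {
        dataset_set.id: DatasetSet(dataset=clone, name=dataset_set.name)
        for dataset_set in dataset.sets.all()
    }
    DatasetSet.objects.bulk_create(set_map.values())
    DatasetElement.objects.bulk_create(
        DatasetElement(element_id=element.element_id, set=set_map[element.set_id])
        for element in DatasetElement.objects.filter(set__dataset=dataset)
    )
```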
Dataset elements
The response of `ListDatasetElements` does not need to change, to avoid breaking changes for workers. `CreateDatasetElement` and `DestroyDatasetElement` could be changed, but that would require updates to the CLI, and it shouldn't be necessary since they can retrieve a `DatasetSet` by name.
Dataset sets
New endpoints are introduced to manage a dataset's sets: `CreateDatasetSet`, `[Partial]UpdateDatasetSet` and `DestroyDatasetSet`, under `/api/v1/dataset/<id>/sets/` and `/api/v1/dataset/<id>/sets/<id>/`. They take a `name` as their request body. They must return an HTTP 404 if the dataset is not in a readable corpus, and an HTTP 400 if the dataset is readable but the user does not have admin access to the corpus, or if the dataset is not `open`.
`DestroyDatasetSet` should also return an HTTP 400 if this is the last set of the dataset, or if the set is selected by a dataset process. It should destroy all of the related `DatasetElement` instances before destroying the set.
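The destroy logic might look like this sketch; the class name matches the endpoint, everything else is illustrative:

```python
# Hypothetical sketch of the DestroyDatasetSet checks.
from rest_framework.exceptions import ValidationError
from rest_framework.generics import DestroyAPIView

from .models import ProcessDatasetSet


class DestroyDatasetSet(DestroyAPIView):
    # The 404/400 access checks described above are assumed to be handled
    # by the queryset and permission classes.

    def perform_destroy(self, instance):
        if not instance.dataset.sets.exclude(id=instance.id).exists():
            raise ValidationError("Cannot delete the last set of a dataset.")
        if ProcessDatasetSet.objects.filter(set=instance).exists():
            raise ValidationError("This set is selected by a dataset process.")
        # The FK uses DO_NOTHING, so related elements must be removed first.
        instance.elements.all().delete()
        instance.delete()
```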
Process set selection
`ListProcessDatasets` is used by base-worker, but we got the green light to mess with it, as MLProd seems to actually prefer working with sets rather than datasets. It should be renamed `ListProcessSets`, because `ListProcessDatasetSets` is quite a mouthful. Its response now includes the set ID, set name, and dataset.
`CreateProcessDataset` and `DestroyProcessDataset` are renamed `CreateProcessSet` and `DestroyProcessSet`. It might be easiest to have them use `/api/v1/<process_id>/set/<set_id>/` directly, but there is more freedom here as these endpoints are only used by the frontend.
`UpdateProcessDataset` and `PartialUpdateProcessDataset` are removed, since they were used to select sets within a dataset.
Other changes
The corpus deletion code will need to be updated to delete dataset elements and process dataset sets, then dataset sets, then datasets.
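That order could be expressed along these lines; the queryset paths are assumptions:

```python
# Hypothetical deletion order sketch for corpus deletion.
from .models import Dataset, DatasetElement, DatasetSet, ProcessDatasetSet


def delete_corpus_datasets(corpus):
    DatasetElement.objects.filter(set__dataset__corpus=corpus).delete()
    ProcessDatasetSet.objects.filter(set__dataset__corpus=corpus).delete()
    DatasetSet.objects.filter(dataset__corpus=corpus).delete()
    Dataset.objects.filter(corpus=corpus).delete()
```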
The corpus export SQLite structure does not need to change; a `string_agg` should allow building the comma-separated list of sets for the `dataset` table, and `dataset_element` requires just one join. Changing the structure could affect a few workers or scripts.
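For instance, the export queries could be sketched as follows, with the table names assumed as in the migration sketch above:

```python
# Hypothetical export queries; string_agg rebuilds the old string array.
DATASET_QUERY = """
SELECT dataset.id, dataset.name, string_agg(datasetset.name, ',') AS sets
FROM training_dataset AS dataset
INNER JOIN training_datasetset AS datasetset ON datasetset.dataset_id = dataset.id
GROUP BY dataset.id;
"""

DATASET_ELEMENT_QUERY = """
SELECT element.id, element.element_id, datasetset.dataset_id, datasetset.name AS set_name
FROM training_datasetelement AS element
INNER JOIN training_datasetset AS datasetset ON datasetset.id = element.set_id;
"""
```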