Update DatasetWorker argument and use ListProcessSets

The --dataset argument will be removed in favor of --set, whose values use the format <dataset_id>:<set_name>. This format should be checked during argument parsing.
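A minimal sketch of how the format check could be wired into argparse. The helper name check_dataset_set, the assumption that dataset IDs are UUIDs, the choice to return a parsed tuple, and the nargs/default settings are all illustrative and not taken from the existing code base.

```python
import argparse
import uuid


def check_dataset_set(value: str) -> tuple[uuid.UUID, str]:
    """Parse a `<dataset_id>:<set_name>` string, rejecting malformed values.

    Hypothetical helper: assumes the dataset ID is a UUID.
    """
    # Split on the first colon only, so the set name keeps any later colons.
    dataset_id, sep, set_name = value.partition(":")
    if not sep or not set_name:
        raise argparse.ArgumentTypeError(
            f"'{value}' is not in the form <dataset_id>:<set_name>"
        )
    try:
        parsed_id = uuid.UUID(dataset_id)
    except ValueError as e:
        raise argparse.ArgumentTypeError(
            f"'{dataset_id}' is not a valid UUID"
        ) from e
    return parsed_id, set_name


parser = argparse.ArgumentParser()
# argparse applies `type` to each value and reports ArgumentTypeError
# as a regular usage error.
parser.add_argument("--set", type=check_dataset_set, nargs="+", default=[])
```

With this in place, a malformed value fails at parsing time with a usage error instead of surfacing later as a failed API call.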

Remove or rename:

  • DatasetWorker.list_datasets is removed
  • DatasetMixin.list_process_datasets is renamed to DatasetMixin.list_process_sets (called even in read-only mode)

Create a new model, arkindex_worker.models.Set, with:

  • name: str, name of the set,
  • dataset: Dataset, dataset the set belongs to,
  • dataset_path, property (port of Dataset.filepath).

In read-only mode, information about each set will be produced by an iterator. Each value in self.args.set is a string <dataset_id>:<set_name>, and the result for each value should be an arkindex_worker.models.Set. To minimize API calls, we should call RetrieveDataset using the provided ID only once per dataset, and store the results in a datasets: dict[str, Dataset] instance attribute (mapping dataset IDs to datasets). This logic requires a proper generator.
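The caching generator described above could look roughly like the sketch below. Dataset and Set are simplified dataclass stand-ins for the real models, ReadOnlySets is a hypothetical holder class (not the actual DatasetMixin), and retrieve_dataset is an injected callable standing in for the RetrieveDataset API call. The dataset_path property is left out because the behaviour of Dataset.filepath is not described here.

```python
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass


@dataclass
class Dataset:
    """Stand-in for arkindex_worker.models.Dataset (hypothetical fields)."""

    id: str
    name: str


@dataclass
class Set:
    """Stand-in for the proposed arkindex_worker.models.Set."""

    name: str
    dataset: Dataset


class ReadOnlySets:
    """Hypothetical holder for the caching logic."""

    def __init__(self, retrieve_dataset: Callable[[str], Dataset]) -> None:
        # Stand-in for the RetrieveDataset API call.
        self.retrieve_dataset = retrieve_dataset
        # Datasets already fetched, keyed by dataset ID.
        self.datasets: dict[str, Dataset] = {}

    def list_process_sets(self, values: Iterable[str]) -> Iterator[Set]:
        """Yield one Set per `<dataset_id>:<set_name>` value."""
        for value in values:
            dataset_id, _, set_name = value.partition(":")
            # Fetch each dataset at most once, however many sets use it.
            if dataset_id not in self.datasets:
                self.datasets[dataset_id] = self.retrieve_dataset(dataset_id)
            yield Set(name=set_name, dataset=self.datasets[dataset_id])
```

Because the method is a generator, a dataset is only retrieved when the first set that references it is consumed, and every later set from the same dataset reuses the cached object.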

Edited by Yoann Schneider