PopulateDataset endpoint

https://redmine.teklia.com/issues/7499

Please add a new PopulateDataset endpoint, a POST on /api/v1/datasets/<id>/populate/.

It requires being logged in as a verified user.
When the dataset exists, but the user does not have guest access to the corpus, return HTTP 404.
When the dataset exists, and the user has guest access to the corpus but not contributor access, return HTTP 403.
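As a rough illustration of the two access rules above, here is a minimal sketch of the status code decision. The `Access` enum, its numeric levels and the `populate_access_status` helper are hypothetical names invented for this example, not the backend's actual permission API. Returning HTTP 404 for a missing dataset is also an assumption.

```python
from enum import IntEnum
from typing import Optional


class Access(IntEnum):
    # Hypothetical access levels; the numeric values are illustrative only.
    NONE = 0
    GUEST = 1
    CONTRIBUTOR = 2


def populate_access_status(dataset_exists: bool, access: Access) -> Optional[int]:
    """Return the HTTP error status to send, or None when the request may proceed."""
    # Without guest access, the dataset is hidden as if it did not exist.
    if not dataset_exists or access < Access.GUEST:
        return 404
    # Guest access lets the user see the dataset, but not modify it.
    if access < Access.CONTRIBUTOR:
        return 403
    return None
```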
It accepts some optional parameters in its request body (it is possible to call it with just {}):
- parent_id, UUID of a parent element. The parent element does not have to be a folder.
  - When this is set, only the elements found within this parent element will be used. Otherwise, all elements in the corpus will be used.
  - To avoid extra complexity with access rights, look for the parent element by ID only within the corpus. If it is not found there, return an HTTP 400 stating the element does not exist in this corpus.
- recursive, boolean defaulting to False.
  - When parent_id is set, this works like recursive in ListElementChildren.
  - When parent_id is unset, False uses top-level elements only, and True takes every element in the corpus, like the opposite of top_level in ListElements.
- types, a list of ElementType slugs, defaulting to ["page"].
  - This restricts the selected elements to those using the specified types.
  - If any of the types do not exist, return HTTP 400.
- count, a strictly positive integer for the number of elements to fill the dataset with. Defaults to 1000.
  - If it exceeds the number of elements filtered with the aforementioned fields, return HTTP 400.
- sets, a dictionary mapping set names to ratios. Defaults to {"train": 0.8, "dev": 0.1, "test": 0.1}.
  - Each ratio must be a floating point number between 0 and 1.
  - The sum of all ratios must be equal to 1.
  - count * ratio must be above 1. You cannot ask for 0.0001% of just 3 elements!
  - When no set exists with the specified name, return HTTP 400.
  Here is an example of what a consolidated error for all the sets at once could look like, following the error response best practices:
```
{
  "sets": {
    "__all__": ["The sum of set ratios must be equal to 1."],
    "potato": [
      "This set does not exist.",
      "Ensure this value is less than or equal to 1." // Default DRF maximum value error
    ]
  }
}
```
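The validation rules for sets can be sketched in plain Python, independently of DRF, to show how the per-set errors and the `__all__` error could be collected in one pass. The `validate_sets` function and its parameters are hypothetical. The message for the count * ratio rule is invented for this example. Comparing the sum with `math.isclose` instead of strict equality, and accepting a product of exactly 1, are both assumptions.

```python
import math


def validate_sets(sets, count, existing_set_names):
    """Return a dict of error lists keyed by set name; empty when everything is valid."""
    errors = {}
    for name, ratio in sets.items():
        set_errors = []
        if name not in existing_set_names:
            set_errors.append("This set does not exist.")
        if ratio < 0:
            set_errors.append("Ensure this value is greater than or equal to 0.")
        elif ratio > 1:
            set_errors.append("Ensure this value is less than or equal to 1.")
        elif count * ratio < 1:
            # Hypothetical message: the ticket does not specify the wording.
            set_errors.append("This ratio selects less than one element.")
        if set_errors:
            errors[name] = set_errors
    # Assumption: tolerate floating point rounding when summing the ratios.
    if not math.isclose(sum(sets.values()), 1.0):
        errors["__all__"] = ["The sum of set ratios must be equal to 1."]
    return errors
```

In a DRF serializer, a dictionary like this one could be raised as a `ValidationError` from the field's validation, so that all problems are reported in a single response.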
When everything is validated, the elements must be added randomly to each of the specified sets according to the ratios, in one transaction, without any asynchronous task.
- To reproduce the behavior of arkindex workers ml-splits, the element count for each set is computed as int(count * ratio).
  
  Since int() rounds down, a few elements can be left without an assigned set. When this occurs, add 1 element to each of the sets until you reach count. See the ml-splits source code for reference.
- ~~If an element is already in the set, count it as if it got added.~~
- ~~If an element is already in another set and Dataset.unique_elements is enabled, do not add it, and count it as if it got added.~~
  - When any filtered element is already in any set of the dataset, no matter the unique_elements setting, return HTTP 400; see this comment.
- You can compute the element counts alone before sorting or creating anything, so that you don't need extra queries to handle the edge cases.
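The count computation and random assignment described above can be sketched as follows. This is a minimal in-memory illustration, not the ml-splits code itself: `compute_set_counts` and `assign_elements` are hypothetical names, and elements are represented by plain IDs instead of database rows.

```python
import random
from itertools import cycle


def compute_set_counts(count, ratios):
    """Return how many elements each set receives; the values sum to count."""
    # Round each share down, as in int(count * ratio).
    counts = {name: int(count * ratio) for name, ratio in ratios.items()}
    # Rounding down can leave a few elements unassigned:
    # hand them out one set at a time until count is reached.
    remaining = count - sum(counts.values())
    for name in cycle(counts):
        if remaining <= 0:
            break
        counts[name] += 1
        remaining -= 1
    return counts


def assign_elements(element_ids, count, ratios, rng=random):
    """Randomly pick count elements and split them between the sets."""
    counts = compute_set_counts(count, ratios)
    # sample() returns count distinct elements in random order. It raises
    # ValueError when count exceeds the population, a case the endpoint
    # is expected to reject earlier with HTTP 400.
    picked = rng.sample(element_ids, count)
    assignment, start = {}, 0
    for name, size in counts.items():
        assignment[name] = picked[start:start + size]
        start += size
    return assignment
```

Since the counts depend only on count and the ratios, they can be checked before any element is selected or written.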
Once this is complete, return HTTP 201 with either no response body at all, or just whatever was sent in the request (which is the default behavior of DRF). HTTP 201 takes precedence over HTTP 204, even when there is no content.
Due to the high complexity of all of those parameters, the behaviors must be explicitly documented using the endpoint's description and the help_text attributes on serializer fields.

Edited Jun 11, 2024 by Erwan Rouchet