PopulateDataset endpoint
https://redmine.teklia.com/issues/7499
Please add a new `PopulateDataset` endpoint, a POST on `/api/v1/datasets/<id>/populate/`.
- It requires being logged in as a verified user.
- When the dataset exists, but the user does not have guest access to the corpus, return HTTP 404.
- When the dataset exists, and the user has guest access to the corpus but not contributor access, return HTTP 403.
- It accepts some optional parameters in its request body (it is possible to call it with just `{}`):
  - `parent_id`, UUID of a parent element. The parent element does not have to be a folder.
    - When this is set, only the elements found within this parent element will be used. Otherwise, all elements in the corpus will be used.
    - To avoid extra complexity with access rights, look for the parent element by ID only within the corpus. When it is not found there, return an HTTP 400 stating the element does not exist in this corpus.
  - `recursive`, a boolean defaulting to `False`.
    - When `parent_id` is set, this works like `recursive` in `ListElementChildren`.
    - When `parent_id` is unset, `False` uses top-level elements only, and `True` takes every element in the corpus, like an opposite of `top_level` in `ListElements`.
  - `types`, a list of ElementType slugs, defaulting to `["page"]`.
    - This restricts the selected elements to those using the specified types.
    - If any of the types do not exist, return HTTP 400.
  - `count`, a strictly positive integer for the number of elements to fill the dataset with. Defaults to 1000.
    - If it exceeds the number of elements selected by the `parent_id`, `recursive` and `types` filters, return HTTP 400.
  - `sets`, a dictionary mapping set names to ratios. Defaults to `{"train": 0.8, "dev": 0.1, "test": 0.1}`.
    - Each ratio must be a floating point number between 0 and 1.
    - The sum of all ratios must be equal to 1.
    - `count * ratio` must be above 1. You cannot ask for 0.0001% of just 3 elements!
    - When no set exists with the specified name, return HTTP 400.

    Here is an example of what a consolidated error for all the sets at once could look like, to follow the error response best practices:

        {
          "sets": {
            "__all__": ["The sum of set ratios must be equal to 1."],
            "potato": [
              "This set does not exist.",
              "Ensure this value is less than or equal to 1." // Default DRF maximum value error
            ]
          }
        }
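To make the `sets` rules above more concrete, here is a rough sketch of how the checks could be collected into one error dictionary. This is plain Python rather than actual serializer code; `validate_sets` and `existing_set_names` are hypothetical names, and the error messages other than the two shown in the example above are made up. It also assumes that "above 1" means "selects at least one element", which may need confirming.

```python
import math


def validate_sets(sets, count, existing_set_names):
    """Return a dict of error lists keyed by set name; empty when valid.

    `existing_set_names` is a hypothetical stand-in for the set names
    defined on the dataset. Ratios are assumed to be floats already.
    """
    errors = {}

    def add_error(key, message):
        errors.setdefault(key, []).append(message)

    for name, ratio in sets.items():
        if name not in existing_set_names:
            add_error(name, "This set does not exist.")
        if ratio > 1:
            add_error(name, "Ensure this value is less than or equal to 1.")
        elif ratio < 0:
            # Hypothetical message, not taken from the issue.
            add_error(name, "Ensure this value is greater than or equal to 0.")
        elif count * ratio < 1:
            # Assumption: "above 1" is read as "at least one element".
            add_error(name, "This ratio selects less than one element.")

    # Ratios are floats, so compare the sum with a tolerance instead of `==`.
    if not math.isclose(sum(sets.values()), 1.0, abs_tol=1e-9):
        add_error("__all__", "The sum of set ratios must be equal to 1.")

    return errors
```

Calling `validate_sets({"potato": 1.5}, 1000, {"train", "dev", "test"})` yields the same `__all__` and `potato` errors as the example response, to be nested under the `sets` key.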
- When everything is validated, the elements must be added randomly to each of the specified sets according to the ratios, in one transaction, without any asynchronous task.
  - To reproduce the behavior of `arkindex workers ml-splits`, the element count for each set is computed as `int(count * ratio)`.

    Since this does a `floor`, this can leave a few elements without an assigned set. When this occurs, add 1 element to each of the sets until you reach `count`. See the `ml-splits` source code for reference.
  - If an element is already in the set, count it as if it got added.
  - If an element is already in another set and `Dataset.unique_elements` is enabled, do not add it, and count it as if it got added.
    - When any filtered element is already in any set of the dataset, no matter the `unique_elements` setting, return HTTP 400; see this comment.
  - You can compute the element counts alone before sorting or creating anything, so that you don't need extra queries to handle the edge cases.
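The count computation described above could be sketched as follows. This is not the `ml-splits` code itself: `compute_set_sizes` and `split_elements` are hypothetical helpers, and handing the leftover elements to the sets in dictionary order is an assumption. The input is expected to be validated already (non-empty `sets`, ratios summing to 1, enough elements).

```python
import random


def compute_set_sizes(count, sets):
    """Floor each set's share, then hand out leftover elements one per set."""
    sizes = {name: int(count * ratio) for name, ratio in sets.items()}
    leftover = count - sum(sizes.values())
    # Assumption: leftovers go to the sets in dictionary order, one each,
    # cycling through the sets again if any are still left.
    names = list(sizes)
    for i in range(leftover):
        sizes[names[i % len(names)]] += 1
    return sizes


def split_elements(element_ids, count, sets, rng=random):
    """Pick `count` random elements and slice them into the sets."""
    sizes = compute_set_sizes(count, sets)
    # random.sample raises ValueError when count exceeds the population,
    # so `count` must have been checked against the filtered elements first.
    chosen = rng.sample(list(element_ids), count)
    assignment, start = {}, 0
    for name, size in sizes.items():
        assignment[name] = chosen[start:start + size]
        start += size
    return assignment
```

Since the sizes only depend on `count` and `sets`, they can be computed before touching the database, and the actual inserts can then happen in a single transaction.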
- Once this is complete, return HTTP 201 with either no response body at all, or just whatever was sent in the request (which is the default behavior of DRF). HTTP 201 takes precedence over HTTP 204, even when there is no content.
- Due to the high complexity of all of those parameters, the behaviors must be explicitly documented using the endpoint's `description` and the `help_text` attributes on serializer fields.
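Putting the parameters above together, a client call to the new endpoint might be built like this. It is only an illustration using the standard library: `build_populate_request` is a hypothetical helper, the base URL, token and UUIDs are placeholders, and the `Authorization: Token ...` header is an assumption about how the client authenticates.

```python
import json
import urllib.request


def build_populate_request(base_url, dataset_id, token, **params):
    """Build (but do not send) a POST request for the populate endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/datasets/{dataset_id}/populate/"
    return urllib.request.Request(
        url,
        # Every body field is optional, so an empty dict is a valid payload.
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token-based authentication header.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


request = build_populate_request(
    "https://arkindex.example.com",  # placeholder instance URL
    "11111111-1111-1111-1111-111111111111",  # placeholder dataset UUID
    "my-api-token",  # placeholder token
    parent_id="22222222-2222-2222-2222-222222222222",  # placeholder element UUID
    recursive=True,
    types=["page"],
    count=500,
    sets={"train": 0.8, "dev": 0.1, "test": 0.1},
)
```

Passing `request` to `urllib.request.urlopen` would send it; per the description above, a successful call should answer with HTTP 201.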