Commit 4e10f543 authored by Yoann Schneider, committed by Bastien Abadie

Document dataset population
To create a new dataset, click on the **+** button, on the bottom right of the dataset list.
{{ figure(image="training/datasets/create.png", height=500, caption="Create a new dataset") }}
To create a new dataset, the following fields are mandatory:

- the **name** of the dataset,
- the dataset's **description**.
The **sets** field is optional; if you leave it empty, your dataset will be created with the following default sets:

- `training`,
- `validation`,
- `test`.
{% warning() %}
Dataset names are unique within a corpus: you cannot pick a name already used by another dataset.
{% end %}
Select the names of your dataset's sets. They should match the names supported by the ML technology you plan to use later; you can always rename them afterwards if there is a mismatch.

If you wish to avoid [data leakage][1], i.e. having the same element in more than one set of your dataset, check the **Require unique elements among sets** checkbox.
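Conceptually, this checkbox enforces that the sets of a dataset are pairwise disjoint. A minimal Python sketch of that check (the set contents here are hypothetical element IDs, not the Arkindex API):

```python
from itertools import combinations

# Hypothetical element IDs per set; in Arkindex these would be element UUIDs.
sets = {
    "training": {"elem_1", "elem_2", "elem_3"},
    "validation": {"elem_4", "elem_5"},
    "test": {"elem_6"},
}

def has_leakage(sets):
    """Return True if any element appears in more than one set."""
    return any(a & b for a, b in combinations(sets.values(), 2))

print(has_leakage(sets))  # False: no element is shared between sets
```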
{{ figure(image="training/datasets/created.png", height=500, caption="Created dataset") }}
The dataset's state is **Open** at first. In this state, you can add elements to its sets and edit any of its attributes (name, description, set names, ...).
### Edit an existing dataset
To edit an existing dataset, click on the pencil-shaped icon on the far right of its row in the dataset list.
Editing is not available for `Complete` datasets.
### Adding elements to a dataset
#### Using the web interface
Once you have your dataset, you can add elements to each set.
##### Using existing splits
If the elements of each set are already split into separate folders, adding them to your dataset is easier.
The flow is the same for every set, but must be repeated for each one.
1. Browse to a folder selected for the set.
2. List all elements that should be added, recursively if there are subfolders.
3. Add all these elements to the selection. To do that faster:
   1. increase the pagination size to the maximum (**Display** -> **Pagination size**),
   2. use the **Select all displayed elements** button from the **Actions** menu on the right.

Repeat this operation for every folder selected for this set.
When all elements have been selected, browse to the selection, using the icon next to your email address in the navigation bar. The last operations are detailed in [a later section](#add-elements-to-a-dataset-from-selection). At the end, don't forget to **unselect** all elements to avoid [**data-leakage**][1].
##### Create new splits
You first need to decide on the number of elements and the ratio of each split. The number of elements depends on the machine learning technology you are using; some require larger amounts of data than others.
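As an illustration of such a split (a generic sketch, not part of Arkindex or its CLI; the element IDs and ratios are hypothetical):

```python
import random

def split_elements(element_ids, ratios={"training": 0.8, "validation": 0.1, "test": 0.1}, seed=42):
    """Shuffle element IDs and split them into sets according to the given ratios."""
    ids = list(element_ids)
    random.Random(seed).shuffle(ids)  # fixed seed makes the split reproducible
    splits, start = {}, 0
    for i, (name, ratio) in enumerate(ratios.items()):
        # The last set takes the remainder, so every element is assigned exactly once.
        end = len(ids) if i == len(ratios) - 1 else start + round(len(ids) * ratio)
        splits[name] = ids[start:end]
        start = end
    return splits

splits = split_elements(f"page_{i}" for i in range(100))
print({name: len(ids) for name, ids in splits.items()})
# {'training': 80, 'validation': 10, 'test': 10}
```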
To avoid [**data-leakage**][1], create a folder in your corpus, named after your dataset. This folder will hold all the elements selected for a split.
Then, browse to the folder element which holds the elements you want to use. Add the relevant filters to display your elements. To select `page` elements from anywhere below this folder, add:
- `recursive=Yes`
- `type=page`.
{{ figure(image="training/datasets/list_elements.png", height=500, caption="List page elements under a folder") }}
To select elements at random, set the order to `Random`, instead of `Position`. The switch is available on the right of the filter bar.
For easier browsing, you can also increase the pagination size. There are multiple sizes available, pick one that is either:
- close to the number of elements you wish to select (e.g. `100` if you want to select `95` elements),
- a divisor of the number of elements (e.g. `100` to select `400` elements).
{{ figure(image="training/datasets/random_order_pagination_size.png", height=500, caption="Display 100 elements per page and order at random") }}
Repeat the following procedure for each set in your dataset.
1. Browse to the folder which holds the elements, adding the filters and random ordering as before.
2. Set the optimal pagination size depending on the number of elements to add.
3. Use the **Select all displayed elements** button from the **Actions** menu on the right (you might have to browse through multiple pages).
4. [Add the selected elements to the dataset](#add-elements-to-a-dataset-from-selection).
5. Move the elements to the *data-leakage* folder:
   1. use the **Move elements** button in the **Actions** menu,
   2. select the folder created at the very beginning to avoid data leakage,
   3. wait for the asynchronous task to end; it should take a few minutes at most.
6. **Unselect** all elements, using the dedicated button on the selection page.
##### Add elements to a dataset from selection
Add all elements to the right set of the dataset, using the **Add to a dataset** button from the **Actions** menu. This will open a modal to select the dataset and the set.
{{ figure(image="training/datasets/add_from_selection.png", height=300, caption="Add selected elements to the 'train' set of the 'My Dataset' dataset") }}
A green notification will be displayed when the operation is done. You can browse to the dataset's details page to make sure your elements have been added.
{{ figure(image="training/datasets/filled_dataset.png", height=500, caption="The dataset's 'train' set now has 100 elements") }}
#### Command Line Interface
There is a command-line tool that creates a random dataset from all elements in a folder or a corpus.
Its documentation is available [here](https://cli.arkindex.org/elements/#creating-data-splits-for-machine-learning).
This tool also supports picking elements from the **whole corpus**.
We recommend `Ubuntu` or `Mac OS X` to use this tool.
### View the dataset's elements
To view a dataset's details and its elements, click on the name of a dataset in the list. Cycle through the tabs to see the elements in each set.
These endpoints are the most useful to handle Datasets:
Once your dataset is ready, you can start training in Arkindex. Learn more about:
- [creating dataset processes](@/training/dataset-process/index.md),
- [training a model](@/training/train-process/index.md), using said processes.
[1]: <https://en.wikipedia.org/wiki/Leakage_(machine_learning)>