Commit fc8cfef4 authored by Yoann Schneider

Merge branch 'remove-files-of-old-repo' of gitlab.com:teklia/atr/dan into remove-files-of-old-repo

parents 4650fed2 a09db0f5
```diff
@@ -8,7 +8,7 @@ repos:
     rev: 22.6.0
     hooks:
       - id: black
-  - repo: https://gitlab.com/pycqa/flake8
+  - repo: https://github.com/pycqa/flake8
     rev: 3.9.2
     hooks:
       - id: flake8
```
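This change points the flake8 pre-commit hook at the project's GitHub home, since flake8 migrated off GitLab. For context, a minimal sketch of what the surrounding `.pre-commit-config.yaml` plausibly looks like after the change; only the flake8 entry appears in the diff, so the black repo URL and overall layout here are assumptions:

```yaml
# Hypothetical reconstruction: only the flake8 entry is confirmed by the diff.
repos:
  - repo: https://github.com/psf/black   # assumed URL for the black hook
    rev: 22.6.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 3.9.2
    hooks:
      - id: flake8
```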
````diff
@@ -104,19 +104,19 @@ The available arguments are
 | Parameter | Description | Type | Default |
 | --------- | ----------- | ---- | ------- |
-| `--parent` | UUID of the folder to import from Arkindex. You may specify multiple UUIDs. | str/uuid | |
-| `--element-type` | Type of the elements to extract. You may specify multiple types. | str | |
-| `--output` | Folder where the data will be generated. Must exist. | Path | |
-| `--load-entities` | Extract text with their entities. Needed for NER tasks. | bool | False |
-| `--tokens` | Mapping between starting tokens and end tokens. Needed for NER tasks. | Path | |
-| `--use-existing-split` | Use the specified folder IDs for the dataset split. | bool | |
-| `--train-folder` | ID of the training folder to import from Arkindex. | uuid | |
-| `--val-folder` | ID of the validation folder to import from Arkindex. | uuid | |
-| `--test-folder` | ID of the training folder to import from Arkindex. | uuid | |
-| `--transcription-worker-version` | Filter transcriptions by worker_version. Use ‘manual’ for manual filtering. | str/uuid | |
-| `--entity-worker-version` | Filter transcriptions entities by worker_version. Use ‘manual’ for manual filtering | str/uuid | |
-| `--train-prob` | Training set split size | float | 0,7 |
-| `--val-prob` | Validation set split size | float | 0,15 |
+| `--parent` | UUID of the folder to import from Arkindex. You may specify multiple UUIDs. | `str/uuid` | |
+| `--element-type` | Type of the elements to extract. You may specify multiple types. | `str` | |
+| `--output` | Folder where the data will be generated. Must exist. | `Path` | |
+| `--load-entities` | Extract text with their entities. Needed for NER tasks. | `bool` | `False` |
+| `--tokens` | Mapping between starting tokens and end tokens. Needed for NER tasks. | `Path` | |
+| `--use-existing-split` | Use the specified folder IDs for the dataset split. | `bool` | |
+| `--train-folder` | ID of the training folder to import from Arkindex. | `uuid` | |
+| `--val-folder` | ID of the validation folder to import from Arkindex. | `uuid` | |
+| `--test-folder` | ID of the training folder to import from Arkindex. | `uuid` | |
+| `--transcription-worker-version` | Filter transcriptions by worker_version. Use ‘manual’ for manual filtering. | `str/uuid` | |
+| `--entity-worker-version` | Filter transcriptions entities by worker_version. Use ‘manual’ for manual filtering | `str/uuid` | |
+| `--train-prob` | Training set split size | `float` | `0,7` |
+| `--val-prob` | Validation set split size | `float` | `0,15` |
 
 The `--tokens` argument expects a file with the following format.
 
 ```yaml
````
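The `--train-prob` / `--val-prob` arguments above describe a probabilistic dataset split (with the test set receiving the remainder). A standalone sketch of how such a split could work; this is an illustration, not DAN's actual implementation, and `split_ids` is a hypothetical helper:

```python
import random


def split_ids(ids, train_prob=0.7, val_prob=0.15, seed=42):
    """Shuffle element IDs and split them into train/val/test subsets.

    The test subset receives whatever remains after the train and
    validation fractions are taken.
    """
    rng = random.Random(seed)  # seeded for a reproducible split
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_prob)
    n_val = round(len(shuffled) * val_prob)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


train, val, test = split_ids([f"elem-{i}" for i in range(100)])
print(len(train), len(val), len(test))  # 70 15 15
```

With the defaults this yields a 70/15/15 split; seeding the shuffle makes the partition reproducible across runs.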
````diff
@@ -171,7 +171,7 @@ To use the data from three folders as **training**, **validation** and **testing**
 ```shell
 teklia-dan extract \
     --use-existing-split \
-    --train-folder 2275529a-1ec5-40ce-a516-42ea7ada858c
+    --train-folder 2275529a-1ec5-40ce-a516-42ea7ada858c \
     --val-folder af9b38b5-5d95-417d-87ec-730537cb1898 \
     --test-folder 6ff44957-0e65-48c5-9d77-a178116405b2 \
     --element-type page \
````

```diff
@@ -193,4 +193,3 @@ teklia-dan extract \
 #### Synthetic data generation
 `teklia-dan generate` with multiple arguments
```
```diff
@@ -4,17 +4,18 @@
 Extract dataset from Arkindex using API.
 """
-from collections import defaultdict
 import logging
 import os
 import pathlib
 import random
 import uuid
+from collections import defaultdict
 
 import imageio.v2 as iio
 from arkindex import ArkindexClient, options_from_env
 from tqdm import tqdm
 
+from dan import logger
 from dan.datasets.extract.utils import (
     insert_token,
     parse_tokens,
@@ -23,9 +24,6 @@ from dan.datasets.extract.utils import (
     save_text,
 )
 
-from dan import logger
-
 IMAGES_DIR = "images"  # Subpath to the images directory.
 LABELS_DIR = "labels"  # Subpath to the labels directory.
 MANUAL_SOURCE = "manual"
```
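The import reshuffling above follows the usual isort convention: standard-library imports first, then third-party packages, then first-party modules (here, `dan`), each group alphabetized. A minimal sketch of that grouping rule, not isort itself; the `STDLIB`/`FIRST_PARTY` sets and helper names are illustrative:

```python
# Hypothetical sketch of isort-style import grouping (not the isort library).
STDLIB = {"collections", "json", "logging", "os", "pathlib", "random", "uuid"}
FIRST_PARTY = {"dan"}  # the project's own top-level package


def group_of(module):
    """Return the sort group of a module: 0=stdlib, 1=third-party, 2=first-party."""
    root = module.split(".")[0]
    if root in STDLIB:
        return 0
    if root in FIRST_PARTY:
        return 2
    return 1


def sort_imports(modules):
    """Order modules by group, then alphabetically within each group."""
    return sorted(modules, key=lambda m: (group_of(m), m))


print(sort_imports(["collections", "dan", "tqdm", "uuid", "arkindex", "logging"]))
# ['collections', 'logging', 'uuid', 'arkindex', 'tqdm', 'dan']
```

This is exactly the order the diff produces: `from collections import defaultdict` joins the stdlib block, and `from dan import logger` moves into the first-party block ahead of `dan.datasets.extract.utils`.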
```diff
 # -*- coding: utf-8 -*-
-import yaml
 import json
 import random
 import cv2
+import yaml
 
 random.seed(42)
```
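This file calls `random.seed(42)` at module load. A standalone illustration (not DAN code) of the reproducibility that a fixed seed provides:

```python
import random

# Seeding the global RNG makes every subsequent draw deterministic.
random.seed(42)
a = [random.randint(0, 100) for _ in range(5)]

# Re-seeding with the same value replays the identical sequence.
random.seed(42)
b = [random.randint(0, 100) for _ in range(5)]

print(a == b)  # True
```

Deterministic draws matter here because dataset generation and splitting must produce the same output on every run.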