Generate training dataset
Depends on #6 (closed)
We will use the existing function in `worker.utils` to generate the training dataset from Arkindex elements. In a new `worker/generate_dataset.py` script, implement a function that takes an Arkindex element as input and creates the corresponding PyLaia-formatted data. There will be two parts:
- generate the images
- generate the train txt file (`<path/to/image> <transcription>`)
You will need the following CLI arguments (see the `argparse` sketch below):

- `--element`, `uuid.UUID`, required, UUID of an Element
- `--arkindex-url`, `str`, URL of an Arkindex instance, defaults to `os.environ.get("ARKINDEX_API_URL")`
- `--arkindex-token`, `str`, authentication token to an Arkindex instance, defaults to `os.environ.get("ARKINDEX_API_TOKEN")`
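A minimal sketch of the parser, assuming plain `argparse` (argument names match the list above; help strings are illustrative):

```python
import argparse
import os
import uuid


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Generate a PyLaia training dataset from an Arkindex element"
    )
    parser.add_argument(
        "--element", type=uuid.UUID, required=True, help="UUID of an element"
    )
    parser.add_argument(
        "--arkindex-url",
        type=str,
        default=os.environ.get("ARKINDEX_API_URL"),
        help="URL of an Arkindex instance",
    )
    parser.add_argument(
        "--arkindex-token",
        type=str,
        default=os.environ.get("ARKINDEX_API_TOKEN"),
        help="Authentication token to an Arkindex instance",
    )
    return parser.parse_args()
```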
The first part will be a non-class version of `retrieve_line_images`. The cache-related code is not relevant here; replace the `list_element_children` call with an API call using an Arkindex API client. Use this function to download the images of the element's children into `<tmpdir>/images/<element.id>/`.
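A rough sketch of that non-class version, assuming the Arkindex Python client exposes `ArkindexClient` with a `paginate` helper over the `ListElementChildren` endpoint, and that each child's zone carries a `url` serving the cropped image; double-check these details against the existing `retrieve_line_images`:

```python
import os
from pathlib import Path
from uuid import UUID

import requests
from arkindex import ArkindexClient

# Assumption: in the real script the client would be built from the CLI
# arguments; check the client docs for the exact constructor signature.
client = ArkindexClient(
    os.environ.get("ARKINDEX_API_TOKEN"),
    base_url=os.environ.get("ARKINDEX_API_URL"),
)


def retrieve_line_images(element_id: UUID, img_dir: Path) -> list[dict]:
    """Download the images of the element's children into img_dir and
    return one dict per line with its polygon and transcriptions."""
    polygons = []
    # Plain API call instead of the cached list_element_children helper
    # used by the class-based version.
    for child in client.paginate("ListElementChildren", id=str(element_id)):
        zone = child.get("zone")
        if zone is None:
            continue
        # Assumption: zone["url"] serves the image cropped to the polygon.
        response = requests.get(zone["url"])
        response.raise_for_status()
        (img_dir / f"{child['id']}.jpg").write_bytes(response.content)
        polygons.append(
            {
                "id": child["id"],
                "polygon": zone["polygon"],
                # Assumption: transcriptions are part of the child payload;
                # otherwise fetch them through a separate endpoint.
                "transcriptions": child.get("transcriptions") or [],
            }
        )
    return polygons
```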
The second part will use the polygons returned by the first part. You will need to implement:

- `process_line(polygon: Polygon, image_dir=<tmpdir>/images/<element.id>/) -> str`: takes a polygon and generates the string that will be written to the final train text file. You will need a `tokenize` function to split the transcription into the correct tokens defined in `syms.txt`; something similar is done in the `convert_to_pylaia` script. The string will be `"{image_dir/<line.id>.jpg} {tokenize(line.transcription.text)}"`. If the line has multiple transcriptions, use the first one provided by the backend. Skip lines that have no transcription. See the sketch after this list.
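A minimal sketch of these two helpers, assuming a character-level `syms.txt` with a `<space>` token and the hypothetical polygon dicts built by `retrieve_line_images` above:

```python
from pathlib import Path


def tokenize(text: str) -> str:
    # Assumption: syms.txt defines one symbol per character plus a <space>
    # token; adapt this if it defines multi-character tokens instead.
    return " ".join("<space>" if char.isspace() else char for char in text)


def process_line(polygon: dict, image_dir: Path) -> str | None:
    transcriptions = polygon["transcriptions"]
    if not transcriptions:
        return None  # skip lines that have no transcription
    # Use the first transcription provided by the backend.
    text = transcriptions[0]["text"]
    image_path = image_dir / f"{polygon['id']}.jpg"
    return f"{image_path} {tokenize(text)}\n"
```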
Finally, the full function will look more or less like:

```python
def generate_training_dataset(element_id: UUID, data_dir: Path) -> None:
    # Create <data_dir>/images/<element_id>/ for the downloaded line images
    img_dir = data_dir / f"images/{element_id}"
    img_dir.mkdir(parents=True, exist_ok=True)
    polygons = retrieve_line_images(element_id, img_dir)
    with (data_dir / "train.txt").open("w") as file:
        for polygon in polygons:
            line = process_line(polygon, img_dir)
            if line is not None:  # lines without a transcription are skipped
                file.write(line)
```
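For completeness, a possible entry point tying the CLI arguments to this function (the temporary-directory handling is an assumption, not specified in this issue):

```python
import tempfile
from pathlib import Path

if __name__ == "__main__":
    args = parse_args()
    with tempfile.TemporaryDirectory() as tmpdir:
        generate_training_dataset(args.element, Path(tmpdir))
```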