Generate training dataset
Depends on #6 (closed)
We will use the existing function in `worker.utils` to generate the training dataset from Arkindex elements. In a new `worker/generate_dataset.py` script, implement a function that takes an Arkindex element as input and creates the corresponding PyLaia-formatted data. There will be two parts:
- generate the images
- generate the train txt file (`<path/to/image> <transcription>`)
You will need the following CLI arguments (see the `argparse` sketch below):

- `--element`, `uuid.UUID`, required, UUID of an Element
- `--arkindex-url`, `str`, URL of an Arkindex instance, defaults to `os.environ.get("ARKINDEX_API_URL")`
- `--arkindex-token`, `str`, authentication token to an Arkindex instance, defaults to `os.environ.get("ARKINDEX_API_TOKEN")`
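A minimal sketch of the parser, assuming plain `argparse` (argument names match the list above; help strings are illustrative):

```python
import argparse
import os
import uuid


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Generate a PyLaia training dataset from an Arkindex element"
    )
    parser.add_argument(
        "--element", type=uuid.UUID, required=True, help="UUID of an element"
    )
    parser.add_argument(
        "--arkindex-url",
        type=str,
        default=os.environ.get("ARKINDEX_API_URL"),
        help="URL of an Arkindex instance",
    )
    parser.add_argument(
        "--arkindex-token",
        type=str,
        default=os.environ.get("ARKINDEX_API_TOKEN"),
        help="Authentication token to an Arkindex instance",
    )
    return parser.parse_args()
```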
The first part will be a non-class version of `retrieve_line_images`. The cache-related code is not relevant here; replace the `list_element_children` call with an API call using an Arkindex API client. Use this function to download the images of the element's children into `<tmpdir>/images/<element.id>/`.
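A rough sketch of that non-class version, assuming the Arkindex Python client exposes `ArkindexClient` with a `paginate` helper over the `ListElementChildren` endpoint, and that each child's zone carries a `url` serving the cropped image; double-check these details against the existing `retrieve_line_images`:

```python
import os
from pathlib import Path
from uuid import UUID

import requests
from arkindex import ArkindexClient

# Assumption: in the real script the client would be built from the CLI
# arguments; check the client docs for the exact constructor signature.
client = ArkindexClient(
    os.environ.get("ARKINDEX_API_TOKEN"),
    base_url=os.environ.get("ARKINDEX_API_URL"),
)


def retrieve_line_images(element_id: UUID, img_dir: Path) -> list[dict]:
    """Download the images of the element's children into img_dir and
    return one dict per line with its polygon and transcriptions."""
    polygons = []
    # Plain API call instead of the cached list_element_children helper
    # used by the class-based version.
    for child in client.paginate("ListElementChildren", id=str(element_id)):
        zone = child.get("zone")
        if zone is None:
            continue
        # Assumption: zone["url"] serves the image cropped to the polygon.
        response = requests.get(zone["url"])
        response.raise_for_status()
        (img_dir / f"{child['id']}.jpg").write_bytes(response.content)
        polygons.append(
            {
                "id": child["id"],
                "polygon": zone["polygon"],
                # Assumption: transcriptions are part of the child payload;
                # otherwise fetch them through a separate endpoint.
                "transcriptions": child.get("transcriptions") or [],
            }
        )
    return polygons
```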
The second part will use the polygons returned by the first part. You will need to implement:

- `process_line(polygon: Polygon, image_dir=<tmpdir>/images/<element.id>/) -> str`: takes a polygon and generates the string that will be written to the final train text file. You will need a `tokenize` function to split the transcription into the correct tokens defined in `syms.txt`; something similar is done in the `convert_to_pylaia` script. The string will be `"{image_dir/<line.id>.jpg} {tokenize(line.transcription.text)}"`. If the line has multiple transcriptions, use the first one provided by the backend. Skip lines that have no transcription. See the sketch after this list.
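A minimal sketch of these two helpers, assuming a character-level `syms.txt` with a `<space>` token and the hypothetical polygon dicts built by `retrieve_line_images` above:

```python
from pathlib import Path


def tokenize(text: str) -> str:
    # Assumption: syms.txt defines one symbol per character plus a <space>
    # token; adapt this if it defines multi-character tokens instead.
    return " ".join("<space>" if char.isspace() else char for char in text)


def process_line(polygon: dict, image_dir: Path) -> str | None:
    transcriptions = polygon["transcriptions"]
    if not transcriptions:
        return None  # skip lines that have no transcription
    # Use the first transcription provided by the backend.
    text = transcriptions[0]["text"]
    image_path = image_dir / f"{polygon['id']}.jpg"
    return f"{image_path} {tokenize(text)}\n"
```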
Finally, the full function will look more or less like:

```python
def generate_training_dataset(element_id: UUID, data_dir: Path) -> None:
    # Create <data_dir>/images/<element_id>/ for the downloaded line images
    img_dir = data_dir / f"images/{element_id}"
    img_dir.mkdir(parents=True, exist_ok=True)
    polygons = retrieve_line_images(element_id, img_dir)
    with (data_dir / "train.txt").open("w") as file:
        for polygon in polygons:
            line = process_line(polygon, img_dir)
            if line is not None:  # lines without a transcription are skipped
                file.write(line)
```
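For completeness, a possible entry point tying the CLI arguments to this function (the temporary-directory handling is an assumption, not specified in this issue):

```python
import tempfile
from pathlib import Path

if __name__ == "__main__":
    args = parse_args()
    with tempfile.TemporaryDirectory() as tmpdir:
        generate_training_dataset(args.element, Path(tmpdir))
```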