
Integration of Worker Runs in extraction

Currently, DAN's extraction uses worker versions to identify which transcriptions to retrieve, through the --transcription-worker-version flag of teklia-dan dataset extract. A worker version is typically attached to a worker run. A single worker version can produce many transcriptions, and these can come from different worker runs started by users. In that case, @yschneider told me that the program picks one of the matching transcriptions at random when there are two of them. I believe there should be a --user-worker-run flag to identify which worker run the transcription comes from.
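To make the ambiguity concrete, here is a minimal sketch in plain Python. It does not use DAN's code or database schema: the Transcription class, its fields, and the identifiers are all hypothetical, chosen only to show that a worker version alone can match several transcriptions while a worker run narrows the match down.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a stored transcription; not DAN's actual model.
@dataclass(frozen=True)
class Transcription:
    text: str
    worker_version_id: str
    worker_run_id: str


# Two transcriptions produced by the same worker version in two different runs.
transcriptions = [
    Transcription("first attempt", "version-1", "run-a"),
    Transcription("second attempt", "version-1", "run-b"),
]

# Filtering on the worker version alone leaves both candidates.
by_version = [t for t in transcriptions if t.worker_version_id == "version-1"]

# Filtering on the worker run as well selects exactly one of them.
by_run = [t for t in by_version if t.worker_run_id == "run-b"]
```

With only the version filter, the caller still has to choose between two candidates, which is where the arbitrary pick comes from.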


Activity

Yoann Schneider added P2 Quick Win labels 11 months ago

Yoann Schneider changed milestone to %0.2.1 11 months ago

Yoann Schneider @yschneider · 11 months ago

Maintainer
Ok TLDR;

We'll implement the following arguments:

--transcription-worker-runs: to filter transcriptions by worker runs

--entity-worker-runs: to filter entities

Add these two to the CLI and in every location, next to their worker-version equivalents. We will probably need a build_worker_run_filter similar to https://gitlab.teklia.com/atr/dan/-/blob/4d50e8c67d2cec56ba548ad1391568150aa62e04/dan/datasets/extract/db.py#L54
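A rough sketch of how the two arguments and the filter could fit together is shown below. It is not the code from the merge request: the parser is a standalone argparse example, and this build_worker_run_filter is a hypothetical pure-Python predicate, whereas the real helper in db.py would presumably build a database query condition. Accepting several values per flag and treating an empty list as "no filtering" are assumptions too.

```python
import argparse
from typing import Callable, Iterable, Optional


def build_parser() -> argparse.ArgumentParser:
    # Standalone parser for illustration; the real CLI has many more options.
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument(
        "--transcription-worker-runs",
        nargs="+",
        default=[],
        help="Only keep transcriptions created by these worker runs.",
    )
    parser.add_argument(
        "--entity-worker-runs",
        nargs="+",
        default=[],
        help="Only keep entities created by these worker runs.",
    )
    return parser


def build_worker_run_filter(
    worker_runs: Iterable[str],
) -> Callable[[Optional[str]], bool]:
    """Return a predicate telling whether a worker run ID passes the filter.

    Assumption: an empty list means that no filtering is requested.
    """
    allowed = set(worker_runs)
    if not allowed:
        return lambda worker_run_id: True
    return lambda worker_run_id: worker_run_id in allowed


# Example invocation with two transcription worker runs and no entity filter.
args = build_parser().parse_args(
    ["--transcription-worker-runs", "run-a", "run-b"]
)
keep_transcription = build_worker_run_filter(args.transcription_worker_runs)
keep_entity = build_worker_run_filter(args.entity_worker_runs)
```

Building the filter once per argument keeps the transcription and entity cases symmetric, in the same way the worker-version options are paired.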
Yoann Schneider assigned to @mblanco 11 months ago

Manon Blanco mentioned in merge request !416 (merged) 10 months ago

Yoann Schneider closed with merge request !416 (merged) 10 months ago

Yoann Schneider mentioned in commit fcebc8aa 10 months ago

