Rework the worker due to `Dataset` API changes
All threads resolved!
Closes #14 (closed), #16 (closed)
Edited by Eva Bardou
Activity
assigned to @ebardou
Tested locally using this dataset:
export ARKINDEX_CORPUS_ID=b3a664de-eecc-4ca0-9ce5-939a650eee02
export ARKINDEX_API_URL=https://ee.preprod.arkindex.teklia.com/api/v1/
ARKINDEX_API_TOKEN=XXX worker-generic-training-dataset --set a65c5e0c-5126-44de-8d68-674f6ac071cc:training a65c5e0c-5126-44de-8d68-674f6ac071cc:validation a65c5e0c-5126-44de-8d68-674f6ac071cc:test --dev
2024-04-04 16:43:11,038 WARNING/arkindex_worker: Missing ARKINDEX_WORKER_RUN_ID environment variable, worker is in read-only mode
2024-04-04 16:43:11,038 INFO/arkindex_worker: Worker will use /home/eva/.local/share/arkindex as working directory
2024-04-04 16:43:12,172 WARNING/arkindex_worker: Running without any extra configuration
2024-04-04 16:43:12,172 INFO/arkindex.pagination: Loading first page on try 1/5
2024-04-04 16:43:12,364 INFO/arkindex.pagination: Pagination will load a total of 1 page.
2024-04-04 16:43:12,364 INFO/worker_generic_training_dataset.worker: Downloading export (4a3af150-730b-42b5-8289-23b9f0d084fd)...
2024-04-04 16:43:13,050 INFO/worker_generic_training_dataset.worker: Downloaded export (4a3af150-730b-42b5-8289-23b9f0d084fd) @ `/tmp/test-eva-20240404-143748.sqlite`
2024-04-04 16:43:13,111 INFO/worker_generic_training_dataset.worker: Building Dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc) (1/1)
2024-04-04 16:43:13,111 WARNING/arkindex_worker: Cannot update dataset as this worker is in read-only mode
2024-04-04 16:43:13,112 INFO/worker_generic_training_dataset.worker: Processing Dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc) (1/1)
2024-04-04 16:43:13,112 INFO/worker_generic_training_dataset.worker: Cached database will be saved at `/tmp/tmpsf1y946a-arkindex-data/db.sqlite`.
2024-04-04 16:43:13,113 INFO/arkindex_worker: Connected to cache on /tmp/tmpsf1y946a-arkindex-data/db.sqlite
2024-04-04 16:43:13,264 INFO/worker_generic_training_dataset.worker: Images will be saved at `/tmp/tmpsf1y946a-arkindex-data/images`.
2024-04-04 16:43:13,265 INFO/worker_generic_training_dataset.worker: Inserting dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc)
2024-04-04 16:43:13,272 INFO/arkindex.pagination: Loading first page on try 1/5
2024-04-04 16:43:13,363 INFO/arkindex.pagination: Pagination will load a total of 1 page.
2024-04-04 16:43:13,365 INFO/worker_generic_training_dataset.worker: Filling the cache with information from elements in the split training
2024-04-04 16:43:13,366 INFO/worker_generic_training_dataset.worker: Processing `training` element (1/1)
2024-04-04 16:43:13,366 INFO/worker_generic_training_dataset.worker: Processing element (735fa0da-f7de-45fd-b0a3-eceb93b342ae)
2024-04-04 16:43:13,366 INFO/worker_generic_training_dataset.worker: Downloading image
2024-04-04 16:43:14,120 INFO/arkindex_worker: Downloaded image https://preprod-arkindex-iiif.europe.iiif.teklia.com/iiif/2/ddf8fcbd-579e-449c-9d89-8f5ef9773d69/0,0,800,711/full/0/default.jpg - size=800x711 in 0:00:00.664373
2024-04-04 16:43:14,142 INFO/worker_generic_training_dataset.worker: Inserting image
2024-04-04 16:43:14,156 INFO/worker_generic_training_dataset.worker: Inserting element
2024-04-04 16:43:14,165 INFO/worker_generic_training_dataset.worker: Listing classifications
2024-04-04 16:43:14,166 INFO/worker_generic_training_dataset.worker: Listing transcriptions
2024-04-04 16:43:14,167 INFO/worker_generic_training_dataset.worker: Listing entities
2024-04-04 16:43:14,167 INFO/worker_generic_training_dataset.worker: Linking element 735fa0da-f7de-45fd-b0a3-eceb93b342ae to dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc)
2024-04-04 16:43:14,182 INFO/arkindex.pagination: Loading first page on try 1/5
2024-04-04 16:43:14,266 INFO/arkindex.pagination: Pagination will load a total of 1 page.
2024-04-04 16:43:14,268 INFO/worker_generic_training_dataset.worker: Filling the cache with information from elements in the split validation
2024-04-04 16:43:14,268 INFO/worker_generic_training_dataset.worker: Processing `validation` element (1/1)
2024-04-04 16:43:14,268 INFO/worker_generic_training_dataset.worker: Processing element (ab3fbf5b-8644-4d84-a111-adde4415a4fd)
2024-04-04 16:43:14,268 INFO/worker_generic_training_dataset.worker: Downloading image
2024-04-04 16:43:15,587 INFO/arkindex_worker: Downloaded image https://preprod-arkindex-iiif.europe.iiif.teklia.com/iiif/2/e8dcc924-f711-489c-9c55-8479d79baf43/0,0,1200,1674/full/0/default.jpg - size=1200x1674 in 0:00:01.195585
2024-04-04 16:43:15,626 INFO/worker_generic_training_dataset.worker: Inserting image
2024-04-04 16:43:15,635 INFO/worker_generic_training_dataset.worker: Inserting element
2024-04-04 16:43:15,643 INFO/worker_generic_training_dataset.worker: Listing classifications
2024-04-04 16:43:15,645 INFO/worker_generic_training_dataset.worker: Listing transcriptions
2024-04-04 16:43:15,646 INFO/worker_generic_training_dataset.worker: Listing entities
2024-04-04 16:43:15,646 INFO/worker_generic_training_dataset.worker: Linking element ab3fbf5b-8644-4d84-a111-adde4415a4fd to dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc)
2024-04-04 16:43:15,659 INFO/arkindex.pagination: Loading first page on try 1/5
2024-04-04 16:43:15,746 INFO/arkindex.pagination: Pagination will load a total of 1 page.
2024-04-04 16:43:15,747 INFO/worker_generic_training_dataset.worker: Filling the cache with information from elements in the split test
2024-04-04 16:43:15,747 INFO/worker_generic_training_dataset.worker: Processing `test` element (1/1)
2024-04-04 16:43:15,747 INFO/worker_generic_training_dataset.worker: Processing element (10262955-e628-4e6c-bfd9-04e3060aa74d)
2024-04-04 16:43:15,748 INFO/worker_generic_training_dataset.worker: Downloading image
2024-04-04 16:43:17,210 INFO/arkindex_worker: Downloaded image https://preprod-arkindex-iiif.europe.iiif.teklia.com/iiif/2/1b158a8f-b9d9-4e64-a5db-f9fbc8a50db0/0,0,1500,998/full/0/default.jpg - size=1500x998 in 0:00:01.251458
2024-04-04 16:43:17,254 INFO/worker_generic_training_dataset.worker: Inserting image
2024-04-04 16:43:17,272 INFO/worker_generic_training_dataset.worker: Inserting element
2024-04-04 16:43:17,281 INFO/worker_generic_training_dataset.worker: Listing classifications
2024-04-04 16:43:17,282 INFO/worker_generic_training_dataset.worker: Listing transcriptions
2024-04-04 16:43:17,282 INFO/worker_generic_training_dataset.worker: Listing entities
2024-04-04 16:43:17,282 INFO/worker_generic_training_dataset.worker: Linking element 10262955-e628-4e6c-bfd9-04e3060aa74d to dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc)
2024-04-04 16:43:17,291 INFO/worker_generic_training_dataset.worker: Compressing the images to /home/eva/.local/share/arkindex/a65c5e0c-5126-44de-8d68-674f6ac071cc.tar.zst
2024-04-04 16:43:17,302 INFO/worker_generic_training_dataset.worker: Completed Dataset (a65c5e0c-5126-44de-8d68-674f6ac071cc) (1/1)
2024-04-04 16:43:17,302 WARNING/arkindex_worker: Cannot update dataset as this worker is in read-only mode
2024-04-04 16:43:17,302 INFO/worker_generic_training_dataset.worker: Ran on 1 dataset: 1 completed, 0 failed
This run generated the following archive: a65c5e0c-5126-44de-8d68-674f6ac071cc.tar.zst.
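Each `--set` value in the command above appears to pair a dataset ID with a set name, separated by a colon. As an illustration only, here is a minimal sketch of how such values could be split and grouped per dataset. The helper names `parse_set_spec` and `group_sets` are hypothetical and are not taken from the worker or from `arkindex-base-worker`; the actual parsing there may differ.

```python
from collections import defaultdict
from uuid import UUID


def parse_set_spec(spec: str) -> tuple[UUID, str]:
    """Split an assumed `<dataset_id>:<set_name>` value into its two parts."""
    dataset_id, sep, set_name = spec.partition(":")
    if not sep or not set_name:
        raise ValueError(f"Expected <dataset_id>:<set_name>, got {spec!r}")
    # UUID() raises ValueError itself if the ID is malformed
    return UUID(dataset_id), set_name


def group_sets(specs: list[str]) -> dict[UUID, list[str]]:
    """Group set names by dataset ID, preserving the order they were given in."""
    grouped: dict[UUID, list[str]] = defaultdict(list)
    for spec in specs:
        dataset_id, set_name = parse_set_spec(spec)
        grouped[dataset_id].append(set_name)
    return dict(grouped)


# The three values passed to --set in the local test
specs = [
    "a65c5e0c-5126-44de-8d68-674f6ac071cc:training",
    "a65c5e0c-5126-44de-8d68-674f6ac071cc:validation",
    "a65c5e0c-5126-44de-8d68-674f6ac071cc:test",
]
grouped = group_sets(specs)
for dataset_id, set_names in grouped.items():
    print(dataset_id, set_names)
```

Under this reading, the three values collapse into a single dataset with three sets, which is consistent with the final log line reporting one dataset processed.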
I was not able to test the worker directly on preprod, as the frontend to launch DatasetProcesses is currently broken there.
requested review from @yschneider
added 4 commits
- 69505172...cf632f21 - 2 commits from branch main
- de759487 - Bump arkindex-base-worker to 0.3.7rc7
- c93af222 - Rework the worker
- Resolved by Eva Bardou
- Resolved by Eva Bardou
- Resolved by Eva Bardou
requested review from @yschneider
enabled an automatic merge when the pipeline for 951bf9f1 succeeds
enabled an automatic merge when the pipeline for a5726c5c succeeds
mentioned in commit e4452dc1