Extract data without any transcription
For Enguehard, I would like to extract all table_row elements, even if they have no transcription (e.g. https://demo.arkindex.org/element/13125927-2656-4186-8dff-3bf50e734ed1).
With the current extraction code, I get the following warning, and the corresponding element is not exported:
Extracting data from table (beedaacf-ff5f-43c5-b823-99fae8184688) for split (test) (1/10): : 72it [00:02, 36.28it/s]
2023-10-02 11:10:45,410 WARNING/dan.datasets.extract.extract: Skipping 5871fde9-9402-4bc1-85e6-c5fb57755fa3: No transcriptions found on element (5871fde9-9402-4bc1-85e6-c5fb57755fa3) with this config. Skipping.
We should have an option to also export elements with no transcription.
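To make the request concrete, here is a minimal sketch of the behaviour I have in mind. It is not the actual DAN extraction code: `Element`, `NoTranscriptionError`, `extract_text`, `extract_split` and the `allow_empty` flag are all hypothetical names used for illustration only.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


class NoTranscriptionError(Exception):
    """Hypothetical error raised when an element has no transcription."""


@dataclass
class Element:
    """Hypothetical stand-in for an exported Arkindex element."""

    id: str
    transcriptions: list[str] = field(default_factory=list)


def extract_text(element: Element, allow_empty: bool = False) -> str:
    """Return the element's first transcription, or "" when allowed."""
    if not element.transcriptions:
        if allow_empty:
            # Proposed behaviour: keep the element with an empty text.
            return ""
        raise NoTranscriptionError(element.id)
    return element.transcriptions[0]


def extract_split(elements: list[Element], allow_empty: bool = False) -> dict[str, str]:
    """Map element IDs to their text, skipping failures with a warning."""
    exported = {}
    for element in elements:
        try:
            exported[element.id] = extract_text(element, allow_empty)
        except NoTranscriptionError:
            # Current behaviour: the element is dropped from the export.
            logger.warning("Skipping %s: no transcriptions found", element.id)
    return exported
```

With the flag off, an element without a transcription is skipped as it is today; with the flag on, it is exported with an empty string, so the row and its image would still end up in the dataset.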
To reproduce the warnings:
teklia-dan dataset extract data/enguehard/joseph-enguehard-20231002-064355.sqlite \
--element-type table_row \
--output data/enguehard/ \
--tokens data/enguehard/enguehard_mapping.yml \
--train-folder 1f4cce6c-9dd3-4d55-a04c-b81bada8805e \
--val-folder 40af9ee7-fe1b-448a-8b8f-25ec1e9996a3 \
--test-folder 1d841945-4cbd-4247-afa1-40bee70e192b \
--transcription-worker-version 9b55114b-8c18-42aa-9df1-16c77053bea6 \
--entity-worker-version 9b55114b-8c18-42aa-9df1-16c77053bea6 \
--max-width 2000 \
--parent-element-type table \
--entity-separators $'\n' " "