Rename images with line id
Two main changes:
- Images are renamed {page_id}_{line_number}_{line_id} (previously {page_id}_{line_number}) as it is easier to visualize on Arkindex
- Line number (e.g. line_032) should come from the element's name and not from enumerate, to avoid an offset on pages containing lines without transcription
Note: I'm assuming that there are at most 999 lines/image (line_number has 3 digits)
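For illustration, here is a minimal sketch of the naming scheme and of the offset problem described above. It is not the code from this merge request: the helper names (line_number_from_name, image_basename), the sample element names ("line 1", ...) and the dictionary layout are all hypothetical. It assumes an element's name ends with its line number, and it writes only the zero-padded digits into the file name; if the real format keeps a prefix such as line_, the f-string would need adjusting.

```python
import re


def line_number_from_name(name: str) -> int:
    """Return the trailing integer of an element name (assumed format, e.g. 'line 32')."""
    match = re.search(r"(\d+)\s*$", name)
    if match is None:
        raise ValueError(f"no line number found in element name: {name!r}")
    return int(match.group(1))


def image_basename(page_id: str, line_name: str, line_id: str) -> str:
    """Build {page_id}_{line_number}_{line_id} with a 3-digit line number."""
    number = line_number_from_name(line_name)
    if number > 999:
        # The 3-digit padding assumes at most 999 lines per image.
        raise ValueError(f"line number {number} does not fit in 3 digits")
    return f"{page_id}_{number:03d}_{line_id}"


# Hypothetical page: the second line has no transcription and is skipped.
lines = [
    {"id": "aaa", "name": "line 1", "text": "first line"},
    {"id": "bbb", "name": "line 2", "text": None},
    {"id": "ccc", "name": "line 3", "text": "third line"},
]
transcribed = [line for line in lines if line["text"]]

# Numbering with enumerate shifts every line that follows a skipped one.
by_enumerate = [index for index, _ in enumerate(transcribed, start=1)]

# Numbering from the element's name keeps the position on the page.
by_name = [line_number_from_name(line["name"]) for line in transcribed]

basenames = [image_basename("page1", line["name"], line["id"]) for line in transcribed]
```

With this sample page, enumerate numbers the third line as 2, while reading the name keeps it at 3, so the file name still matches what is displayed on Arkindex.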
Activity
assigned to @starride
added 1 commit
- f83d7ac9 - rename images to {page_id}_{line_number}_{line_id} and read line number from element's name
requested review from @babadie
Yes, it's used to format datasets for HTR. I use it every time I need to train PyLaia on a new dataset, and I think @Chaza_Abdelwahab and @melodie.boillet also use it frequently.
I can try to implement unit tests, but this would be a first for me, so I would probably need some guidance. Could you share a reference project and/or other resources on how to write tests properly?
added 6 commits
- 7a6d75db...fd06f077 - 2 commits from branch master
- 5b7929b6 - rename images to {page_id}_{line_number}_{line_id} and read line number from element's name
- 6fd680d9 - Remove useless print
- dabceec7 - Renaming transcription files
- 7e589180 - Fix linting
I was able to test with kaldi-data-generator -f kaldi --dataset_name demo --out_dir xxx --pages d29439d3-5078-44ac-a30d-568eff2e483a --transcription_type transcription_line
It's a nice project, @martin_teklia put a lot of effort into it.
I'll add some issues regarding unit tests.
@starride You can already look at these slides: https://notes.teklia.com/p/IOhDTSui5#/
Once I have written the issues, we can discuss.