Generate syms.txt
Using the work done in https://gitlab.com/teklia/htr/pylaia_scripts/-/blob/main/pylaia_scripts/convert_to_pylaia.py:
- Port the `Syms` class
  - Remove unrelated functions like `set_phase` or `print_unk_chars`.
  - Create a new staticmethod `from_lines(lines: List[str])` that will create a new `Syms` object and fill its `_syms` attribute. This method will add every character (not spaces) from the given line transcriptions to `_syms`.
  - Remove unnecessary code from `char_replace`, as the `Partition` and `read_only` features will not be needed here.
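A minimal sketch of what `from_lines` could look like. The real `Syms` class in `convert_to_pylaia.py` is not reproduced here: the `add` helper, the dict layout of `_syms`, the sorted ordering, and skipping all whitespace (rather than only the space character) are assumptions made for illustration, and any special tokens the real class reserves are left out.

```python
from typing import Dict, List


class Syms:
    """Hypothetical minimal symbol table, not the actual pylaia_scripts class."""

    def __init__(self) -> None:
        # Maps each symbol to its integer index (assumed layout of `_syms`).
        self._syms: Dict[str, int] = {}

    def add(self, char: str) -> None:
        # Hypothetical helper: give unseen symbols the next free index.
        if char not in self._syms:
            self._syms[char] = len(self._syms)

    @staticmethod
    def from_lines(lines: List[str]) -> "Syms":
        syms = Syms()
        # Collect every non-whitespace character from all transcriptions.
        chars = {c for line in lines for c in line if not c.isspace()}
        # Sorting keeps the table deterministic across runs (an assumption).
        for char in sorted(chars):
            syms.add(char)
        return syms


syms = Syms.from_lines(["ab c", "b d"])
```

Returning a fresh object from a staticmethod keeps the construction logic in one place, so the calling script only needs the list of transcription strings.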
The plan is to use the ported `Syms` class in `generate_dataset.py`.
- Once we have the valid polygons, we will iterate a first time over the training dataset to retrieve every line's transcription (keep the first transcription found if there are any, skip the line otherwise).
- Then we create the syms instance using `syms = Syms.from_lines(lines)` and save it to disk using `syms.to_file("syms.txt")`.
- Then we can go over the lines again and call the existing `process_line` method, but with the tokenized transcription text obtained from `syms.process_line(transcription["text"])`.
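The two passes could be sketched as follows. Everything except the names `Syms.from_lines`, `to_file`, `process_line` and `transcription["text"]` is hypothetical: the line dictionaries, the `first_transcription` helper, the `symbol index` file layout, and the tokenization (one token per character, spaces written as `<space>`) are assumptions, not the behaviour of the actual scripts.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


class Syms:
    """Hypothetical stand-in for the ported class; method bodies are assumptions."""

    def __init__(self, chars: List[str]) -> None:
        self._syms: Dict[str, int] = {c: i for i, c in enumerate(chars)}

    @staticmethod
    def from_lines(lines: List[str]) -> "Syms":
        return Syms(sorted({c for line in lines for c in line if not c.isspace()}))

    def to_file(self, path: Path) -> None:
        # Assumed layout: one "symbol index" pair per line.
        rows = [f"{sym} {idx}" for sym, idx in self._syms.items()]
        Path(path).write_text("\n".join(rows) + "\n", encoding="utf-8")

    def process_line(self, text: str) -> str:
        # Assumed tokenization: one token per character, spaces as "<space>".
        return " ".join("<space>" if c == " " else c for c in text)


def first_transcription(line: dict) -> Optional[dict]:
    """Return the first transcription of a line, or None if it has none."""
    transcriptions = line.get("transcriptions") or []
    return transcriptions[0] if transcriptions else None


# Made-up training lines; "l2" has no transcription and is skipped.
training_lines = [
    {"id": "l1", "transcriptions": [{"text": "ab c"}, {"text": "ignored"}]},
    {"id": "l2", "transcriptions": []},
    {"id": "l3", "transcriptions": [{"text": "b d"}]},
]

# Pass 1: keep the first transcription of every line that has one.
kept = []
for line in training_lines:
    transcription = first_transcription(line)
    if transcription is not None:
        kept.append((line, transcription))

syms = Syms.from_lines([t["text"] for _, t in kept])
out_dir = Path(tempfile.mkdtemp())
syms.to_file(out_dir / "syms.txt")

# Pass 2: tokenize each kept transcription before the per-line processing.
tokenized = {line["id"]: syms.process_line(t["text"]) for line, t in kept}
```

Building the symbol table in a separate first pass means `syms.txt` is complete before any line is written out, at the cost of reading the training transcriptions twice.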
Edited by Yoann Schneider