RuntimeError no. 2 when formatting LM files
Here are the full error logs:
Extract logs
trainer_interface.cc(686) LOG(INFO) Saving model: outputs/hennessy/language_model/subword_tokenizer.model
Traceback (most recent call last):
  File "/home/users/ebardou/atr/dan/dan_venv/bin/teklia-dan", line 8, in <module>
    sys.exit(main())
  File "/home/users/ebardou/atr/dan/dan/cli.py", line 26, in main
    status = args.pop("func")(**args)
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 532, in run
    ).run()
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 482, in run
    self.format_lm_files()
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 377, in format_lm_files
    tokenizer = Tokenizer(
  File "<string>", line 10, in __init__
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/utils.py", line 189, in __post_init__
    spm.SentencePieceTrainer.train(
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 989, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 982, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 927, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
RuntimeError: Internal: src/trainer_interface.cc(661) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (1000). Please set it to a value <= 365.
To reproduce, you can use the following SQLite database: /home/users/ebardou/clients/exports/hennessy-letters-20231106-133340.sqlite
And this command:
teklia-dan dataset extract exports/hennessy-letters-20231106-133340.sqlite --train-folder ba7ac715-ae46-43f7-bb7b-1c1b8590af2a --val-folder 4f37d3b5-30db-40e6-bbd5-70a5e994daf3 --test-folder f2ee6f43-80bf-4ba8-9229-6b9d2564b9de --element-type single_page --parent-element-type single_page --output outputs/hennessy --entity-worker-version 5b642735-5244-4b6c-b045-d55e54eb8972 --tokens ~/atr/dan/tokens.yml --max-width 1250 --max-height 2500 2>&1 | tee extract_logs.log
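The RuntimeError in the logs reports that the requested vocabulary size (1000) exceeds what this corpus supports (at most 365). One possible workaround, sketched below, is to read the suggested limit from the error message and retry training once with it. This is only an illustration, not how `Tokenizer.__post_init__` currently behaves: `parse_vocab_limit` and `train_with_fallback` are hypothetical helper names, and the sketch assumes the training function accepts a `vocab_size` keyword argument.

```python
import re
from typing import Any, Callable, Optional

# Matches the hint printed in the error, e.g. "Please set it to a value <= 365."
_LIMIT_RE = re.compile(r"Please set it to a value <= (\d+)")


def parse_vocab_limit(message: str) -> Optional[int]:
    """Return the maximum vocabulary size suggested in the error, if any."""
    match = _LIMIT_RE.search(message)
    return int(match.group(1)) if match else None


def train_with_fallback(
    train: Callable[..., Any], vocab_size: int, **kwargs: Any
) -> int:
    """Call `train`, retrying once with the suggested limit.

    Hypothetical helper: returns the vocabulary size that was finally used.
    """
    try:
        train(vocab_size=vocab_size, **kwargs)
        return vocab_size
    except RuntimeError as exc:
        limit = parse_vocab_limit(str(exc))
        if limit is None or limit >= vocab_size:
            raise  # unrelated error, or no smaller size to try
        train(vocab_size=limit, **kwargs)
        return limit
```

With SentencePiece, `train` would presumably be `spm.SentencePieceTrainer.train`, called with the same keyword arguments as in `utils.py`. SentencePiece also has a `hard_vocab_limit` training option that, when set to false, treats `vocab_size` as a soft limit; whether that fits DAN's language model pipeline has not been checked here.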