
RuntimeError no. 2 when formatting LM files

Here are the full error logs:

Extract logs
trainer_interface.cc(686) LOG(INFO) Saving model: outputs/hennessy/language_model/subword_tokenizer.model
Traceback (most recent call last):
  File "/home/users/ebardou/atr/dan/dan_venv/bin/teklia-dan", line 8, in <module>
    sys.exit(main())
  File "/home/users/ebardou/atr/dan/dan/cli.py", line 26, in main
    status = args.pop("func")(**args)
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 532, in run
    ).run()
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 482, in run
    self.format_lm_files()
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 377, in format_lm_files
    tokenizer = Tokenizer(
  File "<string>", line 10, in __init__
  File "/home/users/ebardou/atr/dan/dan/datasets/extract/utils.py", line 189, in __post_init__
    spm.SentencePieceTrainer.train(
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 989, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 982, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
  File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 927, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
RuntimeError: Internal: src/trainer_interface.cc(661) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (1000). Please set it to a value <= 365.

To reproduce, you can use the following SQLite database: /home/users/ebardou/clients/exports/hennessy-letters-20231106-133340.sqlite

And this command:

teklia-dan dataset extract exports/hennessy-letters-20231106-133340.sqlite --train-folder ba7ac715-ae46-43f7-bb7b-1c1b8590af2a --val-folder 4f37d3b5-30db-40e6-bbd5-70a5e994daf3 --test-folder f2ee6f43-80bf-4ba8-9229-6b9d2564b9de --element-type single_page --parent-element-type single_page --output outputs/hennessy --entity-worker-version 5b642735-5244-4b6c-b045-d55e54eb8972 --tokens ~/atr/dan/tokens.yml --max-width 1250 --max-height 2500 2>&1 | tee extract_logs.log
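
The error message itself reports the largest vocabulary size SentencePiece accepts for this corpus (365, against the requested 1000). As a possible workaround, here is a minimal sketch that reads that limit from the exception and retries once with it. The helper names `max_vocab_size_from_error` and `train_with_fallback` are hypothetical and are not part of DAN or SentencePiece; the sketch only assumes the training callable accepts a `vocab_size` keyword, as `spm.SentencePieceTrainer.train` does in the traceback above.

```python
import re
from typing import Any, Callable, Optional

# Matches the limit reported by SentencePiece,
# e.g. "Please set it to a value <= 365."
_LIMIT_RE = re.compile(r"Please set it to a value <= (\d+)")


def max_vocab_size_from_error(message: str) -> Optional[int]:
    """Return the vocabulary limit found in the error message, or None."""
    match = _LIMIT_RE.search(message)
    return int(match.group(1)) if match else None


def train_with_fallback(
    train: Callable[..., Any], vocab_size: int, **kwargs: Any
) -> Any:
    """Call train(vocab_size=...), retrying once with the reported limit.

    Hypothetical helper: `train` is any callable that takes a `vocab_size`
    keyword and raises RuntimeError when the value is too high.
    """
    try:
        return train(vocab_size=vocab_size, **kwargs)
    except RuntimeError as error:
        limit = max_vocab_size_from_error(str(error))
        # Re-raise anything that is not the "vocabulary size too high" case.
        if limit is None or limit >= vocab_size:
            raise
        return train(vocab_size=limit, **kwargs)
```

In `Tokenizer.__post_init__` this could wrap the existing call, for instance `train_with_fallback(spm.SentencePieceTrainer.train, vocab_size=1000, ...)` with the other arguments left unchanged. Another option may be to pass SentencePiece's `hard_vocab_limit=False` training option, which treats `vocab_size` as a soft limit; whether that suits the LM files produced here is untested.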