RuntimeError when formatting LM files
Here are the full error logs:
Extract logs
Extracting data from (ba7ac715-ae46-43f7-bb7b-1c1b8590af2a) for split (train): 0it [00:00, ?it/s]
Extracting data from (4f37d3b5-30db-40e6-bbd5-70a5e994daf3) for split (val): 0it [00:00, ?it/s]
Extracting data from (f2ee6f43-80bf-4ba8-9229-6b9d2564b9de) for split (test): 0it [00:00, ?it/s]
Downloading images: 0it [00:00, ?it/s]
2023-11-06 14:55:24,898 INFO/dan.datasets.extract.arkindex: Preparing language resources
2023-11-06 14:55:24,898 INFO/dan.datasets.extract.utils: Training a sentencepiece model for subword tokenization
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: /home/users/ebardou/clients/outputs/hennessy/language_model/tmp_2zpq9um.txt
input_format:
model_prefix: outputs/hennessy/language_model/subword_tokenizer
model_type: UNIGRAM
vocab_size: 1000
self_test_sample_size: 0
character_coverage: 0.9995
input_sentence_size: 0
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
pretokenization_delimiter:
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
user_defined_symbols: Ⓐ
user_defined_symbols: ▁
user_defined_symbols: ◌
user_defined_symbols: ↵
user_defined_symbols: Ⓑ
user_defined_symbols: Ⓒ
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
enable_differential_privacy: 0
differential_privacy_noise_level: 0
differential_privacy_clipping_threshold: 0
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /home/users/ebardou/clients/outputs/hennessy/language_model/tmp_2zpq9um.txt
trainer_interface.cc(407) LOG(INFO) Loaded all 0 sentences
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: Ⓐ
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: ▁
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: ◌
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: ↵
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: Ⓑ
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: Ⓒ
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
Traceback (most recent call last):
File "/home/users/ebardou/atr/dan/dan_venv/bin/teklia-dan", line 8, in <module>
sys.exit(main())
File "/home/users/ebardou/atr/dan/dan/cli.py", line 26, in main
status = args.pop("func")(**args)
File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 532, in run
).run()
File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 482, in run
self.format_lm_files()
File "/home/users/ebardou/atr/dan/dan/datasets/extract/arkindex.py", line 377, in format_lm_files
tokenizer = Tokenizer(
File "<string>", line 10, in __init__
File "/home/users/ebardou/atr/dan/dan/datasets/extract/utils.py", line 189, in __post_init__
spm.SentencePieceTrainer.train(
File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 989, in Train
SentencePieceTrainer._Train(arg=arg, **kwargs)
File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 982, in _Train
return SentencePieceTrainer._TrainFromMap(new_kwargs)
File "/home/users/ebardou/atr/dan/dan_venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 927, in _TrainFromMap
return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
RuntimeError: Internal: src/trainer_interface.cc(429) [!sentences_.empty()]
To reproduce, you can use the following SQLite database: /home/users/ebardou/clients/exports/hennessy-letters-20231106-133340.sqlite
And this command:
teklia-dan dataset extract exports/hennessy-letters-20231106-133340.sqlite --train-folder ba7ac715-ae46-43f7-bb7b-1c1b8590af2a --val-folder 4f37d3b5-30db-40e6-bbd5-70a5e994daf3 --test-folder f2ee6f43-80bf-4ba8-9229-6b9d2564b9de --element-type single_page --output outputs/hennessy --tokens ~/atr/dan/tokens.yml --max-width 1250 --max-height 2500 2>&1 | tee extract_logs.log
Edit (YSC): we should crash here with an explicit message if self.data is empty.
Edited by Yoann Schneider
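For context: the logs show 0 items extracted for every split, so the temporary corpus file handed to SentencePiece is empty and training aborts at `[!sentences_.empty()]`. A minimal sketch of the suggested guard, assuming the check lives in the `Tokenizer` dataclass from `dan/datasets/extract/utils.py` (the field name `training_corpus` is illustrative, the real attribute may differ):

```python
from dataclasses import dataclass


@dataclass
class Tokenizer:
    # Text used to train the subword tokenizer (illustrative field name).
    training_corpus: str

    def __post_init__(self):
        # Fail fast with an explicit message instead of letting
        # SentencePiece crash with an opaque internal assertion.
        if not self.training_corpus.strip():
            raise ValueError(
                "Cannot train a sentencepiece model on an empty corpus: "
                "no data was extracted for any split. Check the element "
                "type and folder IDs passed to `teklia-dan dataset extract`."
            )
        # ...spm.SentencePieceTrainer.train(...) would run here...
```

This way the failure surfaces at extraction time with an actionable message rather than deep inside the SentencePiece bindings.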