Normalize wer computation
Closes #23 (closed)
Edited by Solene Tarride
Activity
assigned to @starride
I updated `format_string_for_wer` to remove the punctuation if `remove_punct=True`:

```python
>>> format_string_for_wer("Hello! This is a string, and it contains punctuation.", layout_tokens=None, remove_punct=False)
['Hello!', 'This', 'is', 'a', 'string,', 'and', 'it', 'contains', 'punctuation.']
>>> format_string_for_wer("Hello! This is a string, and it contains punctuation.", layout_tokens=None, remove_punct=True)
['Hello', 'This', 'is', 'a', 'string', 'and', 'it', 'contains', 'punctuation']
```
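For reference, a minimal sketch of what this behaviour amounts to, assuming the function simply strips Python's `string.punctuation` characters and splits on whitespace; the actual implementation (and its `layout_tokens` handling) may differ:

```python
import re
import string


def format_string_for_wer(text, layout_tokens=None, remove_punct=False):
    """Hypothetical sketch: tokenize a transcription into words for WER.

    Assumes punctuation is dropped with str.translate when remove_punct=True;
    layout_tokens handling is omitted in this sketch.
    """
    if remove_punct:
        # Drop ASCII punctuation characters before splitting into words
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace and split into a word list
    return re.sub(r"\s+", " ", text).strip().split(" ")
```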
Here is a training log:
```
EPOCH 84/50000: 100%|██████████████████████████| 7/7 [00:00<00:00, 10.38it/s, values={'loss_ce': 1.2848, 'cer': 0.3093, 'wer': 0.7069, 'wer_no_punct': 0.6897, 'syn_max_lines': 1.0, 'syn_prob_lines': 0.9}]
EPOCH 85/50000: 100%|██████████████████████████| 7/7 [00:00<00:00, 9.83it/s, values={'loss_ce': 1.3133, 'cer': 0.378, 'wer': 0.7416, 'wer_no_punct': 0.7273, 'syn_max_lines': 1.0, 'syn_prob_lines': 0.9}]
Evaluation E85: 100%|██████████████████████████| 7/7 [00:00<00:00, 8.26it/s, values={'cer': 1.0, 'wer': 1.0, 'wer_no_punct': 1.0}]
```
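For context, the `wer` / `wer_no_punct` values above correspond to a word-level edit distance over the lists produced by `format_string_for_wer`. Here is a hedged sketch with hypothetical helper names (plain Levenshtein, not necessarily how the training loop aggregates the metric over a batch):

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (unit cost for sub/ins/del)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_w in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_w in enumerate(hyp_words, start=1):
            cost = 0 if ref_w == hyp_w else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1]


def compute_wer(reference, hypothesis, remove_punct=False):
    # Normalize both strings the same way, then divide edits by the reference length
    ref = format_string_for_wer(reference, layout_tokens=None, remove_punct=remove_punct)
    hyp = format_string_for_wer(hypothesis, layout_tokens=None, remove_punct=remove_punct)
    return word_edit_distance(ref, hyp) / max(len(ref), 1)
```

Under these assumptions, `compute_wer(gt, pred, remove_punct=True)` would correspond to `wer_no_punct`.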
Note: I also fixed an indentation error in `get_syn_proba_lines`.
changed milestone to %ML Prod - December 2022 n°1
added P2 label