Do not decompose special tokens in `lexicon.txt`
When generating the lexicon file, special tokens (`<ctc>`, `<unk>`, `<space>`) should not be decomposed into characters.
This behavior is mentioned in the comments, but the code does not do anything special for these tokens.
We end up with a lexicon file that looks like:
<ctc> < c t c >
a a
b b
c c
<space> < s p a c e >
But we need:
- for a character-based LM
<ctc> <ctc>
a a
b b
c c
<space> <space>
- for a word-based LM
<ctc> <ctc>
vendredi v e n d r e d i
huit h u i t
septembre s e p t e m b r e
<space> <space>
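
The expected behavior can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `SPECIAL_TOKENS`, `lexicon_entry` and `build_lexicon` are hypothetical, and it assumes each lexicon line is a token followed by its space-separated units.

```python
# Hypothetical set of tokens that must stay atomic in the lexicon.
SPECIAL_TOKENS = {"<ctc>", "<unk>", "<space>"}


def lexicon_entry(token: str) -> str:
    """Return one lexicon line: the token, then its space-separated units."""
    if token in SPECIAL_TOKENS:
        # Special tokens map to themselves instead of being split.
        return f"{token} {token}"
    # Regular tokens are spelled out character by character.
    return f"{token} {' '.join(token)}"


def build_lexicon(tokens) -> str:
    """Join the entries for all tokens, one per line."""
    return "\n".join(lexicon_entry(token) for token in tokens)


if __name__ == "__main__":
    print(build_lexicon(["<ctc>", "vendredi", "huit", "<space>"]))
```

The same function covers both cases: with a character-based LM the tokens are single characters, so `a` yields `a a`, while with a word-based LM `huit` yields `huit h u i t`.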