Support subword and word language models
Activity
added P1 label
assigned to @starride
added 32 commits
- dcf4925f...04809033 - 3 commits from branch `main`
- 04809033...57684efb - 19 earlier commits
- c64f30f2 - Document prediction with language model
- 2a95fb27 - Document prediction command
- 9a638552 - Improve code, typing and logging
- c0967ee5 - Improve code
- 00d54e0c - Deal with unknown token separately
- 3c663e40 - Prepare language files for word and subword LM
- f5679edf - Add nltk and sentencepiece to requirements
- 0bd6e048 - Use the same space token as sentencepiece
- 13220228 - Write tests
- a74d6cf3 - Fix linting
added 31 commits
- 29b43932 - 1 commit from branch `main`
- 29b43932...4353f420 - 20 earlier commits
- 03d00d45 - Document prediction command
- 6edd3f6d - Improve code, typing and logging
- 7dea8f1e - Improve code
- 21f27601 - Deal with unknown token separately
- 26cb20bf - Prepare language files for word and subword LM
- b8457431 - Add nltk and sentencepiece to requirements
- 41fb23c0 - Use the same space token as sentencepiece
- 4e91958f - Write tests
- 69d92b9c - Fix linting
- 557c8489 - Fix numpy import
added 2 commits
added 37 commits
- f0e15553...568e5880 - 2 commits from branch `main`
- 568e5880...e1165b7b - 25 earlier commits
- 7712a2e9 - Use the same space token as sentencepiece
- 46d044cc - Write tests
- beee846d - Fix linting
- fb29ce22 - Fix numpy import
- c9dedfa1 - Fix rebase errors
- 9ccc888d - Fix doc
- 01f849de - Add vocabulary size parameter for subword tokenizer
- c54e717b - Write tests for data extraction with subword and word tokenization
- 510682db - Update documentation
- 619e1c5c - Replace linebreaks with spaces for LM
requested review from @yschneider
Here are the main changes:
- Dependency on `nltk` (word tokenizer) & `sentencepiece` (subword tokenizer)
- Training corpus tokenized at the end of data extraction (because the subword tokenizer has to be trained on the full training corpus)
- A special `Tokenizer` class to handle character/subword/word tokenization
- Additional files for word & subword language models created in `language_model/`
  - for subword LM: sentencepiece tokenizer, corpus, lexicon
  - for word LM: corpus, lexicon
- Updated documentation + examples with character, subword and word-level language models
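To make the tokenization levels concrete, here is a minimal sketch, not the code from this merge request. `tokenize` and `SPACE_TOKEN` are hypothetical names. The word level uses a plain whitespace split as a stand-in for `nltk`, and the subword level is left out because it needs a sentencepiece model trained on the full training corpus.

```python
SPACE_TOKEN = "▁"  # U+2581, the space marker used by sentencepiece


def tokenize(text: str, level: str) -> str:
    """Return `text` as space-separated tokens for a language model corpus.

    Hypothetical helper: only the "char" and "word" levels are sketched.
    """
    # Line breaks are replaced with spaces before tokenizing.
    text = text.replace("\n", " ")

    if level == "char":
        # One token per character; real spaces become the space token.
        return " ".join(SPACE_TOKEN if char == " " else char for char in text)

    if level == "word":
        # Assumption: whitespace split instead of the nltk word tokenizer.
        return f" {SPACE_TOKEN} ".join(text.split())

    raise ValueError(f"Unsupported tokenization level in this sketch: {level}")
```

For example, `tokenize("ab c", "char")` gives `a b ▁ c`, and `tokenize("the cat\nsat", "word")` gives `the ▁ cat ▁ sat`.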
- 15 threads resolved by Solene Tarride
added 41 commits
- ed9296bb...a267de74 - 2 commits from branch `main`
- a267de74...2d851ccf - 29 earlier commits
- 8c7dc8fe - Fix rebase errors
- d6dcd979 - Fix doc
- 1611527b - Add vocabulary size parameter for subword tokenizer
- 72bf711f - Write tests for data extraction with subword and word tokenization
- 0c471127 - Update documentation
- 1ade7b11 - Replace linebreaks with spaces for LM
- e972ac44 - Update docstring
- 8269a94d - Simplify code
- 23cd8204 - Improve code
- de28eb7f - Store ARPA model with git-lfs
I tested on Hugin Munin 10k at line-level and found some problems with `lexicon_subwords.txt` when trying to decode.
- `ValueError: Unknown entry in dictionary: '⁄'`
  → this token is not in the training set and should not appear in the lexicon
  ⁄ ⁄
- `ValueError: Unknown entry in dictionary: '́'`
  → this token is not in the training set and should not appear in the lexicon + the two tokens should be separated by a space
  ́ ́
- `ValueError: Unknown entry in dictionary: '̊'`
  → this token is not in the training set and should not appear in the lexicon + the two tokens should be separated by a space
  ̊ ̊

This is probably due to sentencepiece, I will investigate.
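Entries like these can be spotted before decoding. The sketch below is only an illustration: it assumes each lexicon line is an entry followed by its space-separated spelling, and that the known tokens are available as a set. `find_unknown_entries` is a hypothetical helper, not part of this merge request.

```python
def find_unknown_entries(lexicon_lines, known_tokens):
    """Return (line_number, symbol) pairs for spelling symbols missing from `known_tokens`.

    Assumed lexicon format: "<entry> <symbol> <symbol> ...", one entry per line.
    """
    problems = []
    for number, line in enumerate(lexicon_lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        # parts[0] is the entry itself, the rest is its spelling.
        for symbol in parts[1:]:
            if symbol not in known_tokens:
                problems.append((number, symbol))
    return problems
```

Running it on a lexicon containing the line `⁄ ⁄` with a token set that lacks `⁄` reports that line number together with the offending symbol.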
added 2 commits
added 1 commit
- d15506a4 - Add the unknown character to the list of tokens
added 2 commits
added 48 commits
- a572b293...307df1b7 - 3 commits from branch `main`
- 307df1b7...0b1af313 - 35 earlier commits
- 5a228234 - Update docstring
- c19f9607 - Simplify code
- d201e55d - Improve code
- 18a8ccb9 - Store ARPA model with git-lfs
- 9e3cde53 - Map unknown characters
- 87663643 - Encode text before checking for unknown characters
- c23b608a - Add the unknown character to the list of tokens
- e24f36c5 - Add unknown token to charset
- 82dab7a5 - Remove duplicate unknown token in tokens.txt
- d1eeb307 - Fix tests
- Resolved by Yoann Schneider
Now mapping tokens that are not in the charset to `unknown_token` (`⁇`).
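As a rough illustration of that mapping (a sketch under assumptions, not the actual implementation): `map_unknown` is a hypothetical function, and it works character by character, whereas the real code encodes the text with the tokenizer first.

```python
UNKNOWN_TOKEN = "⁇"  # U+2047, the unknown token mentioned above


def map_unknown(text: str, charset: set, unknown_token: str = UNKNOWN_TOKEN) -> str:
    """Replace every character that is not in `charset` with `unknown_token`."""
    return "".join(char if char in charset else unknown_token for char in text)
```

With a charset of `{"a", "b"}`, `map_unknown("a⁄b", {"a", "b"})` returns `a⁇b`, so the lexicon never sees the out-of-charset `⁄`.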
added 47 commits
- d1eeb307...095667f4 - 2 commits from branch `main`
- 095667f4...aab05828 - 35 earlier commits
- 12928ec0 - Update docstring
- 3ca6578c - Simplify code
- 2fac7868 - Improve code
- 6cf23a2d - Store ARPA model with git-lfs
- 0f64242b - Map unknown characters
- 1e3d7156 - Encode text before checking for unknown characters
- 7397a3bd - Add the unknown character to the list of tokens
- 5f5c1dcf - Add unknown token to charset
- 230d6cb0 - Remove duplicate unknown token in tokens.txt
- ef07d298 - Fix tests
- Resolved by Yoann Schneider
enabled an automatic merge when the pipeline for 1ac1fc7a succeeds