Support subword and word language models
Merge request reports
Activity
added P1 label
assigned to @starride
added 32 commits

- dcf4925f...04809033 - 3 commits from branch main
- 04809033...57684efb - 19 earlier commits
- c64f30f2 - Document prediction with language model
- 2a95fb27 - Document prediction command
- 9a638552 - Improve code, typing and logging
- c0967ee5 - Improve code
- 00d54e0c - Deal with unknown token separately
- 3c663e40 - Prepare language files for word and subword LM
- f5679edf - Add nltk and sentencepiece to requirements
- 0bd6e048 - Use the same space token as sentencepiece
- 13220228 - Write tests
- a74d6cf3 - Fix linting
added 31 commits

- 29b43932 - 1 commit from branch main
- 29b43932...4353f420 - 20 earlier commits
- 03d00d45 - Document prediction command
- 6edd3f6d - Improve code, typing and logging
- 7dea8f1e - Improve code
- 21f27601 - Deal with unknown token separately
- 26cb20bf - Prepare language files for word and subword LM
- b8457431 - Add nltk and sentencepiece to requirements
- 41fb23c0 - Use the same space token as sentencepiece
- 4e91958f - Write tests
- 69d92b9c - Fix linting
- 557c8489 - Fix numpy import
added 2 commits
added 37 commits
- f0e15553...568e5880 - 2 commits from branch main
- 568e5880...e1165b7b - 25 earlier commits
- 7712a2e9 - Use the same space token as sentencepiece
- 46d044cc - Write tests
- beee846d - Fix linting
- fb29ce22 - Fix numpy import
- c9dedfa1 - Fix rebase errors
- 9ccc888d - Fix doc
- 01f849de - Add vocabulary size parameter for subword tokenizer
- c54e717b - Write tests for data extraction with subword and word tokenization
- 510682db - Update documentation
- 619e1c5c - Replace linebreaks with spaces for LM
requested review from @yschneider
Here are the main changes:

- Dependency on nltk (word tokenizer) and sentencepiece (subword tokenizer)
- Training corpus tokenized at the end of data extraction (because the subword tokenizer has to be trained on the full training corpus)
- A special Tokenizer class to handle character/subword/word tokenization
- Additional files for word and subword language models created in language_model/:
  - for subword LM: sentencepiece tokenizer, corpus, lexicon
  - for word LM: corpus, lexicon
- Updated documentation + examples with character, subword and word-level language models
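To illustrate the shared space-token convention mentioned above, here is a minimal, self-contained sketch of character-level and word-level tokenization using sentencepiece's word-boundary marker "▁" (U+2581). The function names are illustrative only, not the actual Tokenizer API from this MR, and the word tokenizer is a naive whitespace split standing in for nltk:

```python
# Sketch only: illustrates the space-token convention, not the MR's code.
SPACE_TOKEN = "\u2581"  # sentencepiece's word-boundary marker "▁"

def char_tokenize(text: str) -> list[str]:
    """Split text into characters, mapping spaces (and line breaks,
    which are first collapsed to spaces) to the sentencepiece token."""
    text = " ".join(text.split())  # line breaks become spaces for the LM corpus
    return [SPACE_TOKEN if c == " " else c for c in text]

def word_tokenize(text: str) -> list[str]:
    """Naive whitespace word tokenizer (the MR relies on nltk instead)."""
    return text.split()

print(char_tokenize("ab c"))  # ['a', 'b', '▁', 'c']
print(word_tokenize("ab c"))  # ['ab', 'c']
```

Using the same space token at every tokenization level keeps the LM corpus and lexicon consistent with the sentencepiece subword vocabulary.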
- Resolved by Solene Tarride