Support subword and word language models (!287) · Merge requests · Automatic Text Recognition / DAN · GitLab


Merged. Solene Tarride requested to merge subword-and-word-lm into main 1 year ago

Closes #199 (closed)

Ref https://redmine.teklia.com/issues/4941

Activity

Solene Tarride added P1 label 1 year ago

Solene Tarride assigned to @starride 1 year ago

Solene Tarride added 5 commits 1 year ago

dd7021e2 - Prepare language files for word and subword LM

36c658db - Add nltk and sentencepiece to requirements

2cd63f5f - Use the same space token as sentencepiece

2e0d7544 - Write tests

dcf4925f - Fix linting

Solene Tarride added 32 commits 1 year ago

dcf4925f...04809033 - 3 commits from branch main

04809033...57684efb - 19 earlier commits

c64f30f2 - Document prediction with language model

2a95fb27 - Document prediction command

9a638552 - Improve code, typing and logging

c0967ee5 - Improve code

00d54e0c - Deal with unknown token separately

3c663e40 - Prepare language files for word and subword LM

f5679edf - Add nltk and sentencepiece to requirements

0bd6e048 - Use the same space token as sentencepiece

13220228 - Write tests

a74d6cf3 - Fix linting

Solene Tarride added 31 commits 1 year ago

29b43932 - 1 commit from branch main

29b43932...4353f420 - 20 earlier commits

03d00d45 - Document prediction command

6edd3f6d - Improve code, typing and logging

7dea8f1e - Improve code

21f27601 - Deal with unknown token separately

26cb20bf - Prepare language files for word and subword LM

b8457431 - Add nltk and sentencepiece to requirements

41fb23c0 - Use the same space token as sentencepiece

4e91958f - Write tests

69d92b9c - Fix linting

557c8489 - Fix numpy import

Solene Tarride added 1 commit 1 year ago

3c1a30ca - Fix rebase errors

Solene Tarride added 3 commits 1 year ago

c0d6f936 - Fix doc

95570042 - Add vocabulary size parameter for subword tokenizer

413038b2 - Write tests for data extraction with subword and word tokenization

Solene Tarride added 2 commits 1 year ago

550cf579 - Update documentation

f0e15553 - Replace linebreaks with spaces for LM

Solene Tarride added 37 commits 1 year ago

f0e15553...568e5880 - 2 commits from branch main

568e5880...e1165b7b - 25 earlier commits

7712a2e9 - Use the same space token as sentencepiece

46d044cc - Write tests

beee846d - Fix linting

fb29ce22 - Fix numpy import

c9dedfa1 - Fix rebase errors

9ccc888d - Fix doc

01f849de - Add vocabulary size parameter for subword tokenizer

c54e717b - Write tests for data extraction with subword and word tokenization

510682db - Update documentation

619e1c5c - Replace linebreaks with spaces for LM

Solene Tarride added 1 commit 1 year ago

369b51c6 - Update docstring

Solene Tarride added 1 commit 1 year ago

5adbebd6 - Simplify code

Solene Tarride requested review from @yschneider 1 year ago

Solene Tarride @starride · 1 year ago

Author Maintainer
Here are the main changes:

New dependencies: nltk (word tokenizer) and sentencepiece (subword tokenizer)

The training corpus is now tokenized at the end of data extraction, because the subword tokenizer has to be trained on the full training corpus

A dedicated Tokenizer class handles character, subword and word tokenization

Additional files for word and subword language models are created in language_model/:

for the subword LM: sentencepiece tokenizer, corpus, lexicon

for the word LM: corpus, lexicon

Updated documentation and examples covering character-, subword- and word-level language models
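
For readers who have not opened the diff, here is a rough sketch of how character-level and word-level tokenization relate to the corpus and lexicon files. It is not the code from this MR: the function names are hypothetical, a plain regex stands in for the nltk word tokenizer, the subword case (which needs a trained sentencepiece model) is left out, and the "token followed by its characters" lexicon layout is an assumption. The only details taken from the commits above are the sentencepiece space token and the replacement of linebreaks with spaces.

```python
import re

# U+2581: the marker sentencepiece uses for whitespace. Reusing it keeps the
# character, subword and word vocabularies consistent with each other.
SPACE = "▁"


def normalize(text: str) -> str:
    """Replace linebreaks (and any run of whitespace) with a single space."""
    return " ".join(text.split())


def char_tokenize(text: str) -> list[str]:
    """Character-level tokens, with spaces mapped to the SPACE marker."""
    return [SPACE if char == " " else char for char in normalize(text)]


def word_tokenize(text: str) -> list[str]:
    """Word-level tokens, with an explicit SPACE token between chunks.

    Assumption: the regex below is only a stand-in for a real word tokenizer
    such as the one nltk provides. It separates words from punctuation.
    """
    tokens: list[str] = []
    for index, chunk in enumerate(normalize(text).split(" ")):
        if index > 0:
            tokens.append(SPACE)
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def corpus_line(tokens: list[str]) -> str:
    """One corpus line: the tokens of a transcription, space-separated."""
    return " ".join(tokens)


def lexicon_lines(tokens: list[str]) -> list[str]:
    """One lexicon line per distinct token: the token, then its characters.

    Assumption: this layout is illustrative, not the exact format written
    to language_model/ by this MR.
    """
    return [f"{token} {' '.join(token)}" for token in sorted(set(tokens))]
```

With this sketch, `word_tokenize("Hello, world\nagain")` yields `['Hello', ',', '▁', 'world', '▁', 'again']`. The point is that the tokens of the whole training set have to be available before the lexicon can be written, which is one reason tokenization sits at the end of data extraction.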
Yoann Schneider @yschneider started 7 threads on old versions of the diff 1 year ago

All 7 threads resolved 1 year ago by Solene Tarride
