Add Language Model Decoder
Closes #142 (closed)
Taking over from !222 (closed)
Activity
changed milestone to %ML Prod - Next
added P1 label
assigned to @starride
added 1 commit
- a3d3370c - Use CTCHypothesis.tokens instead of CTCHypothesis.words
added 26 commits
- 482981a3...5e109064 - 5 commits from branch main
- 5e109064...8d268f96 - 11 earlier commits
- ee823c36 - Update tests for data extraction
- f94e2acb - Support batch_size>1
- a1551bc7 - Write tests for LM decoding
- b23826ea - Fix CTC token probability
- b9f4f3e8 - Use CTCHypothesis.tokens instead of CTCHypothesis.words
- b59f73e8 - Move tensor to correct device and trim prediction
- 06f5ef65 - Fix tests
- 28aaa477 - Simplify and document data extraction
- 51219ae4 - Document prediction with language model
- e28ddf96 - Document prediction command
Here are the main changes:
- the `teklia-dan dataset extract` command now also generates the resources needed to build an n-gram LM (default behavior):

  ```
  output/
  ├── charset.pkl
  ├── labels.json
  ├── images
  │   ├── train
  │   ├── val
  │   └── test
  ├── language_model
  │   ├── corpus.txt
  │   ├── lexicon.txt
  │   └── tokens.txt
  ```
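The MR does not spell out the file formats, but for character-level CTC decoders (e.g. torchaudio's `ctc_decoder`, which exposes the `CTCHypothesis` objects mentioned in the commits above), `tokens.txt` typically lists one token per line and `lexicon.txt` maps each word to its space-separated token spelling. A minimal sketch of how such entries could be derived from `corpus.txt`; the character-level spelling convention here is an assumption, not necessarily what `dataset extract` produces:

```python
def build_lm_resources(corpus_lines):
    """Derive token and lexicon entries from a text corpus.

    Assumption: character-level tokens, with each word spelled out as
    space-separated characters (torchaudio-style lexicon format).
    """
    tokens = set()
    lexicon = {}
    for line in corpus_lines:
        for word in line.split():
            tokens.update(word)
            # e.g. lexicon.txt line: "cat c a t"
            lexicon[word] = " ".join(word)
    return sorted(tokens), lexicon


corpus = ["the cat sat", "the dog"]
tokens, lexicon = build_lm_resources(corpus)
# tokens -> one symbol per line of tokens.txt
# lexicon -> one "word s p e l l i n g" line per word in lexicon.txt
```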
- the `teklia-dan predict` command supports a new argument `--use-language-model`. Note that the other LM parameters should be set in `inference_parameters.yml`:

  ```yaml
  parameters:
    ...
    language_model:
      model: path/to/language_model.arpa
      lexicon: path/to/lexicon.txt
      tokens: path/to/tokens.txt
      weight: 1.0
  ```
I have documented the prediction example here.
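For context on the `weight` parameter: in LM-fused CTC beam search, the language-model log-probability is typically added to the CTC score scaled by this weight (shallow fusion). A hypothetical sketch of that scoring rule, not the project's actual implementation:

```python
import math


def fused_score(ctc_logprob, lm_logprob, lm_weight=1.0):
    """Combine CTC and LM log-probabilities; higher is better.

    Hypothetical shallow-fusion rule: score = log P_ctc + weight * log P_lm.
    With lm_weight=0 the LM is ignored; larger weights trust the LM more.
    """
    return ctc_logprob + lm_weight * lm_logprob


# The LM can flip the ranking of two hypotheses:
a = fused_score(math.log(0.6), math.log(0.1), lm_weight=1.0)
b = fused_score(math.log(0.4), math.log(0.5), lm_weight=1.0)
# b > a: the second hypothesis wins despite its lower CTC score
```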