Support subword and word language models

Merged Solene Tarride requested to merge subword-and-word-lm into main

Activity

  • Yoann Schneider

    tests/data/prediction/language_model.arpa is not stored via Git LFS.

  • Solene Tarride added 1 commit

  • Solene Tarride added 1 commit

    • ed9296bb - Store ARPA model with git-lfs

  • Solene Tarride resolved all threads

  • Solene Tarride added 41 commits

  • Solene Tarride (Author, Maintainer)

    I tested on Hugin Munin 10k at line level and found some problems with lexicon_subwords.txt when decoding:

    • ValueError: Unknown entry in dictionary: '⁄' → this token is not in the training set and should not appear in the lexicon
    ⁄ ⁄
    • ValueError: Unknown entry in dictionary: '́' → this token is not in the training set and should not appear in the lexicon + the two tokens should be separated by a space
     ́ ́
    • ValueError: Unknown entry in dictionary: '̊' → this token is not in the training set and should not appear in the lexicon + the two tokens should be separated by a space
     ̊ ̊

    This is probably due to sentencepiece; I will investigate.
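
The kind of check described in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: it assumes each lexicon line holds an entry followed by its space-separated tokens, and `find_unknown_tokens`, `describe_token`, `sample_lexicon`, and `training_tokens` are hypothetical names introduced here. The code points U+2044, U+0301, and U+030A are assumed to be the characters reported in the errors above.

```python
import unicodedata


def find_unknown_tokens(lexicon_lines, known_tokens):
    """Return (line_number, token) pairs for tokens missing from known_tokens."""
    unknown = []
    for line_number, line in enumerate(lexicon_lines, start=1):
        parts = line.split()
        # Assumed layout: the first field is the entry, the rest are its tokens.
        for token in parts[1:]:
            if token not in known_tokens:
                unknown.append((line_number, token))
    return unknown


def describe_token(token):
    """List (code point, Unicode name, is_combining) for each character."""
    return [
        (f"U+{ord(char):04X}",
         unicodedata.name(char, "UNKNOWN"),
         unicodedata.combining(char) != 0)
        for char in token
    ]


# Hypothetical sample data: two valid entries and three suspicious ones.
sample_lexicon = [
    "a a",
    "ab a b",
    "\u2044 \u2044",
    "\u0301 \u0301",
    "\u030a \u030a",
]
training_tokens = {"a", "b"}

for line_number, token in find_unknown_tokens(sample_lexicon, training_tokens):
    print(line_number, describe_token(token))
```

Printing code points rather than the raw characters helps here because combining marks attach to the preceding character when rendered, which may be why the space between the two fields looks missing in the lexicon lines above.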

  • Solene Tarride marked this merge request as draft

  • Solene Tarride added 2 commits

    • 58b0789a - Map unknown characters
    • 53364dd6 - Encode text before checking for unknown characters

  • Solene Tarride added 1 commit

    • d15506a4 - Add the unknown character to the list of tokens

  • Solene Tarride added 1 commit

    • 14c391e5 - Add unknown token to charset

  • Solene Tarride added 2 commits

  • Solene Tarride added 48 commits

  • Solene Tarride marked this merge request as ready

  • Solene Tarride added 47 commits

  • Yoann Schneider resolved all threads

  • Solene Tarride added 1 commit

  • Yoann Schneider approved this merge request

  • Yoann Schneider resolved all threads

  • Yoann Schneider added 1 commit

    • 1ac1fc7a - Apply 1 suggestion(s) to 1 file(s)

  • Yoann Schneider enabled an automatic merge when the pipeline for 1ac1fc7a succeeds
