
Support subword and word language models

Merged Solene Tarride requested to merge subword-and-word-lm into main
1 file changed: +1 -1
@@ -370,7 +370,7 @@ class ArkindexExtractor:
 ), "Tokens should be single characters."
 # Build LM corpus
-train_corpus = [text for text in self.data["train"].values()]
+train_corpus = [text.replace("\n", " ") for text in self.data["train"].values()]
 tokenizer = Tokenizer(
     train_corpus,
     outdir=self.output / "language_model",
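
For reviewers, here is a minimal sketch of what the changed line does. It assumes that `self.data["train"]` maps an identifier to a transcription string. The `train` dictionary and the `build_corpus` helper are hypothetical stand-ins for illustration and are not part of the codebase. The change presumably lets line breaks act as ordinary word separators in the language model corpus, but the diff itself does not state the motivation.

```python
# Hypothetical stand-in for self.data["train"]: identifier -> transcription.
train = {
    "page_1": "Hello\nworld",
    "page_2": "single line",
}


def build_corpus(split):
    """Flatten each transcription to a single line, mirroring the added diff line."""
    return [text.replace("\n", " ") for text in split.values()]


corpus = build_corpus(train)
print(corpus)
```

Each newline becomes one space, so a transcription spanning several lines ends up as a single line of space-separated words. Texts without newlines pass through unchanged.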