Train with synthetic documents

In the original code, each dataset needs its own function to generate synthetic documents. I want to implement a simple, dataset-independent synthetic paragraph generator that concatenates synthetic lines.
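A rough sketch of what such a generator could look like, assuming a line generator is already available. The names `generate_paragraph`, `generate_line` and `background` are hypothetical and not part of the existing code; images are represented as plain lists of pixel rows only to keep the example self-contained.

```python
from typing import Callable, List, Tuple

# Grayscale image as a list of pixel rows (assumption made for this sketch).
Image = List[List[int]]


def generate_paragraph(
    generate_line: Callable[[], Tuple[Image, str]],
    n_lines: int,
    background: int = 255,
) -> Tuple[Image, str]:
    """Stack n_lines synthetic line images vertically and join their labels.

    generate_line is a hypothetical callable returning (line_image, line_text).
    """
    if n_lines < 1:
        raise ValueError("n_lines must be at least 1")

    lines = [generate_line() for _ in range(n_lines)]

    # Pad every row on the right so that all lines share the same width.
    width = max((len(row) for image, _ in lines for row in image), default=0)
    paragraph: Image = []
    for image, _ in lines:
        for row in image:
            paragraph.append(row + [background] * (width - len(row)))

    # The paragraph transcription is the line transcriptions joined by newlines.
    label = "\n".join(text for _, text in lines)
    return paragraph, label
```

Because the function only depends on the line generator it receives, the same code could be reused for every dataset instead of writing one function per dataset.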

This will be useful to train DAN from scratch using synthetic lines/documents and curriculum learning. Here is the pre-training procedure:

  • train on synthetic lines
  • progressively increase the number of lines (= train on synthetic paragraphs)
  • progressively increase the % of real documents
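
The three steps above could be driven by a schedule along these lines. This is only a sketch: the class name, the parameter names and all default values are assumptions chosen for illustration, and sampling the number of lines uniformly up to the current maximum is one possible choice among others.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Curriculum:
    # All values below are illustrative assumptions, not tuned settings.
    max_lines: int = 10               # upper bound on lines per synthetic paragraph
    steps_per_extra_line: int = 1000  # add one line every N training steps
    real_start_step: int = 5000       # step at which real documents start to appear
    real_ramp_steps: int = 10000      # number of steps to reach max_real_ratio
    max_real_ratio: float = 0.8       # final proportion of real documents

    def n_lines(self, step: int) -> int:
        """Maximum number of lines per paragraph: starts at 1, grows with step."""
        return min(self.max_lines, 1 + step // self.steps_per_extra_line)

    def real_ratio(self, step: int) -> float:
        """Proportion of real documents: 0 at first, then a linear ramp."""
        progress = (step - self.real_start_step) / self.real_ramp_steps
        return self.max_real_ratio * min(1.0, max(0.0, progress))


def pick_sample_kind(
    step: int, curriculum: Curriculum, rng: random.Random
) -> Tuple[str, Optional[int]]:
    """Return ("real", None) or ("synthetic", number_of_lines) for this step."""
    if rng.random() < curriculum.real_ratio(step):
        return "real", None
    return "synthetic", rng.randint(1, curriculum.n_lines(step))
```

With these defaults, training starts on single synthetic lines, reaches paragraphs of up to 10 lines after 9000 steps, and mixes in real documents from step 5000 onwards.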
Edited by Solene Tarride