Train with synthetic documents
In the original code, you had to define one function per dataset to generate synthetic documents. I want to implement a simple synthetic paragraph generator that concatenates synthetic lines together.
This will be useful to train DAN from scratch using synthetic lines/documents and curriculum learning. Here is the pre-training procedure:
- train on synthetic lines
- progressively increase the number of lines (= train on synthetic paragraphs)
- progressively increase the proportion of real documents
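
A minimal sketch of what this could look like, working on text only (rendering the lines to images is left out). All names here (`make_synthetic_paragraph`, `curriculum_config`, `sample_document`, `max_lines`, `max_real_ratio`) are hypothetical and not part of the existing code. The schedule is also an assumption: the line count ramps up linearly over the first half of training and the share of real documents over the second half.

```python
import random


def make_synthetic_paragraph(generate_line, num_lines):
    """Concatenate `num_lines` synthetic lines into one paragraph.

    `generate_line` is any zero-argument callable returning one line of text
    (hypothetical: it stands in for a dataset-specific line generator).
    """
    return "\n".join(generate_line() for _ in range(num_lines))


def curriculum_config(step, total_steps, max_lines=10, max_real_ratio=0.8):
    """Return (max number of lines, probability of a real document) at `step`.

    Assumed schedule: lines ramp up during the first half of training,
    real documents during the second half.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    line_progress = min(progress * 2, 1.0)
    real_progress = max(progress * 2 - 1, 0.0)
    max_num_lines = 1 + int(line_progress * (max_lines - 1))
    real_ratio = real_progress * max_real_ratio
    return max_num_lines, real_ratio


def sample_document(step, total_steps, generate_line, real_documents, rng=random):
    """Draw one training sample according to the curriculum at `step`."""
    max_num_lines, real_ratio = curriculum_config(step, total_steps)
    if real_documents and rng.random() < real_ratio:
        return rng.choice(real_documents)
    # Synthetic sample: a paragraph of 1 to max_num_lines lines.
    num_lines = rng.randint(1, max_num_lines)
    return make_synthetic_paragraph(generate_line, num_lines)
```

With this schedule, training starts on single synthetic lines, reaches paragraphs of up to `max_lines` lines at the midpoint, and only then starts mixing in real documents. A single generic generator like this would replace the per-dataset functions, since only `generate_line` depends on the dataset.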
Edited by Solene Tarride