Add support for prompting
This issue is meant to serve as the basis for the upcoming merge request. I wanted to label it as "enhancement" or "new feature" but could not find such a label; please point me to the right process for you.
I created a first implementation of basic, textual, token-based prompting for DAN, and would like to share it with you to make sure it is useful and can eventually be merged properly. And to hunt down weird bugs as well…
Basically, for the user, we would have a new dataset format with an extra "prompt" field, which can be left blank to keep the original behavior. Here is an excerpt for a "post-OCR" use case, where the prompt is the noisy prediction and the target is the actual ground truth:
```json
{
  "test": [
    // first sample
    {
      "image": "images/test/032ae384-77dd-4ec8-868a-4348dd74897f/017ddb7f-6a6d-4ecd-a0f5-71f13f99e8d4.jpg",
      "target": "\u24c291\u24c3 \u24baidem\u24bb \u24c011\u24c1 \u24bebois\u24bf",
      "prompt": "\u24baidem\u24bb \u24c011\u24c1 \u24bebois\u24bf"
    },
    // second sample
    {
      "image": "images/test/032ae384-77dd-4ec8-868a-4348dd74897f/01ca2192-6b45-4ec5-9dbc-d1357a7365bb.jpg",
      "target": "\u24bc\u00a7\u24bd \u24baLeblanc Jean Joseph (les h\u00e9ritiers)\u2192\u00e0 ablon\u24bb \u24c029\u24c1 \u24beMaison\u24bf",
      "prompt": "\u24bc\u00a7\u24bd \u24baLellane Jean (les heritiers)\u2192 \u24c0ablon 29\u24c1 \u24beMasson\u24bf"
    }
    // ...
  ],
  "val": [
    // ...
  ],
  "train": [
    // ...
  ]
}
```
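For reference, here is a minimal sketch of how a loader could expose the new field; the class name and path conventions below are hypothetical, not the actual DAN code, and an absent or blank prompt falls back to the original behavior:

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class PromptedATRDataset(Dataset):
    """Hypothetical loader for the extended format: each sample carries the
    usual "image"/"target" pair plus an optional "prompt" string."""

    def __init__(self, labels_path: str, split: str = "train"):
        with open(labels_path, encoding="utf-8") as f:
            self.samples = json.load(f)[split]
        # Image paths are assumed relative to the labels file's directory.
        self.root = Path(labels_path).parent

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        image = Image.open(self.root / sample["image"]).convert("RGB")
        # A missing or blank prompt keeps the original, un-prompted behavior.
        prompt = sample.get("prompt", "")
        return image, prompt, sample["target"]
```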
Prompting an ATR system can be seen as a straightforward way to inject prior knowledge to improve predictions, with extra input tokens if needed (a sketch of the token mechanics follows the list below). Here are some possible uses:
- indicate which page should be processed (left or right), as in previous experiments from Teklia
- indicate the start of each line, as in the Faster-DAN paper
- indicate the structure of the document to recognize (for instance with the ordered, nested set of entity tags to fill) or its subclass (specific layout)
- provide the transcription predicted at some previous iteration, opening the way to iterative processing, hopefully removing the need for a post-OCR language model and eventually merging both models into a single one (albeit called multiple times); this also suggests more curriculum-learning directions
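To make the token mechanics concrete, here is a minimal sketch of how prompt tokens could be prepended to the decoder sequence during teacher-forced training; this is not DAN's actual code, and the function and constant names are placeholders:

```python
import torch

IGNORE_INDEX = -100  # positions excluded from the loss


def build_decoder_io(prompt_ids, target_ids, sos_id, eos_id):
    """Prepend the prompt tokens to the decoder sequence.

    prompt_ids / target_ids: 1-D LongTensors of token ids (the prompt may
    be empty). Returns (decoder_input, labels) for teacher forcing: the
    model conditions on <sos> + prompt + target, but is only supervised on
    the target part, never asked to predict the prompt itself.
    """
    full = torch.cat([
        torch.tensor([sos_id]),
        prompt_ids,
        target_ids,
        torch.tensor([eos_id]),
    ])
    decoder_input = full[:-1]
    labels = full[1:].clone()
    # The first len(prompt_ids) predicted positions correspond to prompt
    # tokens, which are given rather than predicted: mask them out.
    labels[: len(prompt_ids)] = IGNORE_INDEX
    return decoder_input, labels


# The training loss would then ignore the masked prefix, e.g.:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, vocab_size), labels.view(-1), ignore_index=IGNORE_INDEX
# )
```

With an empty prompt this degenerates to the usual training setup; at inference time, the prompt tokens would simply be force-fed as the first decoding steps before free-running generation starts.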
If this proves useful, ATR-prompting may open the way toward:
- advanced prompting with geometrical regions or specific key/class (a bit like in modern detection architectures) or with injection of extra information from the encoder at the cross-attention stage (would require tweaking the encoder);
- iterative analysis of documents, enabling progressive refinement of the output and eventually removing the need for the system to perfectly "guess" some structuring elements before fully "seeing" them (again, much like the Faster-DAN idea; a toy refinement loop is sketched after this list);
- 1-shot layout analysis and transcription (would require some foundational training);
- composition of specialized units (macro- vs. micro-structure), a bit like what was once performed with handcrafted 2D grammars and expert systems…
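To make the iterative idea above concrete, here is a toy refinement loop; the `model.predict(image, prompt=...)` interface is an assumption for illustration, not an existing API:

```python
def iterative_transcribe(model, image, max_rounds: int = 3) -> str:
    """Hypothetical progressive refinement: re-prompt the model with its
    own previous output, stopping once the transcription stabilizes."""
    prediction = model.predict(image, prompt="")  # first pass, un-prompted
    for _ in range(max_rounds - 1):
        refined = model.predict(image, prompt=prediction)
        if refined == prediction:  # fixed point reached, stop early
            break
        prediction = refined
    return prediction
```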
Any comment on the value of such a feature is welcome here, and I'll try to provide more details about the implementation in the related MR.