New bulk endpoint to create transcription entities

We are publishing many more entities on transcriptions thanks to the amazing power of DAN (for example, socface has hundreds of entities on a single transcription).

Publishing these entities is currently slow because it calls atomic endpoints to create each entity, then each transcription entity (which is OK for smaller projects / pages).

We need a new bulk endpoint CreateTranscriptionEntities which would allow an API user to create a lot of entities on a single transcription.

The payload would be (all fields are required by default):

  • transcription_id: UUID of the transcription with which all entities will be associated
  • worker_run_id: UUID of the worker run that publishes the data (no manual access allowed)
  • entities: list of entities and their positions:
    • type_id: UUID of the entity type
    • name: name of the entity
    • offset: offset of the entity on the transcription
    • length: length of the entity on the transcription
    • confidence: confidence score of the transcription entity
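
As an illustration of the payload described above, here is a minimal Python sketch that builds and serializes a request body. The field names come from the list above; the Python types (integer offset and length, float confidence), the `EntityItem` and `build_payload` helpers, and the sample values are assumptions made for the example, not part of the specification.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass
class EntityItem:
    """One entry of the `entities` list (types are assumed, not specified)."""
    type_id: str       # UUID of the entity type
    name: str          # name of the entity
    offset: int        # offset of the entity on the transcription
    length: int        # length of the entity on the transcription
    confidence: float  # confidence score of the transcription entity


def build_payload(transcription_id, worker_run_id, items):
    """Assemble the request body; all fields are required."""
    return {
        "transcription_id": transcription_id,
        "worker_run_id": worker_run_id,
        "entities": [asdict(item) for item in items],
    }


# Random UUIDs stand in for real transcription, worker run and type IDs.
payload = build_payload(
    str(uuid.uuid4()),
    str(uuid.uuid4()),
    [
        EntityItem(
            type_id=str(uuid.uuid4()),
            name="Marie Dupont",
            offset=0,
            length=len("Marie Dupont"),
            confidence=0.97,
        )
    ],
)
print(json.dumps(payload, indent=2))
```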

Fields we do not support:

  • CreateEntity.validated: is always set to True, no need to use that here
  • CreateEntity.corpus: we have that information from the transcription's element
  • CreateEntity.metas: we do not need it for socface, and we probably won't use it much with entity types available

The endpoint workflow would be:

  1. rights check (user has contributor access on the corpus) + worker run exists
  2. check if any entities exist on that transcription for this worker run:
    • raise a 400 if any entities already exist: we do not want to manage conflict here, just to be as fast as possible
  3. open db transaction:
    1. create all Entity instances through bulk create
    2. create all TranscriptionEntity instances through bulk create
  4. return the list of IDs for the created instances:
{
   "entities": [
     {
       "transcription_entity_id": "...",
       "entity_id": "..."
     }
   ]
}
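
The workflow above could be sketched roughly as follows. This is a self-contained illustration using Python's standard `sqlite3` module rather than the actual backend code: the table layout, the `make_schema` and `create_transcription_entities` helpers, and the `EntitiesAlreadyExist` exception are all hypothetical. The rights check from step 1 is omitted, and the exception stands in for the HTTP 400 response.

```python
import sqlite3
import uuid


class EntitiesAlreadyExist(Exception):
    """Stand-in for the HTTP 400 returned by step 2."""


def make_schema(conn):
    # Hypothetical, simplified tables for the sake of the example.
    conn.execute(
        "CREATE TABLE entity (id TEXT PRIMARY KEY, type_id TEXT, "
        "name TEXT, worker_run_id TEXT)"
    )
    conn.execute(
        "CREATE TABLE transcription_entity (id TEXT PRIMARY KEY, "
        "entity_id TEXT, transcription_id TEXT, worker_run_id TEXT, "
        "offset INTEGER, length INTEGER, confidence REAL)"
    )


def create_transcription_entities(conn, transcription_id, worker_run_id, entities):
    # Step 2: fail fast if this worker run already published entities
    # on the transcription; no conflict handling is attempted.
    exists = conn.execute(
        "SELECT 1 FROM transcription_entity "
        "WHERE transcription_id = ? AND worker_run_id = ? LIMIT 1",
        (transcription_id, worker_run_id),
    ).fetchone()
    if exists:
        raise EntitiesAlreadyExist("entities already exist for this worker run")

    # Generate IDs up front so the response can be built without re-querying.
    entity_rows, link_rows, created = [], [], []
    for item in entities:
        entity_id = str(uuid.uuid4())
        link_id = str(uuid.uuid4())
        entity_rows.append((entity_id, item["type_id"], item["name"], worker_run_id))
        link_rows.append(
            (link_id, entity_id, transcription_id, worker_run_id,
             item["offset"], item["length"], item["confidence"])
        )
        created.append({"transcription_entity_id": link_id, "entity_id": entity_id})

    # Step 3: one transaction, two bulk inserts. The connection context
    # manager commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany("INSERT INTO entity VALUES (?, ?, ?, ?)", entity_rows)
        conn.executemany(
            "INSERT INTO transcription_entity VALUES (?, ?, ?, ?, ?, ?, ?)",
            link_rows,
        )

    # Step 4: return the IDs of the created instances.
    return {"entities": created}


demo_conn = sqlite3.connect(":memory:")
make_schema(demo_conn)
response = create_transcription_entities(
    demo_conn,
    "transcription-1",
    "worker-run-1",
    [{"type_id": "type-1", "name": "Paris", "offset": 3, "length": 5, "confidence": 0.9}],
)
print(response)
```

Generating the UUIDs before the inserts means the response needs no extra query, and refusing any pre-existing entities keeps the endpoint to one existence check plus two bulk writes.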
Edited by Bastien Abadie