Change white-space setting for transcription entities

Erwan Rouchet requested to merge transcription-entity-white-space into master

Closes #797 (closed)

After some testing and some reading of the docs on white-space, it turns out that just removing the white-space property would cause empty lines to be removed, as well as leading and trailing spaces. This could make troubleshooting harder for NER worker developers, and the display would no longer represent the exact transcription as it is stored in the backend. break-spaces both preserves all spaces as they are and wraps on anything that counts in Unicode as whitespace, which gives the most opportunities for wrapping and helps avoid the overflow issues. The real issue, however, is not in the regular text; it is in any text inside of a TranscriptionEntity.
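To illustrate what gets lost without a preserving white-space value, here is a rough approximation of the default collapsing behaviour. It is only a sketch: the function name and the sample string are made up for this example, and the browser's real white-space processing has more rules than this.

```typescript
// Rough approximation of how the default `white-space: normal` treats text:
// runs of spaces, tabs and line breaks collapse to one space, and the
// leading/trailing whitespace disappears. Not a full CSS implementation.
function collapseLikeNormal(text: string): string {
  return text.replace(/[ \t\r\n]+/g, " ").trim();
}

// Hypothetical stored transcription with leading spaces and an empty line.
const stored = "  Hello\n\nmy name is  Brian ";
const collapsed = collapseLikeNormal(stored);
// collapsed is "Hello my name is Brian": the empty line and the extra
// spaces that exist in the stored text are no longer visible.
```

With break-spaces, none of these characters are dropped, so what is displayed matches what is stored.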

I looked at some edge cases such as this one, where multiple TranscriptionEntity components span multiple lines. That component already attempted to deal with this situation by removing line breaks, but it was actually only removing the first line break, so it was not really useful. I changed it to replace every single line break with a space, and also tried removing line breaks entirely to see what happens.
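A minimal sketch of the difference between the three behaviours, assuming the old code passed a plain string pattern to String.prototype.replace (which only replaces the first match). The variable names and sample text are illustrative and are not taken from the component.

```typescript
// Hypothetical entity text containing two line breaks.
const entityText = "Brian\nof the\nKitchen";

// A string pattern only replaces the first occurrence.
const firstOnly = entityText.replace("\n", " ");
// "Brian of the\nKitchen"

// A global regular expression replaces every line break with a space.
const everyBreak = entityText.replace(/\n/g, " ");
// "Brian of the Kitchen"

// Removing line breaks instead of replacing them glues words together.
const removed = entityText.replace(/\n/g, "");
// "Brianof theKitchen"
```

The last variant shows why removing line breaks outright can leave neighbouring words, or neighbouring entities, stuck together.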

master

Whitespace is preserved everywhere with the exception of the very first line break of each Token (each block of text between entities or inside an entity), which ends up removing most of the line breaks.

image

Preserving newlines inside entities

Whitespace is preserved as much as possible. Text is only wrapped when it reaches the entire length of the blockquote, so any line breaks seen inside entities are actual line breaks found in the transcription's text.

image

Removing newlines inside entities

Whitespace is preserved entirely outside of entities, then within any entity, all line breaks are removed. This can still result in entities that span multiple lines. Note that some entities that were previously separated by a line break are now completely stuck together.

image


I don't think we can really say "we do not support line breaks", because a normal transcription can definitely contain an entity that starts on one line and ends on the next:

Hello my name is Brian
                 ^^^^^
of the Kitchen.
^^^^^^^^^^^^^^

Another possibility would be to update parseEntities so that it creates new tokens that would be nothing other than line breaks and split TranscriptionEntities by line. This could still create some confusion since we have some line wrapping enabled.
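As a sketch of what that splitting could look like: the Token shape, the lineBreak flag and the splitTokensByLine helper below are hypothetical and only stand in for whatever parseEntities really returns.

```typescript
// Hypothetical token shape; the real parseEntities output may differ.
interface Token {
  text: string;
  entity?: string; // entity name, absent for plain text
  lineBreak?: boolean; // true for tokens that are only a line break
}

// Split every token at line breaks, emitting a dedicated line break token
// between the pieces so that no entity token spans more than one line.
function splitTokensByLine(tokens: Token[]): Token[] {
  const result: Token[] = [];
  for (const token of tokens) {
    const lines = token.text.split("\n");
    lines.forEach((line, index) => {
      if (index > 0) result.push({ text: "\n", lineBreak: true });
      // Skip empty pieces, e.g. when the text starts or ends with a break.
      if (line.length > 0) result.push({ ...token, text: line });
    });
  }
  return result;
}

const tokens: Token[] = [
  { text: "Hello my name is " },
  { text: "Brian\nof the Kitchen", entity: "person" },
  { text: "." },
];
const split = splitTokensByLine(tokens);
// split now holds five tokens: the plain text, a "Brian" entity, a line
// break token, an "of the Kitchen" entity, and the final ".".
```

Each half of the entity keeps its entity name, so both halves could still be rendered and highlighted as the same entity.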

Maybe the only real option is to remove line breaks absolutely everywhere until the user clicks a Wrap lines switch that causes the blockquote to get a horizontal scrollbar?

2021-10-21-174624_558x806_scrot

Edited by Erwan Rouchet
