Change white-space setting for transcription entities
Closes #797
After some testing and some reading of the docs on white-space, it turns out that just removing white-space would cause empty lines to be removed, as well as leading or trailing spaces. This would make troubleshooting harder for NER worker developers and would not represent the exact transcription as it is stored in the backend. break-spaces both preserves all spaces as they are and allows wrapping at anything that counts in Unicode as whitespace, which gives the most opportunities for wrapping and helps avoid the overflow issues. But the real issue is not in the regular text; it is in any text inside a TranscriptionEntity.
I looked at some edge cases such as this one, where several TranscriptionEntity instances span multiple lines. That component already tried to deal with this situation by removing line breaks, but it only removed the first line break, so it was not really useful. I changed it to replace every line break with a space, and also tried removing line breaks entirely to see what happens.
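The three behaviours compared below can be sketched as follows. The component's actual code is not shown here, so this is only an illustration: it assumes the "first line break only" behaviour came from passing a string pattern to String.prototype.replace, which replaces a single occurrence. The sample text and variable names are hypothetical.

```typescript
// Hypothetical entity text containing two line breaks.
const entityText: string = "Brian\nof the\nKitchen";

// A string pattern replaces only the first match, so the second
// line break survives (assumed to be the old behaviour).
const firstOnly: string = entityText.replace("\n", " ");

// A global regular expression replaces every line break with a space.
const allToSpaces: string = entityText.replace(/\n/g, " ");

// Removing every line break glues neighbouring words together.
const allRemoved: string = entityText.replace(/\n/g, "");

console.log(JSON.stringify(firstOnly));   // "Brian of the\nKitchen"
console.log(JSON.stringify(allToSpaces)); // "Brian of the Kitchen"
console.log(JSON.stringify(allRemoved));  // "Brianof theKitchen"
```

The last line shows why plain removal can leave text "stuck together": nothing separates the words that the line break used to separate.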
master
Whitespace is preserved everywhere with the exception of the very first line break of each Token (each block of text between entities or inside an entity), which ends up removing most of the line breaks.
Preserving newlines inside entities
Whitespace is preserved as much as possible. Text is only wrapped when it reaches the entire length of the blockquote, so any line breaks seen inside entities are actual line breaks found in the transcription's text.
Removing newlines inside entities
Whitespace is preserved entirely outside of entities, while within any entity all line breaks are removed. Entities can still span multiple lines because of wrapping. Note that some entities that were previously separated by a line break are now completely stuck together.
I don't think we can really say "we do not support line breaks", because a normal transcription can definitely contain an entity that continues on the next line:
Hello my name is Brian
                 ^^^^^
of the Kitchen.
^^^^^^^^^^^^^^
Another possibility would be to update parseEntities so that it creates new tokens that would be nothing other than line breaks and split TranscriptionEntities by line. This could still create some confusion since we have some line wrapping enabled.
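As a rough sketch of that idea, the snippet below splits each token at line breaks and emits a dedicated line-break token between the pieces. The Token shape, the splitTokenByLine helper, and the sample data are all hypothetical. The real parseEntities output may look quite different, so this only illustrates the splitting step, not an actual change to that function.

```typescript
// Hypothetical token shape; the real tokens may carry other fields.
interface Token {
  text: string;
  entityId: string | null; // null for plain text between entities
  isLineBreak: boolean;
}

// Split one token at every "\n", inserting a standalone line-break token
// between the pieces. Each piece keeps the entity of the original token.
function splitTokenByLine(token: Token): Token[] {
  const result: Token[] = [];
  token.text.split("\n").forEach((part, index) => {
    if (index > 0) {
      result.push({ text: "\n", entityId: null, isLineBreak: true });
    }
    if (part.length > 0) {
      result.push({ text: part, entityId: token.entityId, isLineBreak: false });
    }
  });
  return result;
}

// Sample data: one entity spanning two lines.
const tokens: Token[] = [
  { text: "Hello my name is ", entityId: null, isLineBreak: false },
  { text: "Brian\nof the Kitchen", entityId: "person-1", isLineBreak: false },
  { text: ".", entityId: null, isLineBreak: false },
];

const splitTokens: Token[] = [];
for (const token of tokens) {
  splitTokens.push(...splitTokenByLine(token));
}

console.log(splitTokens.map((t) => (t.isLineBreak ? "<br>" : t.text)));
```

With this shape, the entity is rendered as two separate pieces that share the same entityId, one per line, with an explicit line break between them.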
Maybe the only real option is to remove line breaks absolutely everywhere until the user clicks a Wrap lines switch that causes the blockquote to get a horizontal scrollbar?