Remove unsupported characters from PDF text

https://redmine.teklia.com/issues/10933

build_transcription should filter out null characters (\0) as well as any character within the U+D800-U+DFFF range (UTF-16 surrogate characters), as they are not allowed anywhere in the Arkindex API. If, after trimming whitespace, the resulting text is an empty string, then this function should return None to skip the transcription.
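A minimal sketch of the filtering described above, in Python. The helper name `clean_text` and the constant `UNSUPPORTED_CHARS` are hypothetical; the actual signature and internals of `build_transcription` are not shown here, so this only illustrates the character removal and the empty-string check.

```python
import re
from typing import Optional

# Matches the null character and any code point in the
# surrogate range U+D800-U+DFFF.
UNSUPPORTED_CHARS = re.compile(r"[\x00\ud800-\udfff]")


def clean_text(text: str) -> Optional[str]:
    """Hypothetical helper: strip unsupported characters from PDF text.

    Returns None when nothing is left after trimming whitespace, so the
    caller can skip creating the transcription.
    """
    cleaned = UNSUPPORTED_CHARS.sub("", text).strip()
    return cleaned or None
```

Under these assumptions, `build_transcription` could call such a helper on the extracted text and return `None` early whenever the helper does.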