Decode without unk

Solene Tarride requested to merge decode-without-unk into master

Some tokens rarely or never appear in the training set, so they cannot be accurately recognized. To handle them, we map them to the <unk> token.

However, we need to prevent the network from predicting the <unk> token, as it is always incorrect.

Example of problematic decoding: https://demo.arkindex.org/element/bd15be49-8c40-47a9-a627-991ae3754209?highlight=ac2ab5b4-63ea-4916-8f65-db550a5fda58 => <unk> Mange ved jo digre si

We now remove the <unk> token from the hypothesis.
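A minimal sketch of this post-processing step, assuming the hypothesis is a list of string tokens (the function and variable names here are illustrative, not the actual implementation):

```python
UNK_TOKEN = "<unk>"

def remove_unk(tokens):
    """Return the hypothesis with every <unk> token filtered out."""
    return [tok for tok in tokens if tok != UNK_TOKEN]

# The problematic decoding from the example above:
hypothesis = ["<unk>", "Mange", "ved", "jo", "digre", "si"]
print(" ".join(remove_unk(hypothesis)))  # → Mange ved jo digre si
```

An alternative would be to mask the <unk> logit so the network can never emit it during decoding; filtering the hypothesis afterwards is the simpler change.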

Tested on https://demo.arkindex.org/element/bd15be49-8c40-47a9-a627-991ae3754209?highlight=ac2ab5b4-63ea-4916-8f65-db550a5fda58

Edited by Solene Tarride
