Half precision inference

This could reduce inference time significantly. Experiments have shown that performance doesn't suffer too much from it.

We could rely on torch.autocast. Be careful not to break CPU inference.