For my video, the inference outputs appear out of order starting at around frame 167. You can reproduce this by running `script/test_video.sh` from [this pull request](https://github.com/antonilo/unsupervised_detection/pull/24).