evaluation on partial test set #23
Comments
Hi @rahman-mdatiqur, I am happy to clarify this issue. TL;DR: The original source of [...]. Details: [...]
Hello @HYPJUDY, thanks again for your quick and wonderful response. It leads me to raise the following concern. Since you do not mention in your paper that you are evaluating on 210 videos instead of 213, how fair is it to compare your method in Table 2 against other SOTA methods that report results on all 213 test videos? In other words, does leaving those test videos out of the evaluation give you any advantage over the other SOTA methods in terms of mAP? I know that you are not removing the corresponding annotations from the ground-truth annotations located in https://github.com/HYPJUDY/Decouple-SSAD/tree/master/EvalKit/THUMOS14_evalkit_20150930/annotation, but I have not checked the evaluation script to see whether leaving some videos out of the evaluation set would be advantageous or disadvantageous. Can you please comment on this? Thanks in advance.
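To make the question concrete, here is a toy sketch (Python, not the official THUMOS'14 evaluation code) of how average precision reacts to the three unscored videos under two possible conventions: keeping their ground-truth segments as permanent misses, or scoring only the videos that appear in both the ground truth and the results. All detections and counts below are invented for illustration.

```python
# Toy sketch, NOT the official THUMOS'14 evaluation code: how average precision
# (AP) reacts to ground-truth segments of unscored videos under two conventions.
# All detections and GT counts below are invented purely for illustration.

def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: total GT segments."""
    detections = sorted(detections, key=lambda d: -d[0])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in detections:
        tp += int(is_tp)
        fp += int(not is_tp)
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # area under the P-R curve
        prev_recall = recall
    return ap

# Detections produced for the 210 evaluated videos (confidence, matched a GT segment?)
dets = [(0.9, True), (0.8, True), (0.6, False), (0.5, True)]

# Convention 1: the 3 missing videos keep their GT (say 2 segments) as permanent misses.
print("GT kept as misses  :", round(average_precision(dets, num_gt=3 + 2), 3))  # lower AP
# Convention 2: only videos present in both GT and results are scored.
print("GT intersected away:", round(average_precision(dets, num_gt=3), 3))      # higher AP
```

Under the first convention, omitting videos can only lower the score; under the second, the effect depends on how well the model would have done on them, which is exactly the ablation suggested in the reply below.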
Hi @rahman-mdatiqur, thanks for your good question.
It seems that the code only evaluates the videos that appear in both the ground truth and the detected results. So if the model produces good (bad) results for these three videos, then incorporating their results should make the mAP better (worse). If your code is ready, you can quickly validate this with ablation experiments (a minimal sketch follows this comment).
I think if the annotations of some videos are obviously wrong, then we should exclude them; otherwise the overall result is not correct, and the evaluation on these wrongly annotated videos is meaningless. I should have clarified the video number (210) in the paper. Thanks for the reminder.
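A minimal sketch of that ablation, assuming you have generated detections for all 213 test videos and that the detection file is plain text with the video name as its first whitespace-separated column (adjust to your actual output format); `run_thumos14_eval` below is only a placeholder for however you invoke the unmodified evaluation kit:

```python
# Minimal ablation sketch: score the same detections twice, once with and once
# without the three videos, and compare the reported mAP. The file names, the
# detection-file format, and run_thumos14_eval() are assumptions/placeholders.

EXCLUDED = {"video_test_0000270", "video_test_0001292", "video_test_0001496"}

def filter_detections(src_path, dst_path, drop_videos):
    """Copy detection lines, dropping those whose video name is in drop_videos."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields and fields[0] not in drop_videos:
                dst.write(line)

filter_detections("detections_all213.txt", "detections_210.txt", EXCLUDED)

# Then run the (unmodified) evaluation on both files and compare:
#   map_213 = run_thumos14_eval("detections_all213.txt")  # placeholder call
#   map_210 = run_thumos14_eval("detections_210.txt")     # placeholder call
#   print(map_213 - map_210)  # positive -> the three videos helped; negative -> hurt
```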
Thanks @HYPJUDY for suggesting ways to evaluate the effect of excluding videos from the predictions list. As you said, since doing well (poorly) on these videos may improve (degrade) the final mAP, and since the SOTA methods report results on all 213 videos without making any modifications to the ground-truth annotations, I believe new methods should follow the same protocol when comparing with SOTA methods, or state the number of evaluated videos when comparing. Thanks much for all the thoughts and helpful feedback.
You are welcome!
Hello @HYPJUDY,
It seems that you are not evaluating on the full THUMOS'14 test set. As you report in your paper, the THUMOS'14 detection task is evaluated on 213 test videos. However, your test window_info.log file is missing window info for the following 3 test videos, as your thumos14_test_annotation.csv is missing annotations for them. As a result, you are effectively evaluating your model on 210 test videos instead of 213 (see the short check at the end of this post):
video_test_0000270
video_test_0001292
video_test_0001496
Can you please comment on why this is the case?
Thanks much.
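P.S. The missing videos can be confirmed with a short script. It only assumes a text file listing the 213 test video names, one per line (`test_video_list.txt` below is a hypothetical name); it then scans the repository files for any mention of each name:

```python
# Quick check: which THUMOS'14 test videos never appear in the repo's files?
# "test_video_list.txt" (one video name per line, 213 entries) is a hypothetical
# input; substitute whatever list of test video names you use.
import re

def videos_in_file(path):
    """Return the set of video_test_* names mentioned anywhere in the file."""
    with open(path) as f:
        return set(re.findall(r"video_test_\d{7}", f.read()))

with open("test_video_list.txt") as f:
    all_test_videos = {line.strip() for line in f if line.strip()}

for path in ("thumos14_test_annotation.csv", "window_info.log"):
    missing = sorted(all_test_videos - videos_in_file(path))
    print(f"{path}: {len(missing)} missing -> {missing}")
```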