Clarifying Details to Reproduce on THUMOS14 #23
A1. It is true, this approach will lead to over-generalization of long action instances. We need to change the inference script so that it does not include the filling operation. This is a drawback, as we do not have start/end regressors: the start/end of the action is the start/end of the mask. A2. Normally, every action follows a Gaussian curve, i.e., the action onset/offset (boundary regions) have a high chance of misclassification, whereas the centre of the action segment has a higher chance of being classified correctly. Since our snippet duration is small, we approximate the GT closer to the centre to avoid misclassifications, which might particularly hurt in the zero-shot setting.
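For concreteness, a minimal sketch of the boundary shrinking described in A2, assuming a GT segment given as snippet indices; `make_snippet_mask`, `start_idx`, and `end_idx` are hypothetical names, not identifiers from the repository:

```python
import numpy as np

def make_snippet_mask(start_idx, end_idx, num_snippets, shrink=1):
    """Build a 1-D foreground mask for one GT action segment.

    Shrinking the segment by `shrink` snippets on each side keeps the
    supervision closer to the action centre, where classification is
    most reliable; boundary snippets are the most ambiguous.
    """
    mask = np.zeros(num_snippets, dtype=np.float32)
    s = min(start_idx + shrink, end_idx)  # move the start inward
    e = max(end_idx - shrink, s)          # move the end inward
    mask[s:e + 1] = 1.0
    return mask

# A segment spanning snippets 10..20 is supervised as 11..19.
print(make_snippet_mask(10, 20, 100).nonzero()[0])
```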
Thank you for your reply! Q1. Then, when you obtained the THUMOS14 results in the paper, did you use the filling operation? I really appreciate your clarification to help me reproduce the result on THUMOS14. Q3. Questions regarding the class-agnostic representation masking step. (Code about Q3-2) https://github.com/sauradip/STALE/blob/main/stale_model.py#L213
(Code about Q3-3) https://github.com/sauradip/STALE/blob/main/stale_model.py#L58
In the figure of the paper, there are several action queries used for mask decoding.
A1. I did not use filling for THUMOS. A3-1. Both can be used; yes, for ActivityNet it comes out more consistent than the MaskFormer output (if I use 1 query). If I use more queries, then the MaskFormer output can be used there. The reason for using 1 query is memory constraints on the GPU: I was testing this on 1 GPU, and increasing the number of queries is heavy on compute. One important change you need to make if you have > 1 query is to pass through an extra 1-D conv to map from many to one, i.e. N x T x D to 1 x T x D (see the sketch below). These operations are not done in this code due to memory constraints. A3-2. We select those probabilities that are greater than the mean of the temporal probabilities. I observed that since ActivityNet videos are short with long foreground, this approach empirically covers the majority of the foreground; however, for THUMOS, where the videos are long with short foreground, it may not work, and you may need a higher threshold to classify the foreground indexes. I used 0.55 for THUMOS. A3-3. I partly answered this in A3-1. You can use multiple queries for THUMOS; we used 30 queries in our testing version for THUMOS.
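A hedged sketch of the many-to-one mapping mentioned in A3-1; this is not part of the released code, and all shapes and names are illustrative. A kernel-size-1 `Conv1d` over the query axis is simply a learned linear combination of the N queries, shared across time and feature dimensions:

```python
import torch
import torch.nn as nn

class QueryReducer(nn.Module):
    """Collapse N query embeddings (N x T x D) into a single one (1 x T x D)."""
    def __init__(self, num_queries):
        super().__init__()
        self.reduce = nn.Conv1d(num_queries, 1, kernel_size=1)

    def forward(self, x):                         # x: (B, N, T, D)
        b, n, t, d = x.shape
        x = self.reduce(x.view(b, n, t * d))      # learned mix of queries -> (B, 1, T*D)
        return x.view(b, 1, t, d)                 # (B, 1, T, D)

queries = torch.randn(2, 30, 100, 256)   # e.g. 30 queries, as used for THUMOS
print(QueryReducer(30)(queries).shape)   # torch.Size([2, 1, 100, 256])
```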
Thank you for your reply! Q3-1. The output of the MaskFormer code you provided is a foreground logit score whose shape is B x nq x L [batch_size x num_queries x num_features (= video length)].
Q4. I would also like to know the details of the "stack of three 1-D dynamic convolution layers H_m" mentioned on page 8, such as the kernel size, K values, and other dynamic-convolution parameters. Although the paper states that the details are in the supplementary material, the link on the ECCV'22 webpage points to your main paper, not the supplementary material. Sorry for taking up your time. Q5. For the label assignment, only one class label can be assigned to a snippet and a video. However, some THUMOS14 videos include overlapping action instances of different classes. Is it correct that this work does not consider such a case?
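Regarding Q4: since the supplementary details are unavailable, here is only a generic sketch of what a 1-D dynamic convolution layer can look like, with the kernel taps predicted from the input rather than fixed. The kernel size, channel width, and every other hyper-parameter below are illustrative guesses, not the paper's values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    """Generic 1-D dynamic convolution: a per-time-step depthwise kernel
    is predicted from the input features, then applied to local patches."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # predicts a depthwise kernel of size k for every time step
        self.kernel_gen = nn.Conv1d(channels, channels * kernel_size, 1)

    def forward(self, x):                                   # x: (B, C, T)
        b, c, t = x.shape
        w = self.kernel_gen(x).view(b, c, self.k, t)        # per-step kernels
        w = F.softmax(w, dim=2)                             # normalise the taps
        pad = self.k // 2
        patches = F.unfold(x.unsqueeze(2), (1, self.k), padding=(0, pad))
        patches = patches.view(b, c, self.k, t)             # (B, C, k, T)
        return (w * patches).sum(dim=2)                     # (B, C, T)

# The paper describes a stack of three such layers; the width is a guess.
h_m = nn.Sequential(DynamicConv1d(256), DynamicConv1d(256), DynamicConv1d(256))
print(h_m(torch.randn(2, 256, 100)).shape)   # torch.Size([2, 256, 100])
```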
Based on the available information, I can only achieve 2 mAP on THUMOS14 in the closed setting. The main bottleneck seems to be training the action-mask localizer, which fails to be supervised properly with the dynamic conv and the losses you stated in the paper. I think predicting the global mask (only the action instance at the current time) is a very hard task with only convolution layers. Also, there are too many background masks in THUMOS14, which hinders the learning of the model as well. May I ask how you dealt with this problem, and how you achieved 44.6 mAP on THUMOS14? Also, why don't you use the MaskFormer foreground mask directly as the output mask? It seems to work very well. I have attached the results of my implementation below.
Hi, your 2D action mask (250x250) does not look that bad: I can see the initial two masks at the top left, some in the middle, and then some at the bottom right, which is how the 1D GT action mask looks. The 2D mask is not expected to be clean! It does not have zero probability for the background, so noise is to be expected. How to clean the noise is, you could say, a trick. Some tricks, sketched in code below: an action segment in THUMOS cannot be shorter than 2 snippets or longer than 50 snippets; check the scores of one or two rows from the 2D mask and see what probability they show for the foreground (one thing I remember is that mean() does not work for THUMOS thresholding because of the majority background); you can use the soft 1D mask to check the result and let me know how good it is.
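The tricks above can be expressed as a small post-processing routine. This is a hedged sketch, not the repository's code: a fixed threshold (e.g., the 0.55 mentioned earlier, instead of mean(), which under-thresholds on majority-background THUMOS videos) plus minimum/maximum segment-length filters:

```python
import numpy as np

def mask_to_segments(fg_prob, thresh=0.55, min_len=2, max_len=50):
    """Turn a soft 1-D foreground mask into (start, end) snippet segments.

    fg_prob : (T,) array of per-snippet foreground probabilities.
    Segments shorter than min_len or longer than max_len snippets
    are dropped, per the heuristics described above.
    """
    fg = fg_prob > thresh
    segments, start = [], None
    for t, is_fg in enumerate(fg.tolist() + [False]):  # sentinel flushes the last run
        if is_fg and start is None:
            start = t
        elif not is_fg and start is not None:
            if min_len <= t - start <= max_len:
                segments.append((start, t - 1))
            start = None
    return segments

probs = np.array([0.1, 0.7, 0.8, 0.2, 0.9, 0.9, 0.9, 0.1])
print(mask_to_segments(probs))   # [(1, 2), (4, 6)]
```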
Thanks for getting back to me! I'm a bit worried about the big differences between the predicted and GT mask maps. In each column, there seem to be too many points marked as foreground, even in the background columns. Also, the parts you mentioned (e.g., top-left and middle) have lower foreground scores than other points in their columns, leading to inaccurate localization and suppression during Soft-NMS. I'm starting to wonder whether we can really get the same results as the paper claims... I planned to experiment with this model on other TAL datasets, but even reproducing the results on THUMOS14 is proving challenging.
Hi, I recommend you check with the UNet (UntrimmedNet) score as a class-score refinement, just as is done here for ActivityNet post-processing, since it is a standard followed for fair comparison. You can find the UNet scores for THUMOS in the G-TAD repository and just paste them into the STALE best-score JSON to check.
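A hedged sketch of the kind of refinement being suggested, assuming you have per-class video-level scores (e.g., the UNet scores shipped with the G-TAD repository). The fusion rule and `alpha` are illustrative, not STALE's exact post-processing:

```python
def refine_with_cls_scores(detections, cls_scores, class_ids, alpha=0.5):
    """Fuse detection scores with external video-level class scores.

    detections : list of {"label": str, "score": float, "segment": [s, e]}
                 for one video (STALE result-JSON style).
    cls_scores : per-class probabilities for the same video.
    class_ids  : {class_name: column index into cls_scores}.
    """
    refined = []
    for det in detections:
        cls_p = cls_scores[class_ids[det["label"]]]
        refined.append({
            **det,
            # geometric fusion; alpha balances localisation vs. class confidence
            "score": det["score"] ** alpha * cls_p ** (1 - alpha),
        })
    return refined
```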
@HYUNJS Hi, can we exchange notes? I can achieve 20 mAP on the 50:50 split using a 2-stage approach (I3D features + UNet results), but replacing UNet with CLIP drops the result to 10 mAP. I would like to know how to achieve a higher result without UNet.
Hi everyone :)
While reproducing the accuracy on the THUMOS14 dataset, I found some of your implementation details confusing. I would really appreciate your clarification so that I can reproduce the results.
Q1.
At inference time, segments above the threshold are connected to form one large segment, as shown in the figure below. Although this is an effective post-processing method for the ActivityNet dataset, it is not appropriate for THUMOS14, which has many short action instances rather than one or two long ones.
https://github.com/sauradip/STALE/blob/main/stale_inference.py#L156
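To make the concern concrete, a small sketch of the two behaviours (assumed semantics, not a copy of `stale_inference.py`): filling spans from the first to the last above-threshold snippet, which is reasonable for ActivityNet's one or two long instances but merges THUMOS14's many short ones, whereas splitting per contiguous run keeps them apart:

```python
import numpy as np

fg = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0], dtype=bool)  # thresholded 1-D mask

# "Filling": one segment from the first to the last foreground snippet.
idx = np.flatnonzero(fg)
filled = [(int(idx[0]), int(idx[-1]))]           # [(1, 9)]: one long segment

# Without filling: each contiguous foreground run is its own segment.
pad = np.concatenate(([0], fg.astype(np.int8), [0]))
edges = np.flatnonzero(np.diff(pad))             # rising/falling edges, paired
runs = [(int(s), int(e - 1)) for s, e in zip(edges[::2], edges[1::2])]
print(filled, runs)                              # [(1, 9)] [(1, 2), (5, 5), (7, 9)]
```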
Q2.
In the dataset builder, why do you add 1 to the start index and subtract 1 from the end index?
https://github.com/sauradip/STALE/blob/main/stale_lib/stale_dataloader.py#L188