Training NMN model on full DROP (with numerical answers) #10
Our current model is only capable of handling limited numerical operations, such as comparisons, min/max, and counting. Since DROP requires a much more diverse set of reasoning, it is not surprising to see such low performance on all of the number-answer DROP questions. The program and aux-module-output supervision is generated for a few types of questions (the ones in the ICLR paper) based on heuristics, and hence you only see supervision for a partial set.
Regarding "the final model is being able to only use these for training" -- I suspect this is happening due to curriculum learning; by default, training is restricted to the supervised instances for the first few epochs (controlled by "filter_for_epochs"). Your scripts for generating the data seem fine; the issue is the limitation of the current model in its reasoning capability. Hope this helps. Please let me know if you have additional questions.
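To make the curriculum point concrete, here is a minimal sketch of epoch-based instance filtering. This is illustrative only, not the repo's actual code; the "filter_for_epochs" name and the supervision flag names are taken from this thread, everything else is assumed.

```python
from typing import Dict, List

SUPERVISION_FLAGS = ("program_supervised", "qattn_supervised", "execution_supervised")

def instances_for_epoch(instances: List[Dict],
                        epoch: int,
                        filter_for_epochs: int = 5) -> List[Dict]:
    """Curriculum filtering: for the first `filter_for_epochs` epochs, train only on
    instances that carry some auxiliary supervision; afterwards, use the full set."""
    if epoch < filter_for_epochs:
        return [ins for ins in instances
                if any(ins.get(flag) for flag in SUPERVISION_FLAGS)]
    return instances
```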
Also see our ACL paper on why performance on the ICLR subset very likely gets worse when you train the model on all of these questions. (As ACL is ongoing right now, here's a link to the conference page for the paper.)
Thanks a lot for clarifying my queries, and thanks for the pointer to the ACL work! This was really helpful.
That indeed is strange. Did you try evaluating the full model on the numerical-answer subset? Could you share your data filtering script, and I can try to train it myself and see what the issue might be.
Yes, by evaluating the pretrained model on the numerical-answer subset I got ~65 F1. My data filtering script simply keeps questions based on the gold answer, i.e. those in the raw DROP JSON file where answer["number"] is a non-empty string. It would be great if you could kindly check whether you get a similar result.
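For reference, a minimal sketch of this kind of filter, assuming the standard DROP JSON layout (passage id mapped to a dict with "passage" and "qa_pairs", where each QA pair has an "answer" dict containing "number"); the file paths are placeholders:

```python
import json

def filter_numerical_answers(in_path: str, out_path: str) -> None:
    """Keep only the QA pairs whose gold answer is a number (answer["number"] != "")."""
    with open(in_path) as f:
        data = json.load(f)

    filtered = {}
    for passage_id, passage_info in data.items():
        qa_pairs = [qa for qa in passage_info["qa_pairs"]
                    if qa["answer"].get("number", "") != ""]
        if qa_pairs:  # drop passages that lose all their questions
            filtered[passage_id] = {**passage_info, "qa_pairs": qa_pairs}

    with open(out_path, "w") as f:
        json.dump(filtered, f)

# e.g. filter_numerical_answers("drop_dataset_train.json", "drop_train_num.json")
```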
Sure, I'll try and let you know.
Hi, I was able to get 58% on the DROP numerical-answer dev set when all types of supervision are in place. I think the issue was that even though "filter_for_epochs" was set to 5, it was still using the filtered instances after epoch 5. Thanks for all your help!
That sounds great. That is really strange; two questions: (a) how did you fix it? (b) I am actually getting a dev performance of 48 F1, with ~47.3k training / ~5.8k dev instances. Does that seem right?
(b) Yes, the training and dev data sizes are right. Is this 48 F1 with all the types of supervision in place? And this is after running how many epochs?
(b) I think yes; I did a quick fix by only taking in the HowManyYards, YearDiff, and Count questions w/ supervision. Maybe I made a mistake there somehow. This was after ~17 epochs. (a) That is strange; I don't see a reason for that happening. But glad you figured it out.
I have some queries regarding retraining your model on the subset of DROP that has only numerical answers. This subset has ~45K training instances, but the preprocessing scripts are able to generate additional supervision for only ~12K of them, and the final model ends up using only these for training (the remaining instances are not used).
Specifically, I am interested in training a version without 'execution_supervised' or 'qattn_supervised' supervision. In all of the cases (removing one of these supervisions or both), the validation results are quite poor (around 13-15% Exact Match). Trying different learning rates and beam sizes also did not help.
Also, with all the kinds of supervision turned on in the config file, on this subset of data I am getting poor validation performance of ~20% (Exact Match). Does this sound reasonable, or am I doing something wrong? It would be great if you could shed some light on this.
I am using the tokenize.py code, followed by the preprocessing scripts in datasets/drop/preprocess, to generate the different types of supervision in the training data, and then merge_data.py to get the final training data.
At the end, I get the following supervision counts for the training data: {'program_supervised': 9669, 'qattn_supervised': 7260, 'execution_supervised': 1750}.
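For context, these counts are just a tally over the merged training file. A rough sketch of that kind of counting, assuming the preprocessed JSON keeps the DROP-style layout and each QA pair carries boolean supervision flags with the field names printed above (the file path is a placeholder):

```python
import json
from collections import Counter

def count_supervision(merged_path: str) -> Counter:
    """Tally how many questions carry each type of auxiliary supervision."""
    with open(merged_path) as f:
        data = json.load(f)

    counts = Counter()
    for passage_info in data.values():
        for qa in passage_info["qa_pairs"]:
            for key in ("program_supervised", "qattn_supervised", "execution_supervised"):
                if qa.get(key):
                    counts[key] += 1
    return counts

# e.g. print(count_supervision("drop_train_merged.json"))
```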
For your reference, I am attaching the scripts I used for generating this supervision for the training data. Could you kindly let me know if I am doing it the right way?
generate_annotations.txt