(🫵 Input needed) Think about how using an RLHF-style proxy reward on top of the VLM influences our plans #6
Comments
My thoughts:
It seems that 4 and 5 are (very) nice to have for the presentation, but they depend on 1, 2, and 3, so we should focus on those first anyway. However, rather than spending a lot of time on making 3 as nice as possible, for example, we could do 4 and 5 instead.
I think there is no sense in creating preference pairs from raw scores of individually evaluated videos -- the whole point of using pairs is that humans (and supposedly VLMs) are often more consistent at comparing two images/videos than at assigning a score from 0 to 1. If we compute scores individually, we lose this effect. Also, I think that if our data collection procedure is "gather pairs (trajectory, its thorough text description)", then we can easily pivot to a comparison-based approach later.
I'm not sure I agree with the first part, though we probably do agree on all the practical issues (the second part), which is what matters. Let's focus on 1, 2, 3 from my original comment and see about 4, 5 later. As to why I don't agree: even when feeding the model single examples, I can see a model being bad at producing raw numbers, e.g. because it is noisy, but being good at producing an ordering, especially if we train another function on top of it (which presumably gets rid of some of the noise in the orderings, too).
If we are to believe the RL-VLM-F paper, then once we get to actual training we might want to use the VLM to train a proxy reward function rather than using the VLM directly as the reward.
The proxy reward is then trained by querying the VLM for preferences over pairs of images (videos, in our case).
The fact that we would care about the VLM's preferences more than the actual numbers we get from it might influence how we generate the trajectories #3 and visualize the results #5, hence I created this issue.
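For concreteness, here is a minimal sketch (not our pipeline, just an illustration of the RL-VLM-F-style idea) of training a proxy reward model from VLM pairwise preferences with a Bradley-Terry loss. All names here (`RewardModel`, `query_vlm_preference`, the feature representation) are hypothetical placeholders, and the assumption that each clip is already summarized as a fixed-size feature vector is mine:

```python
# Sketch: RLHF-style proxy reward trained from VLM pairwise preferences.
# Assumptions (hypothetical): each trajectory/video clip is represented by a
# fixed-size feature vector, and the VLM answers "which clip is better?" with
# a 0/1 label. This is illustrative, not our actual code.

import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Small MLP mapping a clip feature vector to a scalar reward."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def preference_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                    pref_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: P(a preferred over b) = sigmoid(r_a - r_b)."""
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, pref_a)


def query_vlm_preference(clip_a, clip_b) -> float:
    """Placeholder: ask the VLM which clip better matches the task.

    In practice this would render both clips, send them to the VLM with a
    comparison prompt, and parse the answer into 1.0 (a preferred), 0.0
    (b preferred), or 0.5 (tie).
    """
    raise NotImplementedError


def train_step(model: RewardModel, optimizer: torch.optim.Optimizer,
               feats_a: torch.Tensor, feats_b: torch.Tensor,
               prefs_a: torch.Tensor) -> float:
    """One gradient step on a batch of (clip_a, clip_b, VLM preference) triples."""
    optimizer.zero_grad()
    loss = preference_loss(model(feats_a), model(feats_b), prefs_a)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch for this issue: the VLM is only ever asked for comparisons, and the learned reward model (which the RL loop would actually use) absorbs some of the noise in those comparisons, which is why trajectory generation (#3) and result visualization (#5) should be organized around pairs rather than individual scores.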