
(🫵 Input needed) Think about how using a RLHF-style proxy reward on top of the VLM influences our plans #6

Closed
Eugleo opened this issue Mar 4, 2024 · 3 comments

Comments

@Eugleo
Owner

Eugleo commented Mar 4, 2024

If we are to believe the RL-VLM-F paper, then once we get to actual training we might want to use the VLM to train a proxy reward function instead of using the VLM directly as the reward.

The proxy reward is then learned from the VLM's preferences over pairs of images (videos, in our case).

The fact that we care about the VLM's preferences more than the actual numbers it outputs might influence how we generate the trajectories (#3) and visualize the results (#5), hence this issue.
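To make the setup concrete, here is a minimal sketch of what an RL-VLM-F-style proxy reward could look like on our side. All names, shapes, and the embedding-based reward head are hypothetical placeholders, not anything from our codebase; the only fixed idea is that the VLM supplies pairwise preference labels and a small reward model is fit on them with a Bradley-Terry / cross-entropy loss.

```python
import torch
import torch.nn as nn

class ProxyReward(nn.Module):
    """Tiny reward head on top of precomputed trajectory embeddings (hypothetical shapes)."""
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)  # one scalar reward per trajectory

def preference_loss(reward_a: torch.Tensor, reward_b: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: label = 1.0 if the VLM prefers A, 0.0 if B, 0.5 for ties."""
    return nn.functional.binary_cross_entropy_with_logits(reward_a - reward_b, label)

# One optimisation step on a batch of VLM-labelled pairs (placeholder data).
model = ProxyReward()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_a, emb_b = torch.randn(32, 512), torch.randn(32, 512)  # stand-ins for video embeddings
label = torch.randint(0, 2, (32,)).float()                 # stand-in for VLM preferences
loss = preference_loss(model(emb_a), model(emb_b), label)
opt.zero_grad(); loss.backward(); opt.step()
```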

@Eugleo
Owner Author

Eugleo commented Mar 4, 2024

My thoughts:

  1. We generate a bunch of trajectories where a task is and isn't being done.
  2. For each trajectory we obtain a score from the model (e.g. similarity to the description of the task).
  3. We report the raw scores in a nice plot (this is what we're doing now).
  4. We find a principled way to turn the raw scores of two videos into a preference label for the pair (-1, 0, 1); a rough sketch is below.
  5. We report the statistics of the preferences in a nice plot, too.

Steps 4 and 5 seem (very) nice to have for the presentation, but they depend on 1–3, so we should focus on those first anyway. However, rather than spending a lot of time on making 3 as nice as possible, for example, we could do 4 and 5 instead.
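For step 4, one possible (not necessarily principled) conversion, just to anchor the discussion: compare the two raw scores with a tolerance margin. The margin `eps` is an assumption we would have to tune, not something we've settled on.

```python
def preference_from_scores(score_a: float, score_b: float, eps: float = 0.05) -> int:
    """Turn two raw VLM scores into a preference label.

    Returns 1 if A is preferred, -1 if B is preferred, and 0 if the scores
    are within `eps` of each other (treated as a tie). `eps` is a
    hypothetical tolerance, not a value we have chosen.
    """
    diff = score_a - score_b
    if diff > eps:
        return 1
    if diff < -eps:
        return -1
    return 0

# e.g. preference_from_scores(0.81, 0.42) -> 1, preference_from_scores(0.50, 0.52) -> 0
```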

@Eugleo Eugleo moved this from Todo to Waiting for Input in MATS Mar 4, 2024
@Eugleo Eugleo moved this from Waiting for Input to Priority in MATS Mar 4, 2024
@Eugleo Eugleo changed the title Think about how best to test whether a model could be used for preference-based RLHF-style training (Input needed) Think about how using a RLHF-style proxy reward on top of the VLM influences our plans Mar 4, 2024
@Eugleo Eugleo changed the title (Input needed) Think about how using a RLHF-style proxy reward on top of the VLM influences our plans (⚠️ Input needed) Think about how using a RLHF-style proxy reward on top of the VLM influences our plans Mar 5, 2024
@Eugleo Eugleo changed the title (⚠️ Input needed) Think about how using a RLHF-style proxy reward on top of the VLM influences our plans (🫵 Input needed) Think about how using a RLHF-style proxy reward on top of the VLM influences our plans Mar 5, 2024
@Dont-Care-Didnt-Ask
Collaborator

I think it makes no sense to create preference pairs from raw scores of individually evaluated videos: the whole point of using pairs is that humans (and presumably VLMs) are often more consistent when comparing two images/videos than when assigning each one a score from 0 to 1. If we compute the scores individually, we lose this effect.

Also, I think that if our data collection procedure is "gather pairs of (trajectory, its thorough text description)", then we can easily pivot to a comparison-based approach later.

@Eugleo
Owner Author

Eugleo commented Mar 5, 2024

I'm not sure I agree with the first part, though we probably do agree on all the practical issues (the second part), which is what matters. Let's focus on 1, 2, 3 from my original comment and decide about 4 and 5 later.

As to why I don't agree: even when feeding the model single examples, I can see the model being bad at producing raw numbers (e.g. because they are noisy) while still being good at producing an ordering, especially if we train another function on top of it (which presumably gets rid of some of the noise in the orderings, too).
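A toy illustration of that point (purely synthetic numbers, not a claim about the actual VLM): even when per-video scores carry a lot of additive noise, the induced ordering can stay largely intact, which is exactly what a function trained on preferences over those scores would exploit.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_quality = np.linspace(0.0, 1.0, 50)                # hypothetical "true" task progress
noisy_scores = true_quality + rng.normal(0, 0.15, 50)   # what a noisy VLM might output

# The raw values are off by a lot on average...
print("mean abs error:", np.abs(noisy_scores - true_quality).mean())
# ...but the ordering is still mostly preserved.
rho, _ = spearmanr(true_quality, noisy_scores)
print("rank correlation:", rho)
```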

@Eugleo Eugleo closed this as completed Mar 7, 2024
@github-project-automation github-project-automation bot moved this from Priority to Done in MATS Mar 7, 2024