(🫵 Input needed) Think about how using an RLHF-style proxy reward on top of the VLM influences our plans #6
Comments
My thoughts:
It seems that 4 and 5 are (very) nice to have for the presentation, but they depend on 1, 2, and 3, so we should focus on those first anyway. However, rather than spending a lot of time on making 3 as nice as possible, for example, we could do 4 and 5 instead.
I think there is no sense in creating preference pairs from raw scores of individually evaluated videos -- the whole point of using pairs is that humans (and supposedly VLMs) are often more consistent at comparing two images/videos than at assigning a score from 0 to 1. If we compute scores individually, we lose this effect. Also, I think that if our data collection procedure is "gather pairs (trajectory, its thorough text description)", then we can easily pivot to a comparison-based approach later.
I'm not sure I agree with the first part, though we probably do agree on all the practical issues (the second part), which is what matters. Let's focus on 1, 2, 3 from my original comment and see about 4, 5 later. As to why I don't agree: even when feeding the model single examples, I can see a model being bad at producing raw numbers, e.g. because it is noisy, but being good at producing an ordering, especially if we train another function on top of it (which presumably gets rid of some of the noise in the orderings, too).
If we are to believe the RL-VLM-F paper, then once we get to actual training we might want to use the VLM to train a proxy reward function rather than using the VLM directly as the reward.
The proxy reward is then trained by querying the VLM for preferences over pairs of images (videos, in our case).
The fact that we would care about the VLM's preferences more than the actual numbers we get from it might influence how we generate the trajectories #3 and visualize the results #5, hence I created this issue.
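For concreteness, here is a minimal sketch (not our pipeline, just an illustration of the RL-VLM-F-style idea) of training a proxy reward model from VLM pairwise preferences with a Bradley-Terry loss. All names here (`RewardModel`, `query_vlm_preference`, the feature representation) are hypothetical placeholders, and the assumption that each clip is already summarized as a fixed-size feature vector is mine:

```python
# Sketch: RLHF-style proxy reward trained from VLM pairwise preferences.
# Assumptions (hypothetical): each trajectory/video clip is represented by a
# fixed-size feature vector, and the VLM answers "which clip is better?" with
# a 0/1 label. This is illustrative, not our actual code.

import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Small MLP mapping a clip feature vector to a scalar reward."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def preference_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                    pref_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: P(a preferred over b) = sigmoid(r_a - r_b)."""
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, pref_a)


def query_vlm_preference(clip_a, clip_b) -> float:
    """Placeholder: ask the VLM which clip better matches the task.

    In practice this would render both clips, send them to the VLM with a
    comparison prompt, and parse the answer into 1.0 (a preferred), 0.0
    (b preferred), or 0.5 (tie).
    """
    raise NotImplementedError


def train_step(model: RewardModel, optimizer: torch.optim.Optimizer,
               feats_a: torch.Tensor, feats_b: torch.Tensor,
               prefs_a: torch.Tensor) -> float:
    """One gradient step on a batch of (clip_a, clip_b, VLM preference) triples."""
    optimizer.zero_grad()
    loss = preference_loss(model(feats_a), model(feats_b), prefs_a)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch for this issue: the VLM is only ever asked for comparisons, and the learned reward model (which the RL loop would actually use) absorbs some of the noise in those comparisons, which is why trajectory generation (#3) and result visualization (#5) should be organized around pairs rather than individual scores.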