Damage not being handled correctly #158

Closed
Nefsen402 opened this issue Jul 24, 2021 · 12 comments

Comments

@Nefsen402

When recording the screen using PipeWire, and only when using certain clients, every other frame is dropped and not visible in the output. I have recorded a video with OBS and the foot terminal on Arch Linux to demonstrate the issue. At first, I am typing 'a' in a consistent rhythm, but you only see 'aa' pop up, meaning the frame between the two keystrokes was skipped. The issue exists with both OBS and WebRTC screen capture on Chromium.

Not all clients exhibit this behavior; I have only ever seen foot consistently have this issue. Foot may be affected because it commits new wl_buffers only when needed, while many other clients either commit multiple wl_buffers per frame or just commit multiple times, but that is of course just speculation.

zwlr_screencopy_frame_v1_copy_with_damage(frame, cast->simple_frame.buffer);

Rummaging around a little, I have found that after replacing copy_with_damage with copy, the issue seemingly goes away. Although that may be because the render loop continuously sends frames, so maybe the bug still exists but now manifests as added latency. This also does not seem like a wlroots issue, because wf-recorder has the correct behavior even though it uses copy_with_damage under the hood. My monitor normally runs at 144 Hz, so maybe it's dropping every other frame to try to reach 60 Hz? That issue still exists when I force the monitor to run at 60 Hz, though.
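
For reference, the experiment above just swaps one screencopy request for the other on the frame object; both take the frame and the destination wl_buffer:

/* Either: damage-driven copy; the compositor signals this frame only once
 * the output has new damage. */
zwlr_screencopy_frame_v1_copy_with_damage(frame, cast->simple_frame.buffer);

/* Or: unconditional copy; fires on the next output frame regardless of
 * damage, at the cost of copying unchanged content. */
zwlr_screencopy_frame_v1_copy(frame, cast->simple_frame.buffer);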

2021-07-24.17-52-42.mp4

Additional information:
wlroots/sway: 0.14.1 / 1.6.1
PipeWire: 0.3.32
xdg-desktop-portal-wlr: freshly compiled at commit 6cc3a01
GPU: Radeon RX 6900 XT
OS: Arch Linux

@Nefsen402
Author

I made some progress on trying to fix this. It looks like PipeWire buffers are not making their way back from the sink fast enough to be refilled with data. Consider this:

frame 1:
  initiate frame capture
  dequeue pipewire buffer
  copy frame
  queue pipewire buffer
frame 2:
  initiate frame capture
  dequeue pipewire buffer - !!!! fails because the sink is not done with the frame yet and the queue is still empty
  skip copy frame
# What happened in frame 2 can repeat for an indeterminate amount of time until the sink requeues the buffer
frame n:
# back to normal, there is a frame in the queue to write to

So what I think is happening is that when a frame is skipped because there is no buffer available to copy into, the damage state gets cleared anyway. That means the next time it tries to do a frame copy, it won't get the damage notification, since the damage was already cleared by the failed frame beforehand.

There are two things that can be done:

  • On the wlroots side, we could clear damage only when copy_with_damage is called, not when a zwlr_screencopy_frame_v1 is created (I'm not crystal clear about how this all works yet).
  • Probably what we should be doing: just pause the capture loop until PipeWire has a buffer available for us (see the sketch below).
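
A minimal sketch of that second option, assuming a helper xdpw_wlr_frame_start() that creates the zwlr_screencopy_frame_v1 and requests the copy (the helper and the struct fields here are illustrative, not the actual code):

#include <pipewire/pipewire.h>

struct cast_state {
	struct pw_stream *stream;          /* the screencast pw_stream */
	struct pw_buffer *current_buffer;  /* buffer reserved for the next copy */
};

void xdpw_wlr_frame_start(struct cast_state *cast);  /* hypothetical helper */

static void try_start_capture(struct cast_state *cast) {
	/* Ask PipeWire for a buffer *before* starting a screencopy frame. */
	struct pw_buffer *buf = pw_stream_dequeue_buffer(cast->stream);
	if (!buf) {
		/* The sink hasn't returned a buffer yet: bail out without creating
		 * a frame, so the compositor keeps accumulating damage instead of
		 * clearing it for a copy we can't perform. */
		return;
	}
	cast->current_buffer = buf;
	xdpw_wlr_frame_start(cast);
}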

I am investigating using flush events in PipeWire. However, it seems that even after PipeWire fires the drained event, sometimes a buffer is still not available. In my testing though, up until the point where the capture loop stalls due to having no buffer, the bug seems to be fixed. Maybe we should try double/triple buffering?

@emersion Does it seem right that the damage state is being cleared prematurely? I just wanted to clarify so I don't go down a rabbit hole.

Nefsen402 changed the title from "Every other frame dropped in output for certain clients" to "Damage not being handled correctly" on Jul 25, 2021
@columbarius
Collaborator

columbarius commented Jul 25, 2021

@Need4Speed402
Just some general questions:
What client are you using and how many buffers are negotiated¹?

If you are using OBS, xdg-desktop-portal-wlr will usually settle on only one buffer, which gets blocked by OBS's asynchronous rendering.

If you want to try triple buffering, you can increase XDPW_PWR_BUFFERS to 3 in https://github.com/emersion/xdg-desktop-portal-wlr/blob/master/include/pipewire_screencast.h#L6.
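
For reference, that is a one-line change (3 here is just the value to experiment with, not necessarily the shipped default):

/* include/pipewire_screencast.h */
#define XDPW_PWR_BUFFERS 3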

Be aware that PipeWire clients will skip to the newest buffer in the queue if multiple are available.

¹ I've read you are using OBS and Chromium, so mostly part 2 is important for each.

@Nefsen402
Author

@columbarius I'm actually working on your columbarius:screencast-pw-driven branch. I cloned it hoping it would fix the problem, but it hasn't. Instead, I got a bunch of spam in the console:

pipewire: out of buffers
wlroots: failed to import buffer

Looking into that is what made me realize the problem. I'm working on this branch because it seems like that's the direction we want to go in anyway.

A part of me is telling me to scrap all the code. I'd love to get things like multi-monitor and sharing of a certain region working. I'm just not sure how I would coordinate with you guys. I work from home and use screen sharing often, especially on terminals, so I'm highly motivated to get all this stuff working. After all, it just seems like some basic state needs to get passed around; a buffer and some metadata, right? I might be flying high on mount stupid right now, though. I'm still trying to understand all the code before I make a rash decision.

@Nefsen402
Author

I figured out the race condition and here is the patch so far. It's built on @columbarius's screencast-pw-driven branch.
fix.txt

It only handles the happy path; I basically nuked the code that used to handle FPS throttling just to make things easier. I ended up double buffering and using flush events to coordinate everything. Setting the code to use one buffer makes the damage tracking problem return.
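
Roughly, the coordination uses pw_stream_flush() together with the drained stream event; a stripped-down sketch of the idea (cast_state and try_start_capture are the illustrative pieces from my earlier comment, not the real patch):

static void on_drained(void *data) {
	struct cast_state *cast = data;
	/* Everything we queued has been consumed by the sink, so a buffer can
	 * be dequeued again; kick off the next screencopy frame. */
	try_start_capture(cast);
}

static const struct pw_stream_events stream_events = {
	.version = PW_VERSION_STREAM_EVENTS,
	.drained = on_drained,
};

static void finish_frame(struct cast_state *cast, struct pw_buffer *buf) {
	/* Hand the filled buffer back to PipeWire, then ask it to drain so the
	 * drained callback fires once the sink has taken it. */
	pw_stream_queue_buffer(cast->stream, buf);
	pw_stream_flush(cast->stream, true);
}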

Foot terminal now renders correctly through OBS and Chromium. In fact, there's a bonus: the cursor ran into the exact same damage tracking issue as foot; foot just made it really obvious. Now the cursor renders at 144 Hz in OBS and every other client. Before, I assumed OBS was just asking for 60 Hz. Apparently not.

Now, for the question I'm having trouble answering: should I clean up the code and submit it, or should I do a more extensive refactor? I'm leaning towards a refactor, simply because errors never seemed to get handled that well anyway, and I'm not sure how to augment my hacky code to get proper error handling and cleanup.

2021-07-25.15-27-30.mp4

@columbarius
Collaborator

> @columbarius I'm actually working on your columbarius:screencast-pw-driven branch. I cloned it hoping it would fix the problem, but it hasn't. Instead, I got a bunch of spam in the console:

Thanks for testing my patch.

> pipewire: out of buffers
> wlroots: failed to import buffer

Yeah, this is a result of driving the PipeWire loop directly from the wlroots loop instead of passing control back and forth between the wlroots and the PipeWire loops. Increasing XDPW_PWR_BUFFERS should remove most pipewire: out of buffers events.

Could you please retry your tests with a higher XDPW_PWR_BUFFERS? That should remove those out-of-buffers events at the cost of some unused copies.

> Looking into that is what made me realize the problem. I'm working on this branch because it seems like that's the direction we want to go in anyway.

> A part of me is telling me to scrap all the code. I'd love to get things like multi-monitor and sharing of a certain region working. I'm just not sure how I would coordinate with you guys. I work from home and use screen sharing often, especially on terminals, so I'm highly motivated to get all this stuff working. After all, it just seems like some basic state needs to get passed around; a buffer and some metadata, right? I might be flying high on mount stupid right now, though. I'm still trying to understand all the code before I make a rash decision.

I started contributing to this project from a similar position a year ago. If you'd like to improve sharing of a certain region, you could take up #156 and add some way to set the region via the chooser or the config file.
Yeah, we mostly extract buffers from wlroots and pass them over to PipeWire, matching the metadata.

@columbarius
Collaborator

> I figured out the race condition and here is the patch so far. It's built on @columbarius's screencast-pw-driven branch.
> fix.txt

> ...

> Now, for the question I'm having trouble answering: should I clean up the code and submit it, or should I do a more extensive refactor? I'm leaning towards a refactor, simply because errors never seemed to get handled that well anyway, and I'm not sure how to augment my hacky code to get proper error handling and cleanup.

It would be nice if you could clean up your patch so it contains only the relevant changes. That would make it easier to review.

I tried to reproduce your issue, but couldn't find any skipped frames, and the out-of-buffers messages went away after increasing XDPW_PWR_BUFFERS. Maybe this issue is more likely to happen on screens with higher refresh rates. Could you please describe where you see the race condition?

Maybe you could create a script with something like ydotool to help me reproduce it?

@Nefsen402
Author

Nefsen402 commented Jul 26, 2021

I run two 144 Hz displays; I have quite an exotic setup that probably needs to push a lot of bandwidth. I'm a little surprised that my issue is not reproducible on other machines, since I've been having it ever since I started using sway a year ago. It's not as if my computer sometimes has a happy day and decides not to do it; it has been a consistent bug for me. Perhaps on a normal 1080p screen, the PipeWire buffers usually make their way back to the return queue faster, so running out of PipeWire buffers is rare? In that case the bottleneck is in the capturer, while in my situation it's in the sink.

> Yeah, this is a result of driving the PipeWire loop directly from the wlroots loop instead of passing control back and forth between the wlroots and the PipeWire loops. Increasing XDPW_PWR_BUFFERS should remove most pipewire: out of buffers events.

In my experience, it would complain about missing PipeWire buffers 20-30 times until it finally was able to grab one to capture into. It's not as if setting it to 3 would completely fix the problem; the real fix needs explicit synchronization. In my patch, I tried to do that with flush events, but that's not doing exactly what I want it to; sometimes it would still exhaust the PipeWire buffers and then start skipping frames like before. Apparently there should be a PipeWire process event, but for whatever reason it's not firing for me. It sounds like that should be reliable if I can get it working. My patch is nothing but a proof of concept and should not be taken seriously.

I can find an example of the process event being used for sinks here: https://gitlab.freedesktop.org/pipewire/pipewire/-/blob/master/src/examples/video-play.c
However, all the capture examples use a fixed loop, which is exactly what I don't want to do.
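
For comparison, registering the process event is roughly this; the on_process body here is only a placeholder for the dequeue/fill/queue cycle (cast_state is the illustrative struct from my earlier sketch, not the real code):

static void on_process(void *data) {
	struct cast_state *cast = data;
	/* PipeWire calls this when it wants data from us. */
	struct pw_buffer *buf = pw_stream_dequeue_buffer(cast->stream);
	if (!buf)
		return;
	/* ... fill buf->buffer with the captured frame here ... */
	pw_stream_queue_buffer(cast->stream, buf);
}

static const struct pw_stream_events stream_events = {
	.version = PW_VERSION_STREAM_EVENTS,
	.process = on_process,
};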

> Could you please describe where you see the race condition?

I was referring to a race condition I accidentally created when trying to implement my patch. As far as I know, there is no race condition other than running out of PipeWire buffers and prematurely clearing the damage state.

> Maybe you could create a script with something like ydotool to help me reproduce it?

From my understanding, it would seem easy to simulate my situation by intentionally delaying PipeWire buffers on their way back to the capturer. I'm not sure how to make that happen externally, though. Perhaps we could create a dummy client for testing?

@columbarius
Collaborator

@Need4Speed402 could you please test whether we are still skipping frames with #141? I added a loop to wait for PipeWire buffers without draining the queue every time.

@Nefsen402
Author

2021-08-08.12-21-15.mp4

Unfortunately, this doesn't fix my issue, and worse, it consumes 25% CPU time. Anyway, I wouldn't worry too much about it. I'm doing some work on my local branch where wlr_screencast.c is rewritten to be just a glorified callback; pipewire_screencast.c calls all the shots, so to speak. Even the patch I submitted doesn't completely fix the problem, especially if my system is under heavy load, so I'm trying to take a radically different approach. I'm basically trying to get rid of the two concurrent loops we currently have; the wlr_screencast one spins multiple times on my machine for every time the PipeWire loop resolves a buffer. I'm just trying to turn it into straight-line linear code, but this is still an experiment at this point.

@columbarius
Collaborator

@Nefsen402 Do you want to test whether the issue still remains? The buffer copy has been reworked a lot since then.

@Nefsen402
Author

Nefsen402 commented Jun 2, 2022

Unfortunately, the bug is still present.

In fact, my GPU reset and had a complete freakout after a couple of seconds of testing 🤷

@Nefsen402
Author

I cannot reproduce this anymore.
