Make FakeStream::{headers, trailers} thread safe to make clang-18 happy #38167

krinkinmu · 2025-01-23T17:18:33Z

Commit Message:

Those methods return references to internal protected members, so ideally all the callers should acquire a lock before calling those.

I suspect that we got away with returning reference so far because in most cases we don't actully need to synchronize when calling any of these methods because synchronization happens erlier when we call one of waitForX methods.

Anyways, the new Clang complains about it, so we can as well fix it in a way that makes it clear that the methods are thread safe.

This PR changes the the signature of the methods to return a shared pointer to const data instead of references where it's not the case already.

Additional Description:

Related to #37911 and fixes one of the issues in #38093

body() method has a similar problem, but it's addressed in a separate PR: #38265

Risk Level: Low
Testing: bazel test //test/server/config_validation:config_fuzz_test --config=clang-libc++ (that's how I found the issue in the first place) + all the regular release gating tests.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

+cc @phlax

repokitteh-read-only · 2025-01-23T17:18:39Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #38167 was opened by krinkinmu.

see: more, trace.

krinkinmu · 2025-01-23T23:13:05Z

/retest flaky test

adisuissa

Interesting! Thanks for pursuing that and paving the road towards a newer clang!
I agree that returning a reference to a data-member that is then not protected somewhat defeats the purpose of having a lock in the first place.
However, I think it may still be useful when the data-member is a pointer (so the pointer to the object isn't change concurrently).

Could you clarify a bit this:

The locks are really only needed to synchronize all the waitForX methods, accesors methods like body(), headers() and trailers() are called in tests after the appropriate waitForX method was called.

I was under the impression that the locks sync between the test's thread and the fake-upstream thread, so they are needed, regardless of the waitForX methods.

I agree that there may be some race-conditions here (as the main thread may access the internals of these objects, while the fake-upstream updates them), and honestly I haven't looked deep enough to understand what's going on, so feel free to shed more light here.

RE

Return copies of body, headers and trailers instead of references, create those copies under a lock - that would be the easiest way to let compiler know that the code is fine, but all three methods return abstract classes and currently there is no easy way to copy them (that's not to say, that copying is impossible in principle);

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

krinkinmu · 2025-01-24T15:41:47Z

I was under the impression that the locks sync between the test's thread and the fake-upstream thread, so they are needed, regardless of the waitForX methods.

Let me clarify what I meant here, we still do need to synchronize between threads, but by the time we call body(), trailers() or headers() the syncrhonization already "happened". To clarify here is a hypothetical example of how a tests look:

do the test setup
trigger some action that you want to test
wait for a syncrhonization event with say waitForHeadersComplete
call headers() to check the result

In this scenario, by the time we headers() are called, we already synchronized. Unless headers get changed again somehow we don't really need a lock (and if they do, the current setup still does not protect us).

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

I think it might work for some cases, but it would not protect against situations when body gets modified in place. In this case if we don't copy data we still will have a race condition - that seems like it defeats the purpose. Am I missing something?

Here

envoy/test/integration/fake_upstream.cc

Lines 59 to 63 in 55c6947

    
           void FakeStream::decodeData(Buffer::Instance& data, bool end_stream) { 
        
             received_data_ = true; 
        
             absl::MutexLock lock(&lock_); 
        
             body_.add(data); 
        
             setEndStream(end_stream);

. decodeData can be called multiple times appending to the existing buffer. I don't think we can replace it every time we call decodeData without changing semantics and if don't replace it every time we modify it, then even if we return a pointer instead of reference, accessing the body through the pointer still isn't thread safe.

I will double check, of course, but it seems to me that some copying is still needed here to make it thread safe.

krinkinmu · 2025-01-24T15:44:54Z

FWIW, I'm happy to spend more time on this and fix it properly rather than wave locks. I've only commented above because I didn't exactly understood how we can avoid copying.

adisuissa · 2025-01-24T15:56:27Z

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

I think it might work for some cases, but it would not protect against situations when body gets modified in place. In this case if we don't copy data we still will have a race condition - that seems like it defeats the purpose. Am I missing something?

Here

envoy/test/integration/fake_upstream.cc

Lines 59 to 63 in 55c6947

void FakeStream::decodeData(Buffer::Instance& data, bool end_stream) {

received_data_ = true;

absl::MutexLock lock(&lock_);

body_.add(data);

setEndStream(end_stream);

. decodeData can be called multiple times appending to the existing buffer. I don't think we can replace it every time we call decodeData without changing semantics and if don't replace it every time we modify it, then even if we return a pointer instead of reference, accessing the body through the pointer still isn't thread safe.
I will double check, of course, but it seems to me that some copying is still needed here to make it thread safe.

This is because the body isn't a pointer. If it were changed to a pointer to an OwnedImpl, then whenever it is fetched, that pointer is replaced with a new OwnedImpl, and the old one is returned to the caller.

krinkinmu · 2025-01-24T16:10:24Z

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

I think it might work for some cases, but it would not protect against situations when body gets modified in place. In this case if we don't copy data we still will have a race condition - that seems like it defeats the purpose. Am I missing something?
Here

envoy/test/integration/fake_upstream.cc

Lines 59 to 63 in 55c6947

void FakeStream::decodeData(Buffer::Instance& data, bool end_stream) {

received_data_ = true;

absl::MutexLock lock(&lock_);

body_.add(data);

setEndStream(end_stream);

. decodeData can be called multiple times appending to the existing buffer. I don't think we can replace it every time we call decodeData without changing semantics and if don't replace it every time we modify it, then even if we return a pointer instead of reference, accessing the body through the pointer still isn't thread safe.
I will double check, of course, but it seems to me that some copying is still needed here to make it thread safe.

This is because the body isn't a pointer. If it were changed to a pointer to an OwnedImpl, then whenever it is fetched, that pointer is replaced with a new OwnedImpl, and the old one is returned to the caller.

Imagine the following scenario:

Thread 1, adds some data to the body
Thread 2, thread-safely get a shared pointer to the body
Thread 1, adds more data to the body
Thread 2, access the body through the pointer

In this case if we allow thread 1 to append data to the body bit-by-bit (and not create a new body every time it calls decodeData), we will have a race condition between steps 3 and 4. On the other hand, if we don't allow thread 1 to append data to body bit-by-bit, it avoids a race condition, but it does seem like a change in semantics (currently it is possible to append to the body by calling decodeData multiple times).

adisuissa · 2025-01-24T16:55:17Z

I was under the impression that the locks sync between the test's thread and the fake-upstream thread, so they are needed, regardless of the waitForX methods.

Let me clarify what I meant here, we still do need to synchronize between threads, but by the time we call body(), trailers() or headers() the syncrhonization already "happened". To clarify here is a hypothetical example of how a tests look:

do the test setup

trigger some action that you want to test

wait for a syncrhonization event with say waitForHeadersComplete

call headers() to check the result

In this scenario, by the time we headers() are called, we already synchronized. Unless headers get changed again somehow we don't really need a lock (and if they do, the current setup still does not protect us).

2 points to consider:

wait has a timeout, so the call can exit before the headers are complete. In other words, the headers_ may be changed after the wait is called.
technically (although probably not in practice), the call to decodeHeaders(), if called twice for example, may move the pointer, and make it invalid while the main test thread is invoking 4. I suggest not relying on std::move to be considered an atomic function.

Reading the decodeHeaders() code I now understand why having the lock seems to be sufficient (taking into account only the base class). The idea is that the headers are changed (under a lock), and a new one is created (by the move there).
I haven't looked at the derived classes, so not sure if it all works well there.

adisuissa · 2025-01-24T16:57:38Z

This is because the body isn't a pointer. If it were changed to a pointer to an OwnedImpl, then whenever it is fetched, that pointer is replaced with a new OwnedImpl, and the old one is returned to the caller.

Imagine the following scenario:

Thread 1, adds some data to the body

Thread 2, thread-safely get a shared pointer to the body

Thread 1, adds more data to the body

Thread 2, access the body through the pointer

In this case if we allow thread 1 to append data to the body bit-by-bit (and not create a new body every time it calls decodeData), we will have a race condition between steps 3 and 4. On the other hand, if we don't allow thread 1 to append data to body bit-by-bit, it avoids a race condition, but it does seem like a change in semantics (currently it is possible to append to the body by calling decodeData multiple times).

Note that what I'm suggesting is to replace the owned impl - so instead of adding more data, it will return the current owned impl data, and create a new owned impl to replace the data member.

krinkinmu · 2025-01-24T17:15:40Z

Note that what I'm suggesting is to replace the owned impl - so instead of adding more data, it will return the current owned impl data, and create a new owned impl to replace the data member.

Yes, I understand that you're suggesting to replace the body and I also understand that it avoids a race condition.

What I'm trying to point out though, is that it also changes the semantics of the method. Basically, with this change we cannot append to the body anymore. Either call to the "body()" will reset the current value, or call to the decodeData will do it, but either way the old data will be lost if we don't copy it and that's not what the current implementation of these methods do.

krinkinmu · 2025-01-24T17:55:42Z

Maybe let me try to demonstrate my concern a bit more specifically and using an example.

Implementation for your suggestion, the way I understand it, might look something like this:

std::shared_ptr<Buffer::Instance> body() const {
    std::shared_ptr<Buffer::Instance> result;
    {
      absl::MutexLock lock(&lock_);
      result = body_;
    }
    return result;
}

 void decodeData(Buffer::Instance& data, bool end_stream) { 
   std::shared_ptr<Buffer::Instance> new_body = new Buffer::OwnedImpl(data);
   absl::MutexLock lock(&lock_); 
   received_data_ = true;
   body_ = new_body;
   setEndStream(end_stream);
}

Now, here is the alternative implementation that does additional copy:

std::shared_ptr<Buffer::Instance> body() const {
    std::shared_ptr<Buffer::Instance> result = new Buffer::OwnedImpl();
    {
      absl::MutexLock lock(&lock_);
      // instead of copying the pointer, we copy the data under a mutex
      result->add(body_);
    }
    return result;
}

 void decodeData(Buffer::Instance& data, bool end_stream) { 
   absl::MutexLock lock(&lock_); 
   received_data_ = true;
   body_.add(data);
   setEndStream(end_stream);
}

I think that both of these implementation do avoid a race condition, but they offer different behaviors. Let's look at this example to illustrate the difference:

decodeData("a", false);
decodeData("b", true);
assert(body().toString(), "ab");

For the first implementation that does not do a copy the assert check will fail, while for the second one it will not. I think that, leaving a race condition aside, the current implementation of body and decodeData behaves like the second alternative, not the first.

So it seems to me that if we want to preserve the current behavior we do need to copy data at some point. Am I misunderstanding your suggestion somehow?

adisuissa · 2025-01-24T18:01:41Z

Yeah, the body accessor semantics will need to change (but I'm not sure that this is a bad thing).
The alternative, if one wants to keep the semantics, is to hold 2 OwnedImpl objects, where decodeData() adds to object1, the accessor moves (not copies) the data from object1 and adds it to object2 (this is done under a lock), and object2 is returned.

krinkinmu · 2025-01-24T18:07:18Z

Yeah, the body accessor semantics will need to change (but I'm not sure that this is a bad thing). The alternative, if one wants to keep the semantics, is to hold 2 OwnedImpl objects, where decodeData() adds to object1, the accessor moves (not copies) the data from object1 and adds it to object2 (this is done under a lock), and object2 is returned.

I see, I think understand now what you mean.

Let me try first implement a version with just the pointers then and see how many tests (if any) actually depend on the current semantics. If none of them depends on the current semantics then it works. And if some do depend on the current semantics and cannot be easily fixed, then I can fallback to the move the approach with moving data blocks between two OwnedImpls.

alyssawilk · 2025-01-28T15:29:25Z

/wait on those changes

krinkinmu · 2025-02-05T18:45:53Z

/retest unicode library download failure

alyssawilk · 2025-02-06T13:56:36Z

Here's a thought, which I would not suggest outside of test.

What if we have upstream_thread_headers_ and client_thread_headers
upstream_thread_headers is guarded by the lock as it's accessed by both threads. It's inserted to when we get the headers. on first headers() access (client thread headers null), we grab the lock, move it to client_thread_headers (which is only accessed on the main test thread), and then we can always reference call into headers() (avoid the 1000 test changes) and avoid tsan complaining?
/wait-any

krinkinmu · 2025-02-06T14:36:25Z

Here's a thought, which I would not suggest outside of test.

What if we have upstream_thread_headers_ and client_thread_headers upstream_thread_headers is guarded by the lock as it's accessed by both threads. It's inserted to when we get the headers. on first headers() access (client thread headers null), we grab the lock, move it to client_thread_headers (which is only accessed on the main test thread), and then we can always reference call into headers() (avoid the 1000 test changes) and avoid tsan complaining? /wait-any

It works under the assumption that headers() can only be called from a single thread. While it's the case now, it does not seem like an assumption that is easy to enforce, if possible, my preference would be to avoid such assumptions.

If changing the interface is a big concern, maybe I can try and implement a thread safe wrapper around RequestHeaderMap? It would just delegate all the calls to another RequestHeaderMap implementation, but under a lock. This way we could return a reference to the map without changing the interface, because the RequestHeaderMap implemenetation itself is thread safe.

krinkinmu · 2025-02-07T09:50:34Z

@alyssawilk what would be your preference? Unless I'm overlooking something we have a few options:

Go with 2 maps as you suggested above?
Implement a thread safe wrapper around headers map?
Update a bunch of tests?

krinkinmu · 2025-02-11T15:28:15Z

@adisuissa @alyssawilk hey folks, I might have missed some communication or misunderstood something, can you please explain what the next steps for this PR should be?

My current understanding, is that we have a choice between a few options (briefly summarized in #38167 (comment)), but I don't feel like I got a signal on what of the options you prefer.

I don't feel particularly strongly about any of the options, so if I don't get any other feedback, I would probably implement the suggestion made by @alyssawilk in #38167 (comment), but before going that way, I wanted to draw attention that there are potentially other options on the table as well.

Looking forward to hearing your thoughts!

adisuissa · 2025-02-11T15:29:39Z

@alyssawilk what would be your preference? Unless I'm overlooking something we have a few options:

Go with 2 maps as you suggested above?

Implement a thread safe wrapper around headers map?

Update a bunch of tests?

@alyssawilk is out, so I'll present my preference.
IMHO this is a case where going for the more-correct solution (the current proposal - option 3) may cause more churn and work than the benefit it provides. As these are just tests, and there's an assumption that the upstreams will only be removed at the end of these tests, then it may be sufficient to go with option 1.
I don't fully grok option 2 (I think the ownership of the data is the issue we are observing here), but if it doesn't require too much work, and is straightforward, then it SGTM.

krinkinmu · 2025-02-12T11:29:17Z

@alyssawilk what would be your preference? Unless I'm overlooking something we have a few options:

Go with 2 maps as you suggested above?

Implement a thread safe wrapper around headers map?

Update a bunch of tests?

@alyssawilk is out, so I'll present my preference. IMHO this is a case where going for the more-correct solution (the current proposal - option 3) may cause more churn and work than the benefit it provides. As these are just tests, and there's an assumption that the upstreams will only be removed at the end of these tests, then it may be sufficient to go with option 1. I don't fully grok option 2 (I think the ownership of the data is the issue we are observing here), but if it doesn't require too much work, and is straightforward, then it SGTM.

Ah yes, I now see what you mean, header interface returns pointers and references, so we can't really protect it by locks and keep the interface as it is now. Ok, I will implement a two map approach then.

Returning reference trips thread safety analysis in clang-18, because returning a reference to a lock protected member is in general incorrect. I don't think that this is in practice causing any problems because tests call one of the waitForX functions before calling trailers/headers methods and thus by the time trailes/headers is being called synchronization would have already happened (and in the same thread). Still, we do want to migrate to a newer toolchain, so we should address those warnings to make clang-18 and transitively Envoy CI happy. I'm changing trailers to return a copy of a shared pointer instead of returning a reference to a pointer. While it changes the return type, it should not affect any callers. For headers it's a bit more complicated, as it returns a reference. To avoid changing the interface and updating multiple tests that use the headers method we went with a somewhat different approach that relies on a couple of assumptions: 1. headers() method is only called from a single thread, which may or may not be different from the thread that calls decodedHeaders() 2. headers cannot be updated in place - only replaced completely. With those two assumptions we can maintain two header maps, one for internal use (headers_) and one for the clients (client_headers_). The client_headers_ map is only accessed in the headers() method and therefore does not require a lock. headers_ map can be accessed from multiple threads, so it does require a lock, but it's purely internal and we do not expose it outside of the class. In the headers() method call we move the data from headers_ to client_headers_ if the data changed and return a reference to the new client_headers_. Signed-off-by: Mikhail Krinkin <[email protected]> Signed-off-by: Mikhail Krinkin <[email protected]>

krinkinmu · 2025-02-12T16:26:19Z

@adisuissa I think I'm finished with the implementation now.

adisuissa

Thanks!
Left a few comment nits, but otherwise LGTM.
Off to a senior-maintainer for approval.
/assign @RyanTheOptimist

test/integration/fake_upstream.h

RyanTheOptimist

LGTM, modulo Adi's comments.

Signed-off-by: Mikhail Krinkin <[email protected]>

adisuissa

LGTM, thanks!

krinkinmu force-pushed the fix-thread-safety-checks branch from 27e05ca to d43f17a Compare January 23, 2025 18:13

krinkinmu changed the title ~~Correct locking in FakeStream~~ Waive need for locking when calling FakeStream::{body,headers,trailers} Jan 23, 2025

krinkinmu force-pushed the fix-thread-safety-checks branch from d43f17a to 55c6947 Compare January 23, 2025 18:26

krinkinmu marked this pull request as ready for review January 23, 2025 23:27

adisuissa reviewed Jan 24, 2025

View reviewed changes

adisuissa self-assigned this Jan 24, 2025

repokitteh-read-only bot added the waiting label Jan 28, 2025

krinkinmu force-pushed the fix-thread-safety-checks branch from 55c6947 to ae4e846 Compare January 30, 2025 11:11

krinkinmu requested review from alyssawilk, nezdolik, ravenblackx, wbpcode, yanavlasov, tyxia, ggreenway, mattklein123 and KBaichoo as code owners January 30, 2025 11:11

adisuissa assigned alyssawilk Feb 5, 2025

krinkinmu force-pushed the fix-thread-safety-checks branch 4 times, most recently from 3f38532 to 6bb0521 Compare February 5, 2025 17:41

repokitteh-read-only bot added the waiting:any label Feb 6, 2025

repokitteh-read-only bot removed the waiting:any label Feb 6, 2025

alyssawilk removed their assignment Feb 6, 2025

krinkinmu mentioned this pull request Feb 10, 2025

thread-safety-reference-return warning #35558

Closed

krinkinmu force-pushed the fix-thread-safety-checks branch 3 times, most recently from 5a32094 to 0b96354 Compare February 12, 2025 13:10

krinkinmu force-pushed the fix-thread-safety-checks branch from 0b96354 to 25c2323 Compare February 12, 2025 14:25

adisuissa reviewed Feb 12, 2025

View reviewed changes

test/integration/fake_upstream.h Outdated Show resolved Hide resolved

test/integration/fake_upstream.h Show resolved Hide resolved

test/integration/fake_upstream.h Outdated Show resolved Hide resolved

test/integration/fake_upstream.h Show resolved Hide resolved

adisuissa assigned RyanTheOptimist Feb 12, 2025

RyanTheOptimist previously approved these changes Feb 12, 2025

View reviewed changes

Update comments to clarify the approach with two maps better

72961b8

Signed-off-by: Mikhail Krinkin <[email protected]>

krinkinmu dismissed RyanTheOptimist’s stale review via 72961b8 February 12, 2025 17:13

krinkinmu requested a review from adisuissa February 12, 2025 17:14

adisuissa approved these changes Feb 13, 2025

View reviewed changes

adisuissa merged commit 87b4ae8 into envoyproxy:main Feb 13, 2025
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make FakeStream::{headers, trailers} thread safe to make clang-18 happy #38167

Make FakeStream::{headers, trailers} thread safe to make clang-18 happy #38167

krinkinmu commented Jan 23, 2025 •

edited

Loading

repokitteh-read-only bot commented Jan 23, 2025

krinkinmu commented Jan 23, 2025

adisuissa left a comment

krinkinmu commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

adisuissa commented Jan 24, 2025

krinkinmu commented Jan 24, 2025 •

edited

Loading

adisuissa commented Jan 24, 2025

adisuissa commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

adisuissa commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

alyssawilk commented Jan 28, 2025

krinkinmu commented Feb 5, 2025

alyssawilk commented Feb 6, 2025

krinkinmu commented Feb 6, 2025

krinkinmu commented Feb 7, 2025

krinkinmu commented Feb 11, 2025

adisuissa commented Feb 11, 2025

krinkinmu commented Feb 12, 2025

krinkinmu commented Feb 12, 2025

adisuissa left a comment

RyanTheOptimist left a comment

adisuissa left a comment

Make FakeStream::{headers, trailers} thread safe to make clang-18 happy #38167

Make FakeStream::{headers, trailers} thread safe to make clang-18 happy #38167

Conversation

krinkinmu commented Jan 23, 2025 • edited Loading

repokitteh-read-only bot commented Jan 23, 2025

krinkinmu commented Jan 23, 2025

adisuissa left a comment

Choose a reason for hiding this comment

krinkinmu commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

adisuissa commented Jan 24, 2025

krinkinmu commented Jan 24, 2025 • edited Loading

adisuissa commented Jan 24, 2025

adisuissa commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

adisuissa commented Jan 24, 2025

krinkinmu commented Jan 24, 2025

alyssawilk commented Jan 28, 2025

krinkinmu commented Feb 5, 2025

alyssawilk commented Feb 6, 2025

krinkinmu commented Feb 6, 2025

krinkinmu commented Feb 7, 2025

krinkinmu commented Feb 11, 2025

adisuissa commented Feb 11, 2025

krinkinmu commented Feb 12, 2025

krinkinmu commented Feb 12, 2025

adisuissa left a comment

Choose a reason for hiding this comment

RyanTheOptimist left a comment

Choose a reason for hiding this comment

adisuissa left a comment

Choose a reason for hiding this comment

krinkinmu commented Jan 23, 2025 •

edited

Loading

krinkinmu commented Jan 24, 2025 •

edited

Loading