Spring Boot v3.4.0 causes our staging & production environment to hang and time out #43332
Comments
I think the critical information here is the state of the JVM threads. This would let us know what's preventing the app from serving requests. Can you capture this information and let us know? |
Hi @bclozel, Thank you for responding as quickly as you did. I created an endpoint that would return a thread dump like this:
Immediately after the application started on Cloud Run, I was able to collect the first thread dump here:
I can hit the unauthenticated endpoint "/" an unlimited number of times without any issues. After I authenticate with Spring Authorization Server 1.4.0 and try to hit a secure endpoint, the server becomes completely unresponsive: I can no longer capture a thread dump, and Cloud Run times out the connection after 300 seconds. At this point, the endpoint "/" no longer responds, whether I hit it unauthenticated or authenticated.
It looks like this applies to our other Spring Boot instances as well: the moment I am authenticated with SAS and then try to call any endpoint... 💥☠️ the server hangs. Nothing in any of the instance logs, including those of SAS, shows errors, looping, or anything out of the ordinary. Here's a screenshot of what all the instance logs look like afterwards:
I don't see anything out of the ordinary in the SAS v1.4.0 release notes compared with the SAS version that Spring Boot 3.3.6 depends on:
I also don't see anything that should concern us in the latest Spring Data MongoDB release:
The unresponsive app still logs MongoDB pings, as you can see in this screenshot [as if everything is hunky-dory]:
Let me know how else I can help here. It's not easy capturing a thread dump with Cloud Run when the app is in this, or any, state. |
It looks like something is blocking threads, or there is a memory leak or infinite recursion. The first thread capture doesn’t show anything related to Spring. Maybe you are using Java agents or instrumentation libraries that are not compatible with the latest Spring version? |
There is unfortunately no way to do a After your last remark I am leaning towards Sentry being the culprit. spring-io/start.spring.io#1647 Will continue my investigation there. Close issue at will. Cheers Brian 🍻 |
I don't think we can track down the source of the problem without a snapshot of the Java threads when the app is having issues. This could come from any library on your classpath, any Java agent, or an incompatibility with a remote resource. I haven't seen anything so far pointing to Spring Boot causing issues; we can reopen this issue if we find new information. I'm not familiar enough with Google Cloud Run, but not being able to connect to the JVM in any way is quite limiting. Maybe there is a way to configure the instance to open a port and connect a profiler to the running JVM? Closing this issue for now. |
@bjornharvold rather than using an endpoint to trigger the thread dump, perhaps you could temporarily add another thread to your app that does it periodically? |
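A minimal sketch of the periodic approach suggested above, using only the JDK's `ThreadMXBean`. The class name `PeriodicThreadDumper`, the thread name `thread-dumper`, and writing to standard output are assumptions for illustration; none of them come from this thread, and the reporter's actual scheduler was not shared here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: dumps all JVM threads on a fixed schedule.
class PeriodicThreadDumper {

    /** Renders the name, state, and full stack trace of every live thread. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked monitor/synchronizer details to keep the dump cheap
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            // Append frames explicitly; ThreadInfo.toString() truncates the stack
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Starts a daemon thread that prints a dump every {@code period} units. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "thread-dumper");
                t.setDaemon(true); // never keeps the JVM alive on its own
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Assumption: stdout is collected by the platform's log agent
                System.out.println(dump());
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs
                e.printStackTrace();
            }
        }, period, period, unit);
        return scheduler;
    }

    public static void main(String[] args) {
        // One-off dump for a quick local check
        System.out.println(dump());
    }
}
```

In an application, `PeriodicThreadDumper.start(10, TimeUnit.MINUTES)` would be called once at startup. Because the dump runs on its own thread rather than behind an HTTP endpoint, it can still produce output when request handling appears stuck.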
Hi @wilkinsona That could work. Will give that a go and report back. What would be the best way to handle the dump once done on the Cloud Run instance in your opinion? |
Hopefully it can be output to somewhere from which it can then be copied and pasted for further analysis. If you'd like some help with that analysis, you could paste it here (if it's not too long) or into a gist or the like. It's very unlikely that it'll contain anything that should be kept confidential, but please do double-check first just in case class or method names reveal more than you'd like. |
Hi @wilkinsona We added a 10-minute thread dump scheduler that writes to the GCP Cloud Log. Here's what it came back with after the Spring Boot app became unresponsive.
ChatGPT came back with this response:
|
Any thoughts on this would be appreciated. Our app runs great on 3.3.x, and has for the entirety of 3.3.x and 3.2.x. 🍻 |
Reformatted thread dump:
|
Spring Boot 3.3.x uses 5.0.x of the MongoDB Java driver and Spring Boot 3.4.x uses 5.2.x. Given that upgrade and the large number of Mongo-related threads in the dump that are waiting for something, it does indeed look like a Mongo-related problem. Neither Spring Boot nor Spring Data MongoDB are involved with Mongo at this low level and, in your position, I'd continue the investigation working on the assumption that it's a Mongo Java Driver problem. It appears they've had a similar problem in the past. One thing that's not clear from the dump is why the server's become unresponsive. I can't see any web request threads stuck waiting for something. Perhaps there wasn't a stuck request at the time of the dump? Regardless, I think addressing the apparent problem with Mongo should be the next step – it'll either solve the problem or eliminate a possible cause. |
Thank you @wilkinsona 🦸‍♂️
We will need to keep on investigating. 😑 |
If it's not Mongo that's the problem, then I would turn attention to the web server. Looking at the thread dump, you appear to be using Undertow (the XNIO threads). Have you tried with Tomcat or Jetty? If the problem doesn't occur when using WebFlux (and assuming that you're using the default of Netty there), that may be another pointer towards an Undertow problem. |
This seems to have resolved itself with Spring Boot v3.4.3 🤔. Nothing in the release notes that really stands out. At least no Undertow upgrade. Staging has not become unresponsive in the last 24 hours. We will keep on testing and report back. |
I can confirm that it has not been resolved yet. We just deployed to production with v3.4.3. The Spring Authorization Server instance is running fine. Our WebFlux instance runs fine. Our 2 Spring Web MVC instances that see the most traffic both started to hang at some point after the release. One was complaining about OutOfMemory, so we upgraded from 2 GB to 8 GB. The instance starts up and becomes unresponsive immediately, so it is not a memory issue. A few weeks ago we also replicated the setup locally with the same amount of available server memory to rule that out. Where we are at:
This all equates [approx] to 🫤 |
I am intrigued by this. How has the server identified that it hasn't hung and how are you being made aware of that? |
Hi @wilkinsona The last thread dump we shared was taken after the system had become unresponsive and there is nothing in there that implies it's hung. Also, the fact that we were able to take a thread dump from a scheduled task (not endpoint) shows that the instance is not completely unresponsive. That's how we came to this conclusion. Will now try to use Tomcat as you suggested earlier and report back. |
Thanks, I'm with you now. Given that the server's threads show no signs of being unable to handle an HTTP request, I also wonder if there's anything in front of the server that could be causing the problem. Perhaps it's only present in your staging and production environments which might explain why the dev environment is unaffected. It would hopefully be informative to know exactly where the requests are getting stuck. It doesn't look like it's in the server so, presumably, it's somewhere between the client and the server. Perhaps Google support can offer some guidance on diagnosing such a problem. |
...but why would Spring Boot v3.4.x be responsible for anything Google or Cloudflare does? v3.3.x is working like a champ. |
Something may have changed how the connections are managed. For example, perhaps connections are being kept alive now when previously they were not. |
I think you should engage Google support at this point. The goal being to understand why a server that appears to be ready and able to accept requests is unable to do so. Perhaps they can provide some diagnostics about the network. |
Our services work on all Spring Boot versions prior to v3.4.0 and have for years. v3.4.0 works in our dev environment, and we are unable to reproduce there what is occurring in staging and production.
We have 4 Spring Boot apps running on GCP
Here is what happens once the deployment hits staging / production:
GCP metrics (response times go to 300 seconds and requests time out after release)

MongoDB console (Activity decreases after release)

Cloud Run startup log that shows Spring Boot starting up error free and goes directly to timing out

Maven dependency tree:
dependencies.txt.zip
The next step for us would be to turn up the logging level to see whether anything interesting shows up. I just spent a Saturday afternoon rolling back production and trying to figure out where this was coming from. At this moment, I am completely clueless.
Any help would be appreciated.
We deploy Spring Boot with
mvn spring-boot:build-image
and the plugin config looks like this:
Development environment:
macOS Sequoia
Atlas CLI
MongoDB 8
Java 23
Providers:
MongoDB Atlas
MongoDB 8
Google Cloud Platform / Cloud Run
Java 23