Replies: 13 comments 14 replies
-
I'm not sure what is possible. I'll take a look and get back to you. It does sound like there is a missing timeout somewhere. I've experienced that myself with HttpClient, where a system seems stable enough for a few days until all the connections are busy waiting for servers that will never respond. One thing you can double-check on your side is that you are closing all result iterators and closing all connections after you're done with them.
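For reference, a generic sketch of that pattern (not taken from this thread's code): try-with-resources closes both the query result and the connection, even if an exception is thrown mid-iteration.

```java
// Generic sketch: make sure both the connection and the result iteration are closed.
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;

public class CloseEverythingSketch {

    static void runQuery(Repository repo, String sparql) {
        try (RepositoryConnection con = repo.getConnection();
             TupleQueryResult result = con.prepareTupleQuery(QueryLanguage.SPARQL, sparql).evaluate()) {
            while (result.hasNext()) {
                result.next(); // process each binding set here
            }
        } // result and connection are closed here, in reverse order, even on exceptions
    }
}
```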
-
Thank you for your reply and help! After investigating my particular problem a bit more, I realized that running several federated queries with overlapping repo URLs (on the same rdf4j-workbench instance; not sure that matters though) can get you into what seems to be a deadlock if they are fired in close succession. This deadlock then persists, doesn't time out, and doesn't produce any error messages. Being able to set a timeout likely solves this issue, if my understanding of the situation is correct. If more investigation is needed, I can try to make a simple reproducible example.
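For context, the kind of setup meant here looks roughly like the following sketch (endpoint and repository URLs are placeholders, not the actual deployment): several threads fire federated queries with overlapping SERVICE endpoints at the same rdf4j-server instance almost simultaneously.

```java
// Sketch of the scenario described above (placeholder URLs, not the actual deployment):
// several federated queries with overlapping SERVICE endpoints are fired at the same
// rdf4j-server instance in close succession.
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class FederationBurstSketch {
    public static void main(String[] args) {
        // hypothetical repository endpoint behind rdf4j-workbench
        Repository repo = new SPARQLRepository("http://localhost:8080/rdf4j-server/repositories/test");
        repo.init();

        String federatedQuery =
                "SELECT * WHERE { SERVICE <https://example.org/sparql> { ?s ?p ?o } } LIMIT 10";

        Runnable fireQuery = () -> {
            try (RepositoryConnection con = repo.getConnection();
                 TupleQueryResult result =
                         con.prepareTupleQuery(QueryLanguage.SPARQL, federatedQuery).evaluate()) {
                while (result.hasNext()) {
                    result.next(); // just drain the results
                }
            }
        };

        // fire a handful of these almost simultaneously
        for (int i = 0; i < 5; i++) {
            new Thread(fireQuery).start();
        }
    }
}
```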
-
Regarding closing connections: yes, I have already looked into that, and I am closing them all properly as far as I can see.
-
I found a workaround for this via the config of the Nginx HTTP server running in front of rdf4j-workbench, restricting how many requests can run concurrently against the relevant URIs (see the sketch below).

I have been running the server like this for 6 days now, and all the problems have gone, except that I am now getting "503 Service Temporarily Unavailable" errors under high load. This is not per se a problem (it's good that it fails fast), but the overall performance of rdf4j-workbench is now quite seriously reduced, I believe, by the fact that it cannot run concurrent queries anymore. So it's only a workaround and not a fix, and it would be great if this issue could somehow be addressed on the RDF4J side (as I am now even more confident than before that there is an issue there). Also, I haven't really stress-tested it, so it may be that the above setting can still trigger deadlocks and just makes them less likely.

Of course, if there is anything I can do to help fix this issue, I am more than happy to do so, even looking at your source code, if you can point me to the place where I should start looking. And lastly, let me take the opportunity to thank you for providing these incredibly powerful and immensely useful pieces of software! RDF4J is absolutely amazing 🙂
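A sketch of the kind of Nginx setting that has this effect (the directives and values below are illustrative assumptions, not the config actually used):

```nginx
# Illustrative sketch only: cap concurrent requests to the RDF4J endpoints at 1,
# so overlapping federated queries can no longer pile up behind each other.
# Excess requests are rejected with 503, matching the fail-fast behaviour described above.

# inside the http { } block
limit_conn_zone $server_name zone=rdf4j_conn:10m;

server {
    listen 80;

    location /rdf4j-server/ {
        limit_conn rdf4j_conn 1;
        proxy_pass http://localhost:8080/rdf4j-server/;
    }
}
```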
-
I believe that the main culprit is that there is no timeout configured on the HTTP client used for the SERVICE requests. We should add support for users to configure the timeout.
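Maybe something along these lines (a rough sketch of setting timeouts on an Apache HttpClient 4.x via RequestConfig; where exactly such a client gets wired into the SERVICE resolver is an assumption here, not the actual change):

```java
// Sketch only: build an Apache HttpClient with explicit timeouts. The values and the
// idea of exposing them as a user setting are illustrative; this is not the actual patch.
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class TimeoutClientSketch {

    static CloseableHttpClient buildClient(int timeoutMillis) {
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(timeoutMillis)           // establishing the TCP connection
                .setConnectionRequestTimeout(timeoutMillis) // waiting for a connection from the pool
                .setSocketTimeout(timeoutMillis)            // waiting for response data
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(requestConfig)
                .build();
    }
}
```

On the client side, a preconfigured client like this can in principle be injected into a SPARQLRepository via setHttpClient(...); for SERVICE calls made by the server, the client is managed internally, as far as I can tell, which is why a configuration option is needed.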
-
Great to see that you found a potential solution. I'd be happy to test it on my side, if you want to provide me with a patch or branch with this code change.
-
Super, thanks, I will try to find time to try it out in the next few days and will let you know!
-
I did some local tests and I can confirm that with the configurable timeouts from your branch, the queries now time out properly instead of hanging forever. I was only wondering why the default timeouts seem to be set to 24h instead of a smaller, more practical amount of time? Thanks a lot for this fast fix!
-
OK, I managed to reproduce the problem in a few simple steps on a fresh RDF4J instance. See here: https://github.com/tkuhn/rdf4j-timeout-test

It doesn't even involve issuing many queries in close succession: a single slightly complex query (Query 2) is sufficient to trigger it. With the latest release, it hangs forever. With the improved branch with the timeouts, it properly times out, but in either case the repository remains blocked afterwards, and even simpler queries (like Query 1) no longer work.

I hope that repo has all the things needed to reproduce it on your side with minimal effort. If there is anything I can do to improve this test, let me know.
-
I just noticed that the latest Docker image seems to include the timeout changes now. I was also wondering whether there are any updates on the root cause of this? Happy to invest more time and effort on my side on this, if it helps.
-
Any updates on this by chance? In any case, you mentioned that there seems to be a single connection thread for this, disconnected from where the other connection threads are handled. Could you maybe point me to the parts of the code where these two things happen, so I can try to have a look at it myself? I tried, but I didn't manage to find them.
-
I was looking into this yesterday. I found out that there are some queues that I think store intermediary results. When I increase the size of the queues, your query suddenly works fine. I added some logging, and I can see that when the queues are too small, they fill up, and then the code locks up waiting for elements to be removed from the queue.

I'm not completely confident in how the code works, but I get the feeling that the different layers in the query evaluation are consuming results from the child layers, and at some point one of the steps needs to produce all the intermediary results before it can continue. When the queue isn't big enough, it ends up locking up, with one part of the query waiting for another step to complete before it will start consuming the queue, while that other step is stuck because the queue is full.

I want to try to further understand what's actually going on in the code. I know that when I increase the size of the queues, the query works fine, but when I make the query more complex, it locks up again. So there is something wrong with the approach in the code, and I want to understand whether it's something that can be fixed or not.

I'll also try to make the sizes of the queues configurable. That way, you could increase the size to make your queries run as expected, even though it would still mean that other queries could lock up if they need even larger queues.
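To make the suspected pattern concrete, here is a generic illustration (plain java.util.concurrent, not the actual evaluation code): a producer blocks on a full bounded queue while the consumer only starts draining it after the producer has finished, so neither side can make progress.

```java
// Generic illustration of the deadlock pattern described above (not RDF4J's code):
// the producer blocks on put() once the bounded queue is full, while the consumer
// only starts draining the queue after the producer has finished, so neither makes progress.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueDeadlockSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> intermediaryResults = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                try {
                    intermediaryResults.put(i); // blocks forever once the queue holds 10 elements
                } catch (InterruptedException e) {
                    return;
                }
            }
        });

        producer.start();
        producer.join(); // "wait for the producing step to complete": never returns

        // draining only happens after the join, so the producer stays blocked
        while (!intermediaryResults.isEmpty()) {
            intermediaryResults.take();
        }
    }
}
```

Increasing the queue capacity only pushes out the point at which the producer blocks, which matches the observation that more complex queries still lock up even with larger queues.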
-
Thanks for the update! That sounds very promising. Making the size of the queues configurable would probably help. Quite possibly, increasing that value a bit would make my problem appear much less often, or possibly eliminate it altogether. If I can help in any way with this, let me know.
-
Hi all,
I am using the latest version of the RDF4J Workbench via Docker, and I am wondering whether there is a way to set HTTP client options (version, pooling, timeouts) for the connections made when federation is applied, i.e. the requests to the URL in "service <URL> { ... }"?

We have a use case where we use this federation extensively, and it normally works well, but it starts to hang after a number of hours or days. My suspicion is that it has something to do with the HTTP requests triggered by the service keyword, but I couldn't find any way to play around with the options of that HTTP client. Any ideas?
Tobias