Setting HTTP client options for federation requests with rdf4j-workbench #5120

tkuhn · 2024-09-02T12:43:06Z

tkuhn
Sep 2, 2024

Hi all,

I am using the latest version of the RDF4J Workbench via Docker, and I am wondering whether there is a way to set HTTP client options (version, pooling, timeouts) for the connections made when federation is applied (so the request to URL in "service <URL> { ... }"?

We have a use case where we use this federation extensively, and it normally works well, but starts to hang after a number of hours or days. My suspicion is that it has something to do with the HTTP requests triggered by the service keywords, but I couldn't find any way how I could play around with the options of that HTTP client. Any ideas?

Tobias

hmottestad · 2024-09-03T10:02:58Z

hmottestad
Sep 3, 2024
Maintainer

I'm not sure what is possible. I'll take a look and get back to you.

It does sound like there is a missing timeout somewhere. I've experienced that myself with HttpClient where a system seems stable enough for a few days until all the connections are busy waiting for servers that will never respond.

One thing you can double check on your side is that you are closing all results iterators and closing all connections after you're done with them.

0 replies

tkuhn · 2024-09-03T10:10:01Z

tkuhn
Sep 3, 2024
Author

Thank you for your reply and help!

After investigating my particular problem a bit more, I realized that running several federated queries with overlapping repo URLs (on the same rdfj4-workbench instance; not sure that matters though) can get you into what seems to be a deadlock if they are fired in close succession to each other. This deadlock then persists, doesn't timeout, and doesn't produce any error messages.

Being able to set a timeout likely solves this issue, if my understanding of the situation is correct.

If more investigation is needed, I can try to make a simple reproducible example.

0 replies

tkuhn · 2024-09-03T10:11:59Z

tkuhn
Sep 3, 2024
Author

wrt closing connections, yes I have already looked into that, and I am closing them all properly as far as I can see.

0 replies

tkuhn · 2024-09-09T05:45:02Z

tkuhn
Sep 9, 2024
Author

I found a workaround for this via the config of the Nginx HTTP server running in front of rdf4j-workbench:

limit_conn_zone $uri zone=uri:10m;

server {
    ...
    location / {
        ...
        # This line limits concurrent requests per URI/repo to 1:
        limit_conn uri 1;
        proxy_pass http://127.0.1.1:8080/;
    }
}

URIs like https://example.com/rdf4j-server/repositories/demo (without query part) correspond to repos, so that's why this effectively makes sure that the same repo isn't accessed by two requests at the same time, which could cause a deadlock.

I have been running the server like this for 6 days now, and all the problems have gone. Except that I am now getting "503 Service Temporarily Unavailable" errors under high load. This is not per se a problem (it's good that it fails fast), but the overall performance of rdf4j-workbench is now quite seriously reduced, I believe, by the fact that it cannot run concurrent queries anymore.

So, it's only a workaround and not a fix, and it would be great if this issue can be somehow addressed on the RDF4J side (as I am now even more confident than before that there is an issue there). Also, I haven't really stress-tested it, so it can be that the above setting can still trigger deadlocks, but just makes them less likely.

Of course, if there is anything I can do to help fix this issue, I am more than happy to do so! Even looking at your source code, if you can point me to the place where I should start looking.

And lastly, let me take the opportunity to thank you for providing these incredibly powerful and immensely useful pieces of software! RDF4J is absolutely amazing 🙂

0 replies

hmottestad · 2024-09-13T08:13:37Z

hmottestad
Sep 13, 2024
Maintainer

I believe that this is the main culprit:

rdf4j/core/http/client/src/main/java/org/eclipse/rdf4j/http/client/SharedHttpClientSessionManager.java

Line 327 in 5c956e1

    
           .setDefaultRequestConfig(RequestConfig.custom().setCookieSpec(CookieSpecs.STANDARD).build())

We should add support for users to configure the timeout. Maybe something like this:

	private CloseableHttpClient createHttpClient() {
		HttpClientBuilder nextHttpClientBuilder = httpClientBuilder;
		if (nextHttpClientBuilder != null) {
			return nextHttpClientBuilder.build();
		}

		RequestConfig requestConfig = RequestConfig.custom()
				.setConnectTimeout(Integer.parseInt(
						System.getProperty(
								"org.eclipse.rdf4j.client.http.connectionTimeout",
								String.valueOf(24 * 60 * 60 * 1000))
				))
				.setConnectionRequestTimeout(Integer.parseInt(
						System.getProperty(
								"org.eclipse.rdf4j.client.http.connectionRequestTimeout",
								String.valueOf(24 * 60 * 60 * 1000))
				))
				.setSocketTimeout(Integer.parseInt(
						System.getProperty(
								"org.eclipse.rdf4j.client.http.socketTimeout",
								String.valueOf(24 * 60 * 60 * 1000))
				))
				.setCookieSpec(CookieSpecs.STANDARD)
				.build();

		return HttpClientBuilder.create()
				.evictExpiredConnections()
				.setRetryHandler(retryHandlerStale)
				.setServiceUnavailableRetryStrategy(serviceUnavailableRetryHandler)
				.useSystemProperties()
				.setDefaultRequestConfig(requestConfig)
				.build();
	}

0 replies

tkuhn · 2024-09-13T09:39:25Z

tkuhn
Sep 13, 2024
Author

Great to see that you found a potential solution. I'd be happy to test it on my side, if you want to provide me with a patch or branch with this code change.

1 reply

hmottestad Sep 13, 2024
Maintainer

You can take a look here: #5125

tkuhn · 2024-09-16T06:43:41Z

tkuhn
Sep 16, 2024
Author

Super, thanks, will try to find time to try it out the next few days and will let you know!

0 replies

tkuhn · 2024-09-16T12:59:49Z

tkuhn
Sep 16, 2024
Author

I did some local tests and I can confirm that with rdf4j-workbench:5.0.3-SNAPSHOT I can no longer reproduce the errors that I was getting with rdf4j-workbench:5.0.2. I now get timeout or HTTP pooling errors after the specified timeouts run out, instead of permanent deadlocks. So the problem seems to be resolved indeed!

I was only wondering why the default timeouts seem to be set to 24h, instead of a smaller more practical amount of time?

Thanks a lot for this fast fix!

10 replies

tkuhn Sep 20, 2024
Author

No, I did it with curl request to the end point via the command line.

Actually, trying it out some more, it seems I had another instance of deadlock that didn't time out even with the new setting. But I am not sure, it might have been something else. I have to do it again in a more systematic fashion.

I will try to make a script that loads a fresh database and then runs queries to consistently reproduce the problem. That will be able to better tell us whether the problem is properly resolved, and also possibly find out more about the problem in the first place.

I will get back to you when I have that script and some results. Probably some time next week.

hmottestad Sep 20, 2024
Maintainer

Could you paste the curl command so I can see what is triggered on the sever side w.r.t. any resources that might not get closed automatically?

tkuhn Sep 20, 2024
Author

It's a bit complicated as with the URLs I am using there are some redirects involved that depend on my particular setting. But it boils down to curl commands like this one to our server (accessing one repo and connecting to two other repos via the service keyword):

$ curl 'https://query.knowledgepixels.com/repo/empty?query=select%20%2A%20where%20%7B%0A%20%20service%20%3Chttps%3A%2F%2Fquery.knowledgepixels.com%2Frepo%2Ffull%3E%20%7B%20select%20%2A%20%7B%20%3Fs1%20%3Fp1%20%3Fo1%20%7D%20limit%2010%20%7D%0A%20%20service%20%3Chttps%3A%2F%2Fquery.knowledgepixels.com%2Frepo%2Fmeta%3E%20%7B%20select%20%2A%20%7B%20%3Fs2%20%3Fp2%20%3Fo2%20%7D%20limit%2010%20%7D%0A%7D'

If you want to try it out locally, it would probably have to be something like this (assuming you have 3 repos named repo1, repo2, and repo3):

$ curl 'http://localhost:8080/rdf4j-server/repositories/repo1?select%20%2A%20where%20%7B%0A%20%20service%20%3Chttp%3A%2F%2Flocalhost%3A8080%2Frdf4j-server%2Frepositories%2Frepo2%3E%20%7B%20select%20%2A%20%7B%20%3Fs1%20%3Fp1%20%3Fo1%20%7D%20limit%2010%20%7D%0A%20%20service%20%3Chttp%3A%2F%2Flocalhost%3A8080%2Frdf4j-server%2Frepositories%2Frepo3%3E%20%7B%20select%20%2A%20%7B%20%3Fs2%20%3Fp2%20%3Fo2%20%7D%20limit%2010%20%7D%0A%7D'

Individually they work but if I run many of them in close succession with a loop like this:

$ for i in {1..10}; do (curl ... &); done

Then they start getting stuck.

hmottestad Sep 20, 2024
Maintainer

That's very interesting. You're running them one after the other and waiting for the results from the previous one. So the previous query should be closing everything and there shouldn't really be anything from one query that holds up the next one.

I wonder if the query is running into a rate limit at the remote SERVICE endpoint.

What happens when you reduce the timeouts? Do queries return an error because the Apache HttpClient connection timed out? Do you see any errors being logged in the workbench/server logs?

I'm wondering if this is an underlying issue in RDF4J, like if something that should be closed isn't getting closed. Maybe you could try out setting the keep-alive system property for the Apache HttpClient? It should be http.keepAlive. You can see all available properties here: https://cwiki.apache.org/confluence/display/HTTPCOMPONENTS/HttpClientConfiguration

tkuhn Sep 20, 2024
Author

No, the curl queries run in parallel and are not waiting for the previous one to complete. Note the & that I put at the end of the curl command in the for loop.

I am pretty sure that the queries are not hanging in the HTTP client, as at some point I logged how the messages are passed, and they were still being sent to the rdf4j-server but the responses stopped coming back.

Anyway, I will try to reproduce this with a small script as I wrote above, and then I can also experiment with different timeout values etc.

tkuhn · 2024-09-24T07:42:27Z

tkuhn
Sep 24, 2024
Author

OK, I managed to reproduce the problem in a few simple steps on a fresh RDF4J instance. See here: https://github.com/tkuhn/rdf4j-timeout-test

It doesn't even involve issuing many queries in close succession. A single slighlty complex query is sufficient to trigger it (Query 2). In the case of the latest release, it hangs forever. In the case of the improved branch with the timeouts, it properly times out, but in either case it remains blocked and also simpler queries (like Query 1) no longer work.

I hope that repo has all the things needed to reproduce it on your side with minimal effort. If there is anything I can do to improve this test, let me know.

1 reply

hmottestad Sep 24, 2024
Maintainer

Thanks.

Impressive that you've managed to trigger this with a single query. That should really help to try to understand what's going on.

tkuhn · 2024-11-19T11:04:48Z

tkuhn
Nov 19, 2024
Author

I just noticed that the latest Docker image eclipse/rdf4j-workbench:5.0.3 contains the new system property handling, so I can switch to that one, which is great. Thanks for that!

I was also wondering whether there are any updates on the root cause of this? Happy to invest more time and effort on my side on this, if it helps.

2 replies

hmottestad Nov 19, 2024
Maintainer

I didn't have time to dig that much, but from the looks of things I think that maybe there was only a single thread for connections since the code that configured the number of threads seemed to not be connected to where the threads were created.

I'm very grateful for the reproducible test case you made. I'm currently trying to finish things off for the upcoming 5.1.0 release, so I'm a bit busy at the moment but I would really like to find some more time to play with your test case.

tkuhn Nov 20, 2024
Author

Thanks for the reply and update. Great to see that this library is so actively developed and evolving.

If anything is unclear with the test case, let me know.

tkuhn · 2025-01-08T15:34:05Z

tkuhn
Jan 8, 2025
Author

Any updates on this by chance?

In any case, you mentioned that there seems to be a single connection thread for this, disconnected from where the other connection threads are handled. Could you maybe point me to the parts in the code where these two things happen, so I can try to have a look at it myself? I tried, but I didn't manage to find it.

0 replies

hmottestad · 2025-02-06T08:23:54Z

hmottestad
Feb 6, 2025
Maintainer

I was looking into this yesterday. I found out that there are some queues that I think store intermediary results. When I increase the size of the queues, your query suddenly works fine. I added some logging, and I can see that when the queues are too small, they will fill up, and then the code locks up waiting for elements to be removed from the queue. I'm not completely confident in how the code works, but I get the feeling that the different layers in the query evaluation are consuming results from the child layers, and at some point, one of the steps needs to produce all the intermediary results before it can continue. When the queue isn't big enough, it ends up locking up with one part of the query waiting for another step to complete before it will start consuming the queue, while that other step is stuck because the queue is full.

I want to try to further understand what's actually going on in the code. I know that when I increase the size of the queues, then the query works fine, but when I make the query more complex, it locks up again. So there is something wrong with the approach in the code, and I want to understand if it's something that can be fixed or not.

I'll also try to make the sizes of the queues configurable, that way, you could increase the size to make your queries run as expected. Even though it would still mean that other queries could lock up if they need even larger queues.

0 replies

tkuhn · 2025-02-11T05:55:08Z

tkuhn
Feb 11, 2025
Author

Thanks for the update! That sounds very promising.

Making the size of the queues configurable would probably help. Quite possibly, increasing that value a bit make my problem appear much less often, or possibly eliminate it altogether.

If I can help in any way with this, let me know.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Setting HTTP client options for federation requests with rdf4j-workbench #5120

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 13 comments 14 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Setting HTTP client options for federation requests with rdf4j-workbench #5120

tkuhn Sep 2, 2024

Replies: 13 comments · 14 replies

hmottestad Sep 3, 2024 Maintainer

tkuhn Sep 3, 2024 Author

tkuhn Sep 3, 2024 Author

tkuhn Sep 9, 2024 Author

hmottestad Sep 13, 2024 Maintainer

tkuhn Sep 13, 2024 Author

hmottestad Sep 13, 2024 Maintainer

tkuhn Sep 16, 2024 Author

tkuhn Sep 16, 2024 Author

tkuhn Sep 20, 2024 Author

hmottestad Sep 20, 2024 Maintainer

tkuhn Sep 20, 2024 Author

hmottestad Sep 20, 2024 Maintainer

tkuhn Sep 20, 2024 Author

tkuhn Sep 24, 2024 Author

hmottestad Sep 24, 2024 Maintainer

tkuhn Nov 19, 2024 Author

hmottestad Nov 19, 2024 Maintainer

tkuhn Nov 20, 2024 Author

tkuhn Jan 8, 2025 Author

hmottestad Feb 6, 2025 Maintainer

tkuhn Feb 11, 2025 Author

tkuhn
Sep 2, 2024

Replies: 13 comments 14 replies

hmottestad
Sep 3, 2024
Maintainer

tkuhn
Sep 3, 2024
Author

tkuhn
Sep 3, 2024
Author

tkuhn
Sep 9, 2024
Author

hmottestad
Sep 13, 2024
Maintainer

tkuhn
Sep 13, 2024
Author

hmottestad Sep 13, 2024
Maintainer

tkuhn
Sep 16, 2024
Author

tkuhn
Sep 16, 2024
Author

tkuhn Sep 20, 2024
Author

hmottestad Sep 20, 2024
Maintainer

tkuhn Sep 20, 2024
Author

hmottestad Sep 20, 2024
Maintainer

tkuhn Sep 20, 2024
Author

tkuhn
Sep 24, 2024
Author

hmottestad Sep 24, 2024
Maintainer

tkuhn
Nov 19, 2024
Author

hmottestad Nov 19, 2024
Maintainer

tkuhn Nov 20, 2024
Author

tkuhn
Jan 8, 2025
Author

hmottestad
Feb 6, 2025
Maintainer

tkuhn
Feb 11, 2025
Author