Conversation
Thanks for creating this PR, @jtuglu1! The patch seems much simpler now.
kfaraz left a comment:

Leaving a partial review, will try to finish going through the rest of the changes today.
Finished going through the bulk of the changes.
On the whole, the patch looks good. I have these major suggestions:
- For the time being, it would be cleaner to use `workerStateLock` consistently whenever accessing the `workers` map. We can try to improve this later.
- Avoid use of `.forEach()` and use `.compute()` instead, preferably encasing it in an `addOrUpdate` method similar to `TaskQueue`.
- Do not perform any heavy operation like metadata store access, metric emission, listener notification, etc. inside the `.compute()` lambda.
- Avoid throwing exceptions inside the lambda if they are just going to be caught back in the same method/loop. Instead, log an error and continue with the loop.
- Remove the priority scheduling changes for now.
- Reduce debug logging.
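The `addOrUpdate` wrapper suggested above might look roughly like this sketch (hypothetical names, not Druid's actual code): the map mutation happens inside the `compute()` lambda, while heavy side effects such as listener notification are deferred until after the lambda returns and the per-bin lock is released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the suggested addOrUpdate pattern: mutate the map
// entry inside compute(), but collect side effects (notifications, metrics)
// and perform them after the lambda returns, outside the bin lock.
class TaskBook
{
  private final ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
  final List<String> pendingNotifications = new ArrayList<>();

  void addOrUpdate(String taskId, String newStatus)
  {
    final String[] previous = new String[1];
    tasks.compute(taskId, (id, oldStatus) -> {
      previous[0] = oldStatus;
      return newStatus;          // cheap, in-memory update only
    });
    // Heavy work (listener notification, metric emission, metadata access)
    // happens here, after the per-bin lock has been released.
    if (!newStatus.equals(previous[0])) {
      pendingNotifications.add(taskId + ":" + newStatus);
    }
  }

  String get(String taskId)
  {
    return tasks.get(taskId);
  }
}
```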
I will move this out from 36.0.0 for now - it doesn't seem like something which should block the release.
@gianm any thoughts here?

I will try to take a look. It may take some time to get to it, since the changes look quite extensive. Have you run this on a real production at-scale cluster yet (something with hundreds or thousands of tasks running simultaneously, ideally)? If so, that's always helpful to know.

Yes, no observed issues. We run with close to 10k tasks at peak per cluster.
```diff
  // CAUTION: This method calls RemoteTaskRunnerWorkItem.setResult(..) which results in TaskQueue.notifyStatus() being called
- // because that is attached by TaskQueue to task result future. So, this method must not be called with "statusLock"
+ // because that is attached by TaskQueue to task result future. So, this method must not be called with "workerStatusLock"
```
Should this refer to `workerStateLock`?
```java
(key, taskEntry) -> {
  if (taskEntry == null) {
    // Try to find information about it in the TaskStorage
    Optional<TaskStatus> knownStatusInStorage = taskStorage.getStatus(taskId);
```
This is going to need to do a metadata call while holding a (partial) lock on `tasks`. I see the old code did it under `statusLock`, and also there's a resolved conversation about keeping this here. It's fine to keep it here, I suppose, but please include a comment about how this does a metadata call and may cause contention on `tasks`.

> I see the old code did it under `statusLock`, and also there's a resolved conversation about keeping this here. It's fine to keep it here, I suppose, but please include a comment about how this does a metadata call and may cause contention on `tasks`.

Yes, I can add a comment. The key point is that this will only lock a (hopefully small, depending on how `ConcurrentHashMap` determines the range size) subset of the task keys, allowing other tasks to continue their work.
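As a rough illustration of the contention being discussed (hypothetical names, not Druid's actual code), a `compute()` that falls back to storage might look like this; the comment marks where the per-bin lock is held during the lookup.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: falling back to a storage lookup inside compute()
// when the in-memory entry is missing. None of these names are Druid's
// actual APIs; 'storage' stands in for TaskStorage.
class TaskStatusCache
{
  private final ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
  private final Map<String, String> storage;

  TaskStatusCache(Map<String, String> storage)
  {
    this.storage = storage;
  }

  String resolve(String taskId)
  {
    return tasks.compute(taskId, (key, taskEntry) -> {
      if (taskEntry == null) {
        // CAUTION: this storage call runs while compute() holds the lock on
        // this key's bin, so a slow metadata read can stall other updates
        // that hash to the same bin (but not the whole map).
        Optional<String> known = Optional.ofNullable(storage.get(taskId));
        return known.orElse("UNKNOWN");
      }
      return taskEntry;
    });
  }
}
```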
```java
synchronized (workerStateLock) {
  workerToAssign = findWorkerToRunTask(taskItem.getTask());
  ...
  if (workerToAssign == null) {
```
It looks like this code will park and wait if the next task from `pendingTasks` can't be assigned. But there are situations where task A can't be assigned, but another, later task B can be assigned. For example, if there is 1 free slot, and task A has `requiredCapacity: 2` while task B has `requiredCapacity: 1`. Another example: if strong worker affinity is configured, and none of the affinity workers for task A are available, but affinity workers for task B are available.

The old logic would potentially iterate the entire `pendingTaskIds` looking for an assignable task, essentially allowing tasks to skip the line in case they required different capacity or different affinity workers. Please update the new logic to handle this case.
> The old logic would potentially iterate the entire `pendingTaskIds` looking for an assignable task, essentially allowing tasks to skip the line in case they required different capacity or different affinity workers. Please update the new logic to handle this case.

Yes, I thought about this. The older logic was a bit cumbersome and hard to read, and was overly conservative (slow) in its locking behavior; would you be opposed to simply rescheduling this task? I was thinking of extending this to some sort of priority/backoff queue to address this problem.
What do you mean by "rescheduling this task"?

The thing I'm worried about is that we need line-skipping behavior. Especially with strong worker affinity, it's important for tasks to be able to skip the line. For example: a typical config would have a set of affinity workers for batch tasks and a set for realtime tasks. When the batch affinity workers are full, we want to continue to assign realtime tasks to the realtime affinity workers.

So, if the solution does give tasks the ability to skip the line, it should be OK.
> What do you mean by "rescheduling this task"?

Send the task to the back of the queue and effectively just filter through the queue until you can find a task that's runnable, or do a timed wait backoff if none are found after a full iteration. This preserves FIFO ordering while still not causing HOL blocking.
I think we can address FIFO behavior in a follow-up. That is, prioritizing tasks in the queue based on `Task::getPriority()`, for example.
> Send the task to the back of the queue (effectively just filter through the queue until you can find a task that's runnable). This preserves FIFO ordering while still not causing HOL-blocking.

Sure, that's fine. But be careful to avoid a spin loop of rescheduling if no task is currently schedulable.
```java
@Override
public void shutdown(String taskId, String reason)
{
  if (!lifecycleLock.awaitStarted(1, TimeUnit.SECONDS)) {
```
Description
Clone of #18729, but merged into the current runner per @kfaraz's request.

I've seen the giant lock in `HttpRemoteTaskRunner` cause severe performance degradation under heavy load (200-500 ms per acquisition with 1000s of active tasks can slow down the `startPendingTasks` loop in `TaskQueue`). This leads to scheduling delays, which leads to more lag, which auto-scales more tasks, and so on. The runner also has a few (un)documented races abundant in the code. This overhead also slows down query tasks under load (e.g. MSQE and others) which utilize the scheduler for execution.

I'm attempting a rewrite of this class to optimize for throughput and safety.
Apart from the performance improvements/bug fixes, this will also include some new features:
I would ultimately like to make this the default `HttpRemoteTaskRunner` and have it run in all tests/production clusters, etc., as I think that would help catch more bugs/issues.

Performance Testing
Test results thus far have shown a ~100-300 ms speedup per task runner operation (`add()`, etc.). Over 1000s of tasks, this amounts to minutes of delay saved.

Release note
Speed up throughput and improve thread safety of HttpRemoteTaskRunner
This PR has: