
Improve readers by parallelizing I/O and compute operations #5401

Merged
merged 40 commits on Feb 10, 2025

Conversation


@ypatia ypatia commented Dec 6, 2024

Today, when a reader issues I/O requests to VFS, we block waiting for all I/O to finish before moving on to unfiltering. We then block again waiting for unfiltering to be done for all tiles before continuing to process the results.

This PR is part 1 of the effort to minimize wait-all points in the reader code: it removes the need to wait for all I/O to be done and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.

Part 2 will come in a future PR and will use async tasks for unfiltering as well, removing the need to wait until a tile is done unfiltering before proceeding to result processing and copying to the user buffers.
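For illustration, here is a minimal standalone sketch of the pattern (plain std::async futures rather than the actual TileDB ThreadPool; read_tile_from_vfs and unfilter are made-up stand-ins): each tile's I/O is submitted as its own async task, and unfiltering for a tile waits only on that tile's task instead of on a global wait-all barrier.

#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// Hypothetical stand-ins for the real reader machinery, for illustration only.
using Buffer = std::vector<char>;

Buffer read_tile_from_vfs(size_t i) {      // pretend I/O
  return Buffer(1024, static_cast<char>(i));
}

Buffer unfilter(const Buffer& filtered) {  // pretend unfiltering/decompression
  return filtered;
}

int main() {
  const size_t num_tiles = 8;

  // Kick off all I/O asynchronously; nothing blocks here.
  std::vector<std::shared_future<Buffer>> io_tasks;
  for (size_t i = 0; i < num_tiles; ++i)
    io_tasks.push_back(
        std::async(std::launch::async, read_tile_from_vfs, i).share());

  // Each tile is unfiltered as soon as *its own* I/O finishes, instead of
  // waiting behind a wait-all barrier for every tile to be read first.
  std::vector<std::future<Buffer>> unfilter_tasks;
  for (size_t i = 0; i < num_tiles; ++i)
    unfilter_tasks.push_back(std::async(std::launch::async, [t = io_tasks[i]] {
      return unfilter(t.get());  // waits only on this tile's I/O task
    }));

  for (auto& f : unfilter_tasks)
    printf("tile unfiltered, %zu bytes\n", f.get().size());
}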

[sc-59605]


TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations

@ypatia ypatia marked this pull request as ready for review December 7, 2024 15:05
@ypatia ypatia closed this Dec 7, 2024
@ypatia ypatia reopened this Dec 7, 2024
@ypatia ypatia marked this pull request as draft December 7, 2024 15:09
@ypatia ypatia marked this pull request as ready for review December 9, 2024 08:32
Base automatically changed from yt/sc-59606/threadpool_with_tasks to dev December 9, 2024 08:36
@ypatia ypatia marked this pull request as draft December 9, 2024 10:06
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch 3 times, most recently from ed9b334 to 446700a on December 19, 2024 13:11
@ypatia ypatia changed the title from "WIP: Improve readers by parallelizing I/O and compute operations" to "Improve readers by parallelizing I/O and compute operations" on Dec 20, 2024
@ypatia ypatia marked this pull request as ready for review December 20, 2024 07:59
@ypatia ypatia requested review from Shelnutt2 and rroelke December 20, 2024 08:04
@rroelke rroelke left a comment

I've left a couple comments/questions, but overall this looks good to me!

@@ -653,7 +653,7 @@ TEST_CASE_METHOD(

// specific relationship for failure not known, but these values
// will result in failure with data being written.
- total_budget_ = "15000";
+ total_budget_ = "40000";
Member

This is a pretty big bump. It's just due to the changes in the various sizeofs, right? Not much that can be done about it, I suppose. Can you quantify the change in memory usage?

Member Author

I am not expecting this to affect overall memory usage, as the memory budget we set constrains, in theory, the peak usage at any point in time. However, it's a 3x increase in the size of the result tile, so it very much affects the number of result tiles we can load in memory at the same time given that constrained budget, hence the number of iterations we'll need to load everything we need from disk, hence performance :(

I need to understand a bit better why this bump is so big, come back with a good explanation, and see if we can reduce it.

Member Author

@ypatia ypatia Jan 14, 2025

First of all, I reverted the changes in a couple of tests in this file that were marked as [.regression] and therefore not run, as they were meant to fail even before my changes. This particular one was one of them.

In general, though, I still needed to adapt the budgets in a couple of tests. My analysis showed that the new member variables we needed to add to the TileBase, Tile, and ResultTile classes as part of this work made the ResultTile class grow to almost double its size. In particular, sizeof(GlobalOrderResultTile<BitmapType>), a derived class of ResultTile, grew from 1576 to 3216 bytes. A big part of this increase was due to the recursive_mutex we added to the Tile class (64 bytes on my Mac). ResultTile stores vectors of TileTuples of Tiles, so that easily adds up.

Another side effect of the members we added is that they might have a different size on different architectures, and this led to an inconsistent number of internal loops in test code that was meant to count them, so I had to disable that test. That test did show something interesting on my Mac, though: loop_num increased from 9 to 20 in that case. This hints that we might end up looping more internally in environments with a restricted memory budget (like the TileDB Server) because of the increase in the size of ResultTile.

I couldn't think of an obvious way to reduce the size of ResultTile in the current design of this PR, so we definitely need to stay vigilant about the performance impact of that increase on different readers and queries.

@rroelke @Shelnutt2 for awareness
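As a quick, platform-dependent way to see where part of that growth comes from (a standalone probe, not TileDB code):

#include <cstdio>
#include <mutex>

int main() {
  // Size is platform/ABI dependent; the 64 bytes mentioned above is what a
  // macOS build reports, and every Tile in every TileTuple of a ResultTile
  // carries one such mutex.
  printf("sizeof(std::recursive_mutex) = %zu\n", sizeof(std::recursive_mutex));
}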

// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a
// lock is held forever while waiting for a result to be available, while
// the next scheduled task is deadlocking on that lock
rc.tile_->wait_all_coords();
Member

I feel a bit like this is the wrong place to call this: it will call it once per cell, but as I understand the changes it should be fine to call it just once per tile.
Hence you can add this after // Find a cell in the current result tile and then also once after GlobalOrderResultCoords rc(&*(rt_it[f]), cell_idx); in the // For all fragments loop of merge_result_cell_slabs.

I don't quite follow you about the deadlock situation - this seems to me like a correctness question. We can't put result tiles into the priority queue until the data is loaded.

Member Author

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

I don't quite follow you about the deadlock situation

I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit sporadically in testing, which is due to the current design of our ThreadPool:

  • Overall this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task will block waiting on the I/O task to be ready, so that it can start unfiltering the result_tile that was read. Any processing that follows, such as merge_result_cell_slabs, will then wait for the unfiltering task to be ready at the point in time that it needs to use the data.
  • Our current, rather naive ThreadPool uses a sort of task-stacking algorithm: when a task running on a thread blocks on another task that is not yet ready, it will try to execute another one from the pool of available tasks on the same thread, so that we don't end up in a situation where all threads are blocked waiting. When that other task is done, it will check whether the previous task is now ready to continue executing.

Now the deadlock scenario I was experiencing with this mutex is the following:

  • Say we have 3 threads in our pool. Thread 1 is executing the main reader code. Thread 2 is running the async I/O task for ResultTile N. Thread 3 is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantime, execution on Thread 1 (the main reader code) reaches the parallel_for in merge_result_cell_slabs. There, a few async tasks are created that need to acquire the lock on tile_queue_mutex_ and then access the unfiltered data of, among others, ResultTile N.
  • Now it's possible that the async task that acquires the lock and wants to access ResultTile N gets scheduled (stacked) on Thread 3, which is blocked. This is our deadlock. Once the I/O task is done, unfiltering on Thread 3 should be unblocked, but this is not possible because it is stacked under the async task that took the mutex, which is itself blocked waiting for that unfiltering task to be done :(

IMO this is a design flaw of our ThreadPool, and it's not the only deadlock situation I've had to cope with in this work. I also had to make compromises in terms of performance elsewhere; for example, we used to unfilter a tile in parallel chunks, but a similar deadlock situation would arise in that case too, exclusively due to that stacking of tasks in the threadpool, which has unexpected side effects such as this one.
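To make that task-stacking behaviour concrete, here is a rough simplification (my own sketch with made-up names, not the actual ThreadPool code) of a wait() that keeps executing other queued tasks on the calling thread while the awaited result is not ready:

#include <chrono>
#include <deque>
#include <functional>
#include <future>
#include <mutex>

// Simplified sketch of a "task-stacking" wait; illustration only.
struct StackingPool {
  std::deque<std::function<void()>> queue_;
  std::mutex queue_mtx_;

  // While the awaited future is not ready, keep running other queued tasks
  // on *this same thread*; the caller's frame stays suspended underneath.
  template <class T>
  T wait(std::shared_future<T> f) {
    while (f.wait_for(std::chrono::milliseconds(0)) !=
           std::future_status::ready) {
      std::function<void()> next;
      {
        std::lock_guard<std::mutex> lg(queue_mtx_);
        if (!queue_.empty()) {
          next = std::move(queue_.front());
          queue_.pop_front();
        }
      }
      // If `next` takes a lock (e.g. tile_queue_mutex_) and then blocks on the
      // very task that is suspended underneath it (this tile's unfiltering),
      // neither frame can make progress: that is the deadlock described above.
      if (next) next();
    }
    return f.get();
  }
};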

Member

To destroy any ambiguity in my initial comment - I definitely agree that it is correct to call wait_all_coords somewhere in the vicinity of this function. My comment is really about the granularity. The task/future based approach does not give us feedback about when the tile is partially filled in, it either is not safely filled in or it is 100% filled in. Hence we only actually need to wait once per tile, not once per cell per tile, and I think the code should reflect that.

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

What I should have written instead of "We can't put result tiles into the priority queue until the data is loaded" was "This change strikes me as one for correctness, not one to do with deadlock".

The change in call site for this should be aligned with the goal of this PR. Both forms will wait for the first result tile from each fragment to be loaded before starting anything, and then subsequent tiles in each fragment will be awaited only once we reach them in the queue.

As for the deadlock itself -

it will try to execute another one from the pool of available tasks on the same thread

Oh yeah, I did see that in a stack trace I ran into earlier today. It just calls the function pointer on the same stack, doesn't it?

The scenario you describe looks like it would be pretty commonplace.

Proper coroutines would be great, wouldn't they!

Member Author

Ah yes, with coroutines we could suspend and resume tasks so problem solved 😍

Member Author

@ypatia ypatia left a comment

Thank you very much for the review, it was very useful. I think I addressed most of the comments but I'll have to follow up on one of them after taking a deeper look.


@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from c5899b7 to 260fc22 on January 7, 2025 14:23
@rroelke
Member

rroelke commented Jan 7, 2025

Thanks for the responses, I will approve once we have a little more info on the memory budget change.

@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch 3 times, most recently from c07ec20 to c8b6de6 on January 23, 2025 14:57
@ypatia
Member Author

ypatia commented Jan 24, 2025

This long-standing PR has suffered from 2 pain points:

  • Instability in CI: ASAN failures and hangs in Windows tests would come and go, which means there were still lifetime and deadlock issues to be addressed.
  • Performance regression in legacy readers, and no improvement in the overall picture of SOMA benchmarks.

I have therefore decided to finally split this work into 2 parts:

  • Part 1 (this PR): removes the need to wait for all I/O to be done and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.
  • Part 2 (future PR): removes the need to wait until a tile is done unfiltering, by using async tasks for unfiltering as well, in order to proceed to result processing and then to copying to the user buffers.

This PR implements part 1; CI is passing reliably and benchmarks show no regression, but no significant speedup either. I think it can be safely merged.

@ypatia
Member Author

ypatia commented Jan 30, 2025

This long-standing PR has suffered from 2 pain points:

  • Instability in CI: ASAN failures and hangs in Windows tests would come and go, which means there were still lifetime and deadlock issues to be addressed.
  • Performance regression in legacy readers, and no improvement in the overall picture of SOMA benchmarks.

I have therefore decided to finally split this work into 2 parts:

  • Part 1 (this PR): removes the need to wait for all I/O to be done and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.
  • Part 2 (future PR): removes the need to wait until a tile is done unfiltering, by using async tasks for unfiltering as well, in order to proceed to result processing and then to copying to the user buffers.

This PR implements part 1; CI is passing reliably and benchmarks show no regression, but no significant speedup either.

@rroelke for re-reviewing/approving
@Shelnutt2 FYI

@rroelke rroelke left a comment

I have just a few more questions/comments before approving but this is almost done!

@@ -32,7 +32,7 @@ include(object_library)
#
commence(object_library filter)
this_target_sources(filter.cc filter_buffer.cc filter_storage.cc)
- this_target_object_libraries(baseline buffer tiledb_crypto)
+ this_target_object_libraries(baseline buffer tiledb_crypto thread_pool)
Member

Is this still needed? The unfiltering will be in part 2, right? Should this also be?

Member Author

Removed

/**
* Whether to block waiting for io data to be ready before accessing data()
*/
const bool skip_waiting_on_io_task_;
Member

TileBase is where this field is declared, yet TileBase itself does not have the I/O tasks. Is there a reason that this is here instead of in class Tile?

Similarly... is there ever a situation where the I/O task is valid and skip_waiting_on_io_task_ is true? Would std::optional<ThreadPool::SharedTask> be a more expressive way of doing the same thing?
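For illustration only (hypothetical types, not the actual Tile or ThreadPool declarations), the difference between the two representations might look like this:

#include <optional>

struct SharedTask {                 // stand-in for ThreadPool::SharedTask
  bool valid() const { return true; }
  void wait() {}
};

// Current shape: a flag plus a task that may or may not be valid.
struct TileWithFlag {
  bool skip_waiting_on_io_task_;
  SharedTask io_task_;
  void ensure_loaded() {
    if (!skip_waiting_on_io_task_ && io_task_.valid())
      io_task_.wait();
  }
};

// Suggested shape: absence of the task itself encodes "nothing to wait on".
struct TileWithOptional {
  std::optional<SharedTask> io_task_;
  void ensure_loaded() {
    if (io_task_)
      io_task_->wait();
  }
};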

Member Author

Sounds like a way better way. Let me try to make it an optional.

Member Author

As for why this is in TileBase and not in Tile, the reason is that it's also needed in WriterTile, which inherits from TileBase.

@rroelke
Member

rroelke commented Feb 3, 2025

I built this branch and ran the tiledb-rs property tests on it and the most complicated of them passed with 20000+ iterations

This test generates a random sequence of writes, inserts them all, and then runs a few read queries to validate that we read back what we wrote.

Hopefully this lends a little extra confidence :)

Next step would be to run the rapidcheck tests from the other branch over here

@ypatia
Member Author

ypatia commented Feb 4, 2025

I built this branch and ran the tiledb-rs property tests on it and the most complicated of them passed with 20000+ iterations

This test generates a random sequence of writes, inserts them all, and then runs a few read queries to validate that we read back what we wrote.

Hopefully this lends a little extra confidence :)

Next step would be to run the rapidcheck tests from the other branch over here

This is great, thank you for taking this initiative!

@rroelke rroelke left a comment

All right, let's merge it!

@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch 2 times, most recently from 2b490ad to 25d2648 on February 6, 2025 11:08
This removes the reader from waiting on all I/O operations and instead
moves the I/O task to be owned by the datablock itself. If the I/O
threadpool task is valid, we block on data access. This lets I/O and
compute be interleaved by only blocking on data when it's ready to be
processed and allows for better background data loading.
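A rough sketch of that ownership pattern (illustrative stand-ins only, not the actual TileDB classes): the block stores its own I/O task, and data() blocks on it only the first time the bytes are actually needed.

#include <future>
#include <utility>
#include <vector>

class DataBlock {
 public:
  explicit DataBlock(std::shared_future<std::vector<char>> io_task)
      : io_task_(std::move(io_task)) {}

  // Block on the I/O task only when the data is actually accessed, so I/O
  // and compute can interleave until the bytes are really needed.
  const std::vector<char>& data() {
    if (io_task_.valid()) {
      bytes_ = io_task_.get();  // waits only for *this* block's I/O
      io_task_ = {};            // subsequent accesses are cheap
    }
    return bytes_;
  }

 private:
  std::shared_future<std::vector<char>> io_task_;
  std::vector<char> bytes_;
};

// Usage idea (hypothetical load_bytes function):
//   DataBlock block(std::async(std::launch::async, load_bytes).share());
//   ... schedule more I/O, do other work ...
//   const auto& bytes = block.data();  // waits only when first read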
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from f9fac4b to b8a54f9 on February 10, 2025 10:58
@ypatia
Member Author

ypatia commented Feb 10, 2025

Smoke testing with benchmarks backed by minio showed no regression (nor speedup), so I am merging it with enough confidence.

@ypatia ypatia merged commit 1ca277c into main Feb 10, 2025
59 checks passed
@ypatia ypatia deleted the yt/sc-59605/dont_block_io branch February 10, 2025 20:38
@teo-tsirpanis
Member

/backport to release-2.27

Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13286609792

Contributor

@teo-tsirpanis backporting to release-2.27 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: Move I/O wait to datablock instead of reader.
Applying: Switch to SharedTask in order to allow multi-threaded access to future state
Applying: Fix unit test compilation
Applying: WIP: parallelize filter pipeline and interleave comparisons
Applying: Switch to ThreadPool::SharedTask instead of shared_ptr
Applying: Add recursive_mutex for thread-safety of tile ThreadPool::SharedTask checking.
Applying: WIP: try to store reference to FilterData on result tile, need to fix forward declaration issues currently
Applying: Adjust lambdas and avoid task components going out of scope.
Applying: Add new data_unsafe to Tile accessorsa.
Applying: Add stats tracking to new tasks for reading and unfiltering tiles
Applying: Fix unit test compilation
Applying: Add new zip_coordinates_unsafe
Applying: Wait until tasks are done before freeing tiles
Applying: Remove redundant shared future get
Applying: Fix null counts, check tasks are valid and other fixes
Applying: Fix RLE and dict decompression
Applying: Fix budget tests, g.o.r. result tile is now 3496 bytes in size
Using index info to reconstruct a base tree...
M	test/src/unit-sparse-global-order-reader.cc
Falling back to patching base and 3-way merge...
Auto-merging test/src/unit-sparse-global-order-reader.cc
CONFLICT (content): Merge conflict in test/src/unit-sparse-global-order-reader.cc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0017 Fix budget tests, g.o.r. result tile is now 3496 bytes in size
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Contributor

@teo-tsirpanis an error occurred while backporting to release-2.27, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

@teo-tsirpanis
Member

/backport to release-2.27

Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13294173181

ypatia added a commit that referenced this pull request Feb 13, 2025
As @ihnorton noticed while auditing the code in
#5401, in the result tile destructor we were waiting on `data()` instead
of `filtered_data()` for the async task to have finished. This was a
leftover from the previous version of the code (before the split between
parts 1 and 2), where we were also unfiltering in an async fashion, so
`data()` was also blocking while waiting for the unfiltering shared
future to be done.

This PR fixes this by waiting just for the I/O task to be done.

---
TYPE: IMPROVEMENT
DESC: Result tile wait_all should block on the async I/O result
ihnorton pushed a commit that referenced this pull request Feb 14, 2025
…ute operations (#5401) (#5451)

Backport of #5401 to release-2.27

---
TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations

---------

Co-authored-by: Seth Shelnutt <[email protected]>
Co-authored-by: Ypatia Tsavliri <[email protected]>
Co-authored-by: Seth Shelnutt <[email protected]>