
Improve readers by parallelizing I/O and compute operations #5401

Merged
merged 40 commits on Feb 10, 2025

Conversation


@ypatia ypatia commented Dec 6, 2024

Today, when a reader issues I/O requests to VFS, we block waiting for all I/O to finish before moving on to unfiltering. We then block again waiting for unfiltering to be done for all tiles before continuing to process the results.

This PR is part 1 of the effort to minimize wait-all points in the reader code: it removes the need to wait for all I/O to be done and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.

Part 2 will come in a future PR and will use async tasks for unfiltering as well, removing the need to wait until a tile is done unfiltering before proceeding to result processing and copying to the user buffers.
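For illustration, here is a minimal standalone sketch of the pattern (plain std::async futures rather than the actual TileDB ThreadPool; read_tile_from_vfs and unfilter are made-up stand-ins): each tile's I/O is submitted as its own async task, and unfiltering for a tile waits only on that tile's task instead of on a global wait-all barrier.

#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// Hypothetical stand-ins for the real reader machinery, for illustration only.
using Buffer = std::vector<char>;

Buffer read_tile_from_vfs(size_t i) {      // pretend I/O
  return Buffer(1024, static_cast<char>(i));
}

Buffer unfilter(const Buffer& filtered) {  // pretend unfiltering/decompression
  return filtered;
}

int main() {
  const size_t num_tiles = 8;

  // Kick off all I/O asynchronously; nothing blocks here.
  std::vector<std::shared_future<Buffer>> io_tasks;
  for (size_t i = 0; i < num_tiles; ++i)
    io_tasks.push_back(
        std::async(std::launch::async, read_tile_from_vfs, i).share());

  // Each tile is unfiltered as soon as *its own* I/O finishes, instead of
  // waiting behind a wait-all barrier for every tile to be read first.
  std::vector<std::future<Buffer>> unfilter_tasks;
  for (size_t i = 0; i < num_tiles; ++i)
    unfilter_tasks.push_back(std::async(std::launch::async, [t = io_tasks[i]] {
      return unfilter(t.get());  // waits only on this tile's I/O task
    }));

  for (auto& f : unfilter_tasks)
    printf("tile unfiltered, %zu bytes\n", f.get().size());
}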

[sc-59605]


TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations

@ypatia ypatia marked this pull request as ready for review December 7, 2024 15:05
@ypatia ypatia closed this Dec 7, 2024
@ypatia ypatia reopened this Dec 7, 2024
@ypatia ypatia marked this pull request as draft December 7, 2024 15:09
@ypatia ypatia marked this pull request as ready for review December 9, 2024 08:32
Base automatically changed from yt/sc-59606/threadpool_with_tasks to dev December 9, 2024 08:36
@ypatia ypatia marked this pull request as draft December 9, 2024 10:06
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch 3 times, most recently from ed9b334 to 446700a on December 19, 2024 13:11
@ypatia ypatia changed the title from "WIP: Improve readers by parallelizing I/O and compute operations" to "Improve readers by parallelizing I/O and compute operations" on Dec 20, 2024
@ypatia ypatia marked this pull request as ready for review December 20, 2024 07:59
@ypatia ypatia requested review from Shelnutt2 and rroelke December 20, 2024 08:04
@rroelke rroelke left a comment

I've left a couple comments/questions, but overall this looks good to me!

@@ -653,7 +653,7 @@ TEST_CASE_METHOD(

// specific relationship for failure not known, but these values
// will result in failure with data being written.
- total_budget_ = "15000";
+ total_budget_ = "40000";
Member

This is a pretty big bump. It's just due to the changes in the various sizeofs, right? Not much that can be done about it, I suppose. Can you quantify the change in memory usage?

Member Author

I am not expecting this to affect overall memory usage, as the memory budget we set constrains, in theory, the peak usage at any point in time. However, it's a 3x increase in the size of the result tile, so it very much affects the number of result tiles we can load in memory at the same time given that constrained budget, hence the number of iterations we'll need to load everything we need from disk, hence performance :(

I need to understand a bit better why this bump is so big, come back with a good explanation, and see if we can reduce it.

Member Author

@ypatia ypatia Jan 14, 2025

First of all, I reverted the changes in a couple of tests in this file that were marked as [.regression] and therefore not run, as they were meant to fail even before my changes. This particular one was one of them.

In general, though, I still needed to adapt the budgets in a couple of tests. My analysis showed that the new member variables we needed to add to the TileBase, Tile, and ResultTile classes as part of this work made the ResultTile class grow to almost double its size. In particular, sizeof(GlobalOrderResultTile<BitmapType>), a derived class of ResultTile, grew from 1576 to 3216 bytes. A big part of this increase was due to the recursive_mutex we added to the Tile class (64 bytes on my Mac). ResultTile stores vectors of TileTuples of Tiles, so that easily adds up.

Another side effect of the members we added is that they might have a different size on different architectures, and this led to an inconsistent number of internal loops in test code that was meant to count them, so I had to disable that test. That test did show something interesting on my Mac, though: loop_num increased from 9 to 20 in that case. This hints that we might end up looping more internally in environments with a restricted memory budget (like the TileDB Server) because of the increase in the size of ResultTile.

I couldn't think of an obvious way to reduce the size of ResultTile in the current design of this PR, so we definitely need to stay vigilant about the performance impact of that increase on different readers and queries.

@rroelke @Shelnutt2 for awareness
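As a quick, platform-dependent way to see where part of that growth comes from (a standalone probe, not TileDB code):

#include <cstdio>
#include <mutex>

int main() {
  // Size is platform/ABI dependent; the 64 bytes mentioned above is what a
  // macOS build reports, and every Tile in every TileTuple of a ResultTile
  // carries one such mutex.
  printf("sizeof(std::recursive_mutex) = %zu\n", sizeof(std::recursive_mutex));
}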

// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a
// lock is held forever while waiting for a result to be available, while
// the next scheduled task is deadlocking on that lock
rc.tile_->wait_all_coords();
Member

I feel a bit like this is the wrong place to call this: it will call it once per cell, but as I understand the changes it should be fine to call it just once per tile.
Hence you can add this after // Find a cell in the current result tile and then also once after GlobalOrderResultCoords rc(&*(rt_it[f]), cell_idx); in the // For all fragments loop of merge_result_cell_slabs.

I don't quite follow you about the deadlock situation - this seems to me like a correctness question. We can't put result tiles into the priority queue until the data is loaded.

Member Author

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

I don't quite follow you about the deadlock situation

I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit sporadically in testing, which is due to the current design of our ThreadPool:

  • Overall this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task will block waiting on the I/O task to be ready, so that it can start unfiltering the result_tile that was read. Any processing that follows, such as merge_result_cell_slabs, will then wait for the unfiltering task to be ready at the point in time that it needs to use the data.
  • Our current, rather naive ThreadPool uses a sort of task-stacking algorithm: when a task running on a thread blocks on another task that is not yet ready, it will try to execute another one from the pool of available tasks on the same thread, so that we don't end up in a situation where all threads are blocked waiting. When that other task is done, it will check whether the previous task is now ready to continue executing.

Now the deadlock scenario I was experiencing with this mutex is the following:

  • Say we have 3 threads in our pool. Thread 1 is executing the main reader code. Thread 2 is running the async I/O task for ResultTile N. Thread 3 is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantime, execution on Thread 1 (the main reader code) reaches the parallel_for in merge_result_cell_slabs. There, a few async tasks are created that need to acquire the lock on tile_queue_mutex_ and then access the unfiltered data of, among others, ResultTile N.
  • Now it's possible that the async task that acquires the lock and wants to access ResultTile N gets scheduled (stacked) on Thread 3, which is blocked. This is our deadlock. Once the I/O task is done, unfiltering on Thread 3 should be unblocked, but this is not possible because it is stacked under the async task that took the mutex, which is itself blocked waiting for that unfiltering task to be done :(

IMO this is a design flaw of our ThreadPool, and it's not the only deadlock situation I've had to cope with in this work. I also had to make compromises in terms of performance elsewhere; for example, we used to unfilter a tile in parallel chunks, but a similar deadlock situation would arise in that case too, exclusively due to that stacking of tasks in the threadpool, which has unexpected side effects such as this one.
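To make that task-stacking behaviour concrete, here is a rough simplification (my own sketch with made-up names, not the actual ThreadPool code) of a wait() that keeps executing other queued tasks on the calling thread while the awaited result is not ready:

#include <chrono>
#include <deque>
#include <functional>
#include <future>
#include <mutex>

// Simplified sketch of a "task-stacking" wait; illustration only.
struct StackingPool {
  std::deque<std::function<void()>> queue_;
  std::mutex queue_mtx_;

  // While the awaited future is not ready, keep running other queued tasks
  // on *this same thread*; the caller's frame stays suspended underneath.
  template <class T>
  T wait(std::shared_future<T> f) {
    while (f.wait_for(std::chrono::milliseconds(0)) !=
           std::future_status::ready) {
      std::function<void()> next;
      {
        std::lock_guard<std::mutex> lg(queue_mtx_);
        if (!queue_.empty()) {
          next = std::move(queue_.front());
          queue_.pop_front();
        }
      }
      // If `next` takes a lock (e.g. tile_queue_mutex_) and then blocks on the
      // very task that is suspended underneath it (this tile's unfiltering),
      // neither frame can make progress: that is the deadlock described above.
      if (next) next();
    }
    return f.get();
  }
};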

Member

To destroy any ambiguity in my initial comment - I definitely agree that it is correct to call wait_all_coords somewhere in the vicinity of this function. My comment is really about the granularity. The task/future based approach does not give us feedback about when the tile is partially filled in, it either is not safely filled in or it is 100% filled in. Hence we only actually need to wait once per tile, not once per cell per tile, and I think the code should reflect that.

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

What I should have written instead of "We can't put result tiles into the priority queue until the data is loaded" was "This change strikes me as one for correctness, not one to do with deadlock".

The change in call site for this should be aligned with the goal of this PR. Both forms will wait for the first result tile from each fragment to be loaded before starting anything, and then subsequent tiles in each fragment will be awaited only once we reach them in the queue.

As for the deadlock itself -

it will try to execute another one from the pool of available tasks on the same thread

Oh yeah, I did see that in a stack trace I ran into earlier today. It just calls the function pointer on the same stack, doesn't it?

The scenario you describe looks like it would be pretty commonplace.

Proper coroutines would be great, wouldn't they!

Member Author

Ah yes, with coroutines we could suspend and resume tasks so problem solved 😍

Member Author

@ypatia ypatia left a comment

Thank you very much for the review, it was very useful. I think I addressed most of the comments but I'll have to follow up on one of them after taking a deeper look.


@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from c5899b7 to 260fc22 on January 7, 2025 14:23
@rroelke
Member

rroelke commented Jan 7, 2025

Thanks for the responses, I will approve once we have a little more info on the memory budget change.

@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch 3 times, most recently from c07ec20 to c8b6de6 on January 23, 2025 14:57
@ypatia
Member Author

ypatia commented Jan 24, 2025

This long-standing PR has suffered from 2 pain points:

  • Instability in CI: ASAN failures and hangs in Windows tests would come and go, which means there were still lifetime and deadlock issues to be addressed.
  • Performance regression in legacy readers, and no improvement in the overall picture of SOMA benchmarks.

I have therefore decided to finally split this work into 2 parts:

  • Part 1 (this PR): removes the need to wait for all I/O to be done and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.
  • Part 2 (future PR): removes the need to wait until a tile is done unfiltering, by using async tasks for unfiltering as well, in order to proceed to result processing and then to copying to the user buffers.

This PR implements part 1; CI is passing reliably and benchmarks show no regression, but no significant speedup either. I think it can be safely merged.

@ypatia
Member Author

ypatia commented Jan 30, 2025

This long-standing PR has suffered from 2 pain points:

  • Instability in CI: ASAN failures and hangs in Windows tests would come and go, which means there were still lifetime and deadlock issues to be addressed.
  • Performance regression in legacy readers, and no improvement in the overall picture of SOMA benchmarks.

I have therefore decided to finally split this work into 2 parts:

  • Part 1 (this PR): removes the need to wait for all I/O to be done and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.
  • Part 2 (future PR): removes the need to wait until a tile is done unfiltering, by using async tasks for unfiltering as well, in order to proceed to result processing and then to copying to the user buffers.

This PR implements part 1; CI is passing reliably and benchmarks show no regression, but no significant speedup either.

@rroelke for re-reviewing/approving
@Shelnutt2 FYI

@rroelke rroelke left a comment

I have just a few more questions/comments before approving but this is almost done!

@@ -32,7 +32,7 @@ include(object_library)
#
commence(object_library filter)
this_target_sources(filter.cc filter_buffer.cc filter_storage.cc)
- this_target_object_libraries(baseline buffer tiledb_crypto)
+ this_target_object_libraries(baseline buffer tiledb_crypto thread_pool)
Member

Is this still needed? The unfiltering will be in part 2, right? Should this also be?

Member Author

Removed

/**
* Whether to block waiting for io data to be ready before accessing data()
*/
const bool skip_waiting_on_io_task_;
Member

TileBase is where this field is declared, yet TileBase itself does not have the I/O tasks. Is there a reason that this is here instead of in class Tile?

Similarly... is there ever a situation where the I/O task is valid and skip_waiting_on_io_task_ is true? Would std::optional<ThreadPool::SharedTask> be a more expressive way of doing the same thing?
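For illustration only (hypothetical types, not the actual Tile or ThreadPool declarations), the difference between the two representations might look like this:

#include <optional>

struct SharedTask {                 // stand-in for ThreadPool::SharedTask
  bool valid() const { return true; }
  void wait() {}
};

// Current shape: a flag plus a task that may or may not be valid.
struct TileWithFlag {
  bool skip_waiting_on_io_task_;
  SharedTask io_task_;
  void ensure_loaded() {
    if (!skip_waiting_on_io_task_ && io_task_.valid())
      io_task_.wait();
  }
};

// Suggested shape: absence of the task itself encodes "nothing to wait on".
struct TileWithOptional {
  std::optional<SharedTask> io_task_;
  void ensure_loaded() {
    if (io_task_)
      io_task_->wait();
  }
};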

Member Author

Sounds like a way better way. Let me try to make it an optional.

Member Author

As for why this is in TileBase and not in Tile, the reason is that it's also needed in WriterTile, which inherits from TileBase.

@rroelke
Member

rroelke commented Feb 3, 2025

I built this branch and ran the tiledb-rs property tests on it and the most complicated of them passed with 20000+ iterations

This test generates a random sequence of writes, inserts them all, and then runs a few read queries to validate that we read back what we wrote.

Hopefully this lends a little extra confidence :)

Next step would be to run the rapidcheck tests from the other branch over here

@ypatia
Member Author

ypatia commented Feb 4, 2025

I built this branch and ran the tiledb-rs property tests on it and the most complicated of them passed with 20000+ iterations

This test generates a random sequence of writes, inserts them all, and then runs a few read queries to validate that we read back what we wrote.

Hopefully this lends a little extra confidence :)

Next step would be to run the rapidcheck tests from the other branch over here

This is great, thank you for taking this initiative!

@rroelke rroelke left a comment

All right, let's merge it!

@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch 2 times, most recently from 2b490ad to 25d2648 on February 6, 2025 11:08
This removes the reader from waiting on all I/O operations and instead
moves the I/O task to be owned by the datablock itself. If the I/O
threadpool task is valid, we block on data access. This lets I/O and
compute be interleaved by only blocking on data when it's ready to be
processed and allows for better background data loading.
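A rough sketch of that ownership pattern (illustrative stand-ins only, not the actual TileDB classes): the block stores its own I/O task, and data() blocks on it only the first time the bytes are actually needed.

#include <future>
#include <utility>
#include <vector>

class DataBlock {
 public:
  explicit DataBlock(std::shared_future<std::vector<char>> io_task)
      : io_task_(std::move(io_task)) {}

  // Block on the I/O task only when the data is actually accessed, so I/O
  // and compute can interleave until the bytes are really needed.
  const std::vector<char>& data() {
    if (io_task_.valid()) {
      bytes_ = io_task_.get();  // waits only for *this* block's I/O
      io_task_ = {};            // subsequent accesses are cheap
    }
    return bytes_;
  }

 private:
  std::shared_future<std::vector<char>> io_task_;
  std::vector<char> bytes_;
};

// Usage idea (hypothetical load_bytes function):
//   DataBlock block(std::async(std::launch::async, load_bytes).share());
//   ... schedule more I/O, do other work ...
//   const auto& bytes = block.data();  // waits only when first read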
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from f9fac4b to b8a54f9 on February 10, 2025 10:58
@ypatia
Member Author

ypatia commented Feb 10, 2025

Smoke testing with benchmarks backed by minio showed no regression (nor speedup), so I am merging it with enough confidence.

@ypatia ypatia merged commit 1ca277c into main Feb 10, 2025
59 checks passed
@ypatia ypatia deleted the yt/sc-59605/dont_block_io branch February 10, 2025 20:38
@teo-tsirpanis
Member

/backport to release-2.27

Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13286609792

Contributor

@teo-tsirpanis backporting to release-2.27 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: Move I/O wait to datablock instead of reader.
Applying: Switch to SharedTask in order to allow multi-threaded access to future state
Applying: Fix unit test compilation
Applying: WIP: parallelize filter pipeline and interleave comparisons
Applying: Switch to ThreadPool::SharedTask instead of shared_ptr
Applying: Add recursive_mutex for thread-safety of tile ThreadPool::SharedTask checking.
Applying: WIP: try to store reference to FilterData on result tile, need to fix forward declaration issues currently
Applying: Adjust lambdas and avoid task components going out of scope.
Applying: Add new data_unsafe to Tile accessorsa.
Applying: Add stats tracking to new tasks for reading and unfiltering tiles
Applying: Fix unit test compilation
Applying: Add new zip_coordinates_unsafe
Applying: Wait until tasks are done before freeing tiles
Applying: Remove redundant shared future get
Applying: Fix null counts, check tasks are valid and other fixes
Applying: Fix RLE and dict decompression
Applying: Fix budget tests, g.o.r. result tile is now 3496 bytes in size
Using index info to reconstruct a base tree...
M	test/src/unit-sparse-global-order-reader.cc
Falling back to patching base and 3-way merge...
Auto-merging test/src/unit-sparse-global-order-reader.cc
CONFLICT (content): Merge conflict in test/src/unit-sparse-global-order-reader.cc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0017 Fix budget tests, g.o.r. result tile is now 3496 bytes in size
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Contributor

@teo-tsirpanis an error occurred while backporting to release-2.27, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

@teo-tsirpanis
Member

/backport to release-2.27

Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13294173181

ypatia added a commit that referenced this pull request Feb 13, 2025
As @ihnorton noticed while auditing the code in
#5401, in the result tile destructor we were waiting on `data()` instead
of `filtered_data()` for the async task to have finished. This was a
leftover from the previous version of the code (before the split between
parts 1 and 2), where we were also unfiltering in an async fashion, so
`data()` was also blocking while waiting for the unfiltering shared
future to be done.

This PR fixes this by waiting just for the I/O task to be done.

---
TYPE: IMPROVEMENT
DESC: Result tile wait_all should block on the async I/O result
ihnorton pushed a commit that referenced this pull request Feb 14, 2025
…ute operations (#5401) (#5451)

Backport of #5401 to release-2.27

---
TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations

---------

Co-authored-by: Seth Shelnutt <[email protected]>
Co-authored-by: Ypatia Tsavliri <[email protected]>
Co-authored-by: Seth Shelnutt <[email protected]>