
device-linear-multifrag execution mode #648

@akroviakov

Description


As of now, HDK's heterogeneity looks like this:

  • We have X fragments; when we schedule them on a GPU, it receives X kernels and X fragments and executes the kernels sequentially.

The currently disabled multifragment option does the following:

  • We have X fragments; when we schedule them on a GPU, it receives 1 kernel and X fragments and executes that single kernel.

Why the current multifragment mode is good:

  • We avoid repeated administrative work around preparing X kernels, scheduling them and retrieving results.
  • Fine-grained control (allocate/free) over the memory of individual fragments.

Why it is not that good:

  • Currently we place fragments in GPU memory arbitrarily. This is handled in the GPU kernel code, which increases its complexity for the programmer: we cannot simply index into a multifragmented column, and we have to do more checks per tuple (e.g., JoinColumnIterator& operator++() and the related for loops; see the sketch below). This likely costs some performance in the executePlan step, which is often the costliest.

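To make the per-tuple overhead concrete, here is a minimal CUDA-style sketch (all names are illustrative, not HDK's actual classes) contrasting a lookup into a multifragmented column, which must first find the owning fragment, with a lookup into a linearly placed one:

    // Illustrative sketch only; FragmentView and the read_* helpers are
    // hypothetical, not HDK code.
    #include <cstddef>
    #include <cstdint>

    struct FragmentView {
      const int64_t* data;   // device pointer to this fragment's rows
      size_t num_rows;
    };

    // Multifragment access: every lookup scans fragment boundaries first,
    // the kind of bookkeeping JoinColumnIterator-style iteration has to do.
    __device__ int64_t read_multifrag(const FragmentView* frags,
                                      size_t num_frags, size_t row) {
      for (size_t f = 0; f < num_frags; ++f) {
        if (row < frags[f].num_rows) {
          return frags[f].data[row];
        }
        row -= frags[f].num_rows;
      }
      return 0;  // out of range
    }

    // Linear access: fragments were assembled contiguously, so a plain
    // index suffices and the inner loop stays dense.
    __device__ int64_t read_linear(const int64_t* column, size_t row) {
      return column[row];
    }
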
Basic idea:
The best case for the kernel is when all columns are linear (simple loops over dense data). All GPU-assigned fragments have to be materialized anyway, so why not place them linearly?
We can add a new execution mode in which fragments (regardless of their position in host memory) are assembled linearly on the GPU. This way the fragments become transparent to the GPU kernel: it treats them as one big fragment. We still know the address of each fragment and can copy it back to the CPU, but we cannot free individual fragments, only the whole linear buffer. Such behavior is no different from simply using a big fragment size.
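
A minimal sketch of how linear placement could be done, assuming plain CUDA runtime allocation and hypothetical fragment/offset bookkeeping (none of these names are HDK's actual interfaces):

    // Illustrative sketch only.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct HostFragment {
      const int8_t* data;
      size_t size_bytes;
    };

    // Copy all fragments of a column into one contiguous device buffer.
    // offsets[i] records where fragment i starts, so its device address is
    // still known and it can be copied back individually, even though only
    // the whole buffer can be freed.
    int8_t* assembleLinear(const std::vector<HostFragment>& frags,
                           std::vector<size_t>& offsets) {
      size_t total = 0;
      offsets.clear();
      for (const auto& f : frags) {
        offsets.push_back(total);
        total += f.size_bytes;
      }
      int8_t* dev_buf = nullptr;
      cudaMalloc(&dev_buf, total);
      for (size_t i = 0; i < frags.size(); ++i) {
        cudaMemcpy(dev_buf + offsets[i], frags[i].data, frags[i].size_bytes,
                   cudaMemcpyHostToDevice);
      }
      return dev_buf;  // the kernel sees one dense column
    }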

Why is it supposed to be better?

  • Cheap initial data locality
  • Easier indexing into a column -> easier and (likely) faster kernel code

What's the catch?

  • We lose fine-grained memory control on the GPU. But do we always care?
  • What if the next step needs more fragments? We know which fragments are still on the GPU and where they are, so one could avoid using host-to-device bandwidth by memmove/memcpy'ing on the device itself and fetching only the missing ones (see the sketch below). If that is not possible, we pay the full fetching price. Maybe this calls for more sophisticated data structures.
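
A hedged sketch of that idea, assuming the next step needs fragments 2, 3 and 4 and that 2 and 3 are already resident in the old linear buffer (all names, offsets and sizes are hypothetical):

    // Illustrative sketch only.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdint>

    int8_t* relinkForNextStep(int8_t* old_buf, size_t off2, size_t sz2,
                              size_t off3, size_t sz3,
                              const int8_t* host_frag4, size_t sz4) {
      int8_t* new_buf = nullptr;
      cudaMalloc(&new_buf, sz2 + sz3 + sz4);
      // Reuse resident fragments via device-to-device copies: no PCIe traffic.
      cudaMemcpy(new_buf, old_buf + off2, sz2, cudaMemcpyDeviceToDevice);
      cudaMemcpy(new_buf + sz2, old_buf + off3, sz3, cudaMemcpyDeviceToDevice);
      // Only the missing fragment crosses the bus.
      cudaMemcpy(new_buf + sz2 + sz3, host_frag4, sz4, cudaMemcpyHostToDevice);
      cudaFree(old_buf);
      return new_buf;  // new linear buffer for the next step's kernel
    }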

This won't bring any benefit on the CPU: linearizing fragments there is expensive and redundant (memory-wise), and likely pointless anyway, since each CPU thread processes one fragment (which is already linear) at a time.

When does this execution mode make sense:

  • The query is just one step (or several, with the later ones being more compute-intensive).
  • Proportions do not change between steps (this holds in a simple CPU+GPU scenario; the more devices we have, the more likely we are to shuffle fragments between them anyway).
  • Pure plan execution speed matters and the workload is not severely bandwidth-bound.

Fetching chunks takes noticeably less time than executePlan; the possible savings in executePlan might justify a small increase in latency caused by chunk placement.

Summary:
GPUs will be able to treat any number of fragments as one big fragment of arbitrary size, which aligns with the goal of heterogeneity and with the GPU's preferred programming model. But there is a price: memory management granularity and bandwidth. The question is whether the price is justified. After all, it will be just an option, so if bandwidth is the bottleneck, one could simply switch to the current way of multifragment execution. But if one knows that a workload needs only a handful of columns, why not get the most out of the hardware?

The purpose of the related work is to see whether this brings noticeable benefits at all; if not, we will at least know where not to look.

Any tips, suggestions and criticism are welcome.
