
device-linear-multifrag execution mode #648

@akroviakov

Description


As of now, HDK's heterogeneity looks like this:

  • We have X fragments; when we schedule them on a GPU, it receives X kernels and X fragments and executes the kernels sequentially.

The currently disabled multifragment option does the following:

  • We have X fragments; when we schedule them on a GPU, it receives 1 kernel and X fragments and executes that single kernel.

Why the current multifragment mode is good:

  • We avoid repeated administrative work around preparing X kernels, scheduling them and retrieving results.
  • Fine-grained control (allocate/free) over the memory of individual fragments.

Why it is not that good:

  • Currently we place fragments in GPU memory arbitrarily. This is handled in the GPU kernel code, which increases its complexity for the programmer: we cannot simply index into a multifragmented column, and we have to do more checks per tuple (e.g., JoinColumnIterator& operator++() and the related for loops; see the sketch below). This likely costs some performance in the executePlan step, which is often the costliest.

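To make the per-tuple overhead concrete, here is a minimal CUDA-style sketch (all names are illustrative, not HDK's actual classes) contrasting a lookup into a multifragmented column, which must first find the owning fragment, with a lookup into a linearly placed one:

    // Illustrative sketch only; FragmentView and the read_* helpers are
    // hypothetical, not HDK code.
    #include <cstddef>
    #include <cstdint>

    struct FragmentView {
      const int64_t* data;   // device pointer to this fragment's rows
      size_t num_rows;
    };

    // Multifragment access: every lookup scans fragment boundaries first,
    // the kind of bookkeeping JoinColumnIterator-style iteration has to do.
    __device__ int64_t read_multifrag(const FragmentView* frags,
                                      size_t num_frags, size_t row) {
      for (size_t f = 0; f < num_frags; ++f) {
        if (row < frags[f].num_rows) {
          return frags[f].data[row];
        }
        row -= frags[f].num_rows;
      }
      return 0;  // out of range
    }

    // Linear access: fragments were assembled contiguously, so a plain
    // index suffices and the inner loop stays dense.
    __device__ int64_t read_linear(const int64_t* column, size_t row) {
      return column[row];
    }
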
Basic idea:
The best case for the kernel is when all columns are linear (simple loops over dense data). All GPU-assigned fragments have to be materialized anyway, so why not place them linearly?
We can add a new execution mode in which fragments (regardless of their position in host memory) are assembled linearly on the GPU. This way the fragments become transparent to the GPU kernel: it treats them as one big fragment. We still know the address of each fragment and can copy it back to the CPU, but we cannot free individual fragments, only the whole linear buffer. Such behavior is no different from simply using a big fragment size.
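
A minimal sketch of how linear placement could be done, assuming plain CUDA runtime allocation and hypothetical fragment/offset bookkeeping (none of these names are HDK's actual interfaces):

    // Illustrative sketch only.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct HostFragment {
      const int8_t* data;
      size_t size_bytes;
    };

    // Copy all fragments of a column into one contiguous device buffer.
    // offsets[i] records where fragment i starts, so its device address is
    // still known and it can be copied back individually, even though only
    // the whole buffer can be freed.
    int8_t* assembleLinear(const std::vector<HostFragment>& frags,
                           std::vector<size_t>& offsets) {
      size_t total = 0;
      offsets.clear();
      for (const auto& f : frags) {
        offsets.push_back(total);
        total += f.size_bytes;
      }
      int8_t* dev_buf = nullptr;
      cudaMalloc(&dev_buf, total);
      for (size_t i = 0; i < frags.size(); ++i) {
        cudaMemcpy(dev_buf + offsets[i], frags[i].data, frags[i].size_bytes,
                   cudaMemcpyHostToDevice);
      }
      return dev_buf;  // the kernel sees one dense column
    }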

Why is it supposed to be better?

  • Cheap initial data locality
  • Easier indexing into a column -> easier and (likely) faster kernel code

What's the catch?

  • We lose fine-grained memory control on the GPU. But do we always care?
  • What if the next step needs more fragments? We know which fragments are still on the GPU and where they are, so one could avoid using host-to-device bandwidth by memmove/memcpy'ing on the device itself and fetching only the missing ones (see the sketch below). If that is not possible, we pay the full fetching price. Maybe this calls for more sophisticated data structures.
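
A hedged sketch of that idea, assuming the next step needs fragments 2, 3 and 4 and that 2 and 3 are already resident in the old linear buffer (all names, offsets and sizes are hypothetical):

    // Illustrative sketch only.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdint>

    int8_t* relinkForNextStep(int8_t* old_buf, size_t off2, size_t sz2,
                              size_t off3, size_t sz3,
                              const int8_t* host_frag4, size_t sz4) {
      int8_t* new_buf = nullptr;
      cudaMalloc(&new_buf, sz2 + sz3 + sz4);
      // Reuse resident fragments via device-to-device copies: no PCIe traffic.
      cudaMemcpy(new_buf, old_buf + off2, sz2, cudaMemcpyDeviceToDevice);
      cudaMemcpy(new_buf + sz2, old_buf + off3, sz3, cudaMemcpyDeviceToDevice);
      // Only the missing fragment crosses the bus.
      cudaMemcpy(new_buf + sz2 + sz3, host_frag4, sz4, cudaMemcpyHostToDevice);
      cudaFree(old_buf);
      return new_buf;  // new linear buffer for the next step's kernel
    }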

This won't bring any benefit on the CPU: linearizing fragments there is expensive and redundant (memory-wise), and likely pointless anyway, since each CPU thread processes one fragment (which is already linear) at a time.

When does this execution mode make sense:

  • The query is just one step (or several, with the later ones being more compute-intensive).
  • Proportions do not change between steps (this holds in a simple CPU+GPU scenario; the more devices we have, the more likely we are to shuffle fragments between them anyway).
  • Pure plan execution speed matters and the workload is not severely bandwidth-bound.

Fetching chunks takes noticeably less time than executePlan; the possible savings in executePlan might justify a small increase in latency caused by chunk placement.

Summary:
GPUs will be able to treat any number of fragments as one big fragment of arbitrary size, which aligns with the goal of heterogeneity and with the GPU's preferred programming model. But there is a price: memory management granularity and bandwidth. The question is whether the price is justified. After all, it will be just an option, so if bandwidth is the bottleneck, one could simply switch to the current way of multifragment execution. But if one knows that a workload needs only a handful of columns, why not get the most out of the hardware?

The purpose of the related work is to see whether this brings noticeable benefits at all; if not, we will at least know where not to look.

Any tips, suggestions and criticism are welcome.
