HAL server failing with large kernels #514

Open
ikabadzhov opened this issue Aug 5, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@ikabadzhov

Version

latest master / 4.0.0

What behaviour are you expecting?

I was reproducing the client-server setup using the HAL server, as in https://github.com/codeplaysoftware/oneapi-construction-kit/tree/main/examples/hal_cpu_remote_server, when I noticed that my large kernels error out on both of my RISC-V devices. I am sure both devices have sufficient memory, and the allocation itself succeeds as expected.

What actual behaviour are you seeing?

I am seeing the following from the local client (the first run behaves as expected):

$ HAL_REMOTE_PORT=5906 ./test $((1<<25))
Running on ock cpu
Allocated 128 MB

$ HAL_REMOTE_PORT=5906 ./test $((1<<26))
Running on ock cpu
Allocated 256 MB
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)

On the RISC-V server, I get a segfault shortly afterwards: "Segmentation fault (core dumped)". After that, attempting to restart the server on the same port fails with "Unable to start server on requested port 5906, node 127.0.0.1".

On the other hand, an empty kernel, or no kernel at all, works fine.
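The two client runs above differ only in buffer size, so it may help to spell out the arithmetic behind the "Allocated ... MB" lines. The sketch below is a standalone illustration, not part of the reproducer: the helper names `buffer_bytes`, `buffer_mib`, and `fits` are hypothetical, and `limit_bytes` stands in for whatever memory budget the server side is actually configured with, which this report does not establish.

```cpp
#include <cstdint>

// Size in bytes of a device buffer holding `len` floats.
constexpr std::uint64_t buffer_bytes(std::uint64_t len) {
  return len * sizeof(float);
}

// The same size in MiB, computed like the reproducer's "Allocated ... MB" line.
constexpr std::uint64_t buffer_mib(std::uint64_t len) {
  return buffer_bytes(len) / 1024 / 1024;
}

// True if a buffer of `len` floats fits in a budget of `limit_bytes`.
// `limit_bytes` is an assumed parameter, not a value taken from the HAL.
constexpr bool fits(std::uint64_t len, std::uint64_t limit_bytes) {
  return buffer_bytes(len) <= limit_bytes;
}

// Cross-check against the two runs shown above (assumes a 4-byte float).
static_assert(buffer_mib(1ull << 25) == 128, "size of the passing run");
static_assert(buffer_mib(1ull << 26) == 256, "size of the failing run");
```

With these helpers, the reproducer's default of `1 << 28` elements corresponds to 1024 MiB, so the default run requests four times the size that already fails.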

What steps are required to reproduce the bug?

To reproduce, on the client side:

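#include <sycl/sycl.hpp>

#include <iostream>
#include <string>

int main(int argc, char **argv) {
  // Number of float elements; can be overridden by the first argument.
  unsigned long long len = 1ULL << 28;
  if (argc > 1) {
    len = std::stoull(argv[1]);
  }

  sycl::queue queue(sycl::accelerator_selector_v);
  std::cout << "Running on " << queue.get_device().get_info<sycl::info::device::name>() << std::endl;

  // The device allocation succeeds, even for the sizes that fail later.
  float *d_a = sycl::malloc_device<float>(len, queue);
  queue.wait();
  std::cout << "Allocated " << len * sizeof(float) / 1024 / 1024 << " MB" << std::endl;

  // Writing to every element is what triggers the failure for large len.
  queue.parallel_for(sycl::range<1>(len), [=](sycl::id<1> idx) {
    d_a[idx] = idx;
  }).wait();

  sycl::free(d_a, queue);
  return 0;
}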

On the server, simply listen on a port as usual.

Minimal test case

No response

Anything else we should know?

No response

@ikabadzhov ikabadzhov added the bug Something isn't working label Aug 5, 2024
@coldav
Collaborator

coldav commented Jan 24, 2025

Hi, sorry for the delay. The server sits on top of the CPU HAL, which just sets a fixed amount of memory. Could you try increasing the amount here: https://github.com/uxlfoundation/oneapi-construction-kit/blob/main/clik/external/hal_cpu/source/cpu_hal.cpp#L40? We can look at ways of making this programmatic, or possibly querying the system, but it would be good to check this first.
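As one possible reading of "making this programmatic", here is a minimal sketch of taking the memory size from an environment variable, with a fallback to the built-in value. Everything in it is an assumption for illustration: the variable name `HAL_CPU_MEM_MB` and both helper functions are hypothetical and do not exist in the construction kit, and the fallback is passed in by the caller because this thread does not state the current fixed value.

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>

// Parse a size given in MiB from `text` and return it in bytes.
// Returns `fallback_bytes` if `text` is null, empty, not a plain
// decimal number, zero, or too large to express in bytes.
inline std::uint64_t mem_bytes_from_text(const char *text,
                                         std::uint64_t fallback_bytes) {
  if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
    return fallback_bytes;
  }
  char *end = nullptr;
  const unsigned long long mib = std::strtoull(text, &end, 10);
  if (*end != '\0' || mib == 0 || mib > (UINT64_MAX >> 20)) {
    return fallback_bytes;
  }
  return static_cast<std::uint64_t>(mib) << 20;
}

// Read the size from the hypothetical HAL_CPU_MEM_MB variable,
// keeping the caller-supplied default when it is unset or invalid.
inline std::uint64_t configured_mem_bytes(std::uint64_t fallback_bytes) {
  return mem_bytes_from_text(std::getenv("HAL_CPU_MEM_MB"), fallback_bytes);
}
```

Falling back rather than failing on a malformed value keeps the existing behaviour for anyone who does not set the variable. Querying the system for available memory would be an alternative, at the cost of platform-specific code.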
