[WIP/RFC]: Refactor CUDA code out of cuCIM #464
base: branch-23.02
Conversation
result[0] = myy / mu0;
result[1] = result[2] = -mxy / mu0;
result[3] = mxx / mu0;
inertia_tensor_2x2(&mu[0], &result[0]);
FWICT CuPy actually passes `CArray` objects into the kernel. These can be indexed though, which in the C-contiguous case indexes directly into the `data_` pointer. We leverage this here to back out the pointer to the underlying array data, which the kernel then takes for normal operation.
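A minimal sketch of that pattern, assuming a flattened 3x3 moment array and an `ElementwiseKernel` with raw arguments (the device function body, moment indices, and names are illustrative rather than the exact code in this PR):

```python
import cupy as cp

# Device function body shown inline for illustration; in this PR it lives in a
# separate .cu file. Indexing a C-contiguous CArray (e.g. mu[0]) goes straight
# into its data_ pointer, so &mu[0] recovers a plain double* for the function.
_preamble = r"""
__device__ void inertia_tensor_2x2(const double* mu, double* result)
{
    // mu: flattened, C-contiguous 3x3 array of central moments
    double mu0 = mu[0];
    result[0] = mu[2] / mu0;               // myy / mu0
    result[1] = result[2] = -mu[4] / mu0;  // -mxy / mu0
    result[3] = mu[6] / mu0;               // mxx / mu0
}
"""

_kernel = cp.ElementwiseKernel(
    "raw float64 mu",       # raw => passed to the kernel as a CArray
    "raw float64 result",
    "inertia_tensor_2x2(&mu[0], &result[0]);",
    name="inertia_tensor_2x2_wrapper",
    preamble=_preamble,
)

mu = cp.ones(9, dtype=cp.float64)     # must be C-contiguous
result = cp.empty(4, dtype=cp.float64)
_kernel(mu, result, size=1)           # size is required when all args are raw
```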
as long as we make sure to check that the inputs are C-contiguous, this seems okay
There is a fallback for the non-contiguous case.

That said, the C code effectively assumes the data is contiguous (really C-contiguous). We could require that before calling the kernel. Alternatively, we could rewrite the kernel so it is less reliant on this requirement, though I'm not sure how often the more complex cases are needed given this seems to be an internal function. Please feel free to correct me if I'm missing something.
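For illustration, the C-contiguity requirement could be enforced up front with a small helper along these lines (hypothetical, not part of this PR):

```python
import cupy as cp

# Illustrative helper (not from this PR): enforce the kernel's C-contiguity
# assumption up front, copying only when necessary.
def _ensure_c_contiguous(arr):
    return arr if arr.flags.c_contiguous else cp.ascontiguousarray(arr)
```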
@@ -519,20 +520,19 @@ def centroid(image, *, spacing=None):

def _get_inertia_tensor_2x2_kernel():
Suggested change:
- def _get_inertia_tensor_2x2_kernel():
+ @cp.memoize(for_each_device=True)
+ def _get_inertia_tensor_2x2_kernel():
Would we still want to do this if we get `#include` to work (thus not needing to read the file)?
We are now using `#include`. Do we still want this or should we skip it?
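For context, one way an `#include`-based setup can look, here sketched with `cupy.RawModule` (the file name, include directory, kernel signature, and mechanism are assumptions; the PR's actual wiring may differ):

```python
import os
import cupy as cp

# Assumed file name and include directory, purely for illustration.
_CUDA_DIR = os.path.join(os.path.dirname(__file__), "cuda")

_module = cp.RawModule(
    code=r"""
    #include "inertia_tensor_2x2.cu"  // shared device code kept in its own file

    extern "C" __global__ void inertia_tensor_2x2_kernel(const double* mu,
                                                          double* result)
    {
        inertia_tensor_2x2(mu, result);
    }
    """,
    options=("--include-path=" + _CUDA_DIR,),
)
_kernel = _module.get_function("inertia_tensor_2x2_kernel")
```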
I would be okay with removing it, given that the benefit of having it should now be less than a microsecond. No strong opinion either way in this case.
This seems reasonable for quite a few of the kernels we have. Would you want to use this approach for even more trivial cases like the following two?

cucim/python/cucim/src/cucim/skimage/feature/corner.py Lines 357 to 377 in 6c50a4e

A slightly more involved example would be the following, where there are two parameters that affect the generated code. The first part of the kernel is always the same, but the final portion can sort the eigenvalues in ascending or descending order, optionally by absolute value.

cucim/python/cucim/src/cucim/skimage/feature/corner.py Lines 534 to 615 in 6c50a4e

The most complicated cases are in cucim/python/cucim/src/cucim/skimage/_vendored/_ndimage_interp_kernels.py Lines 226 to 510 in 6c50a4e

There, the code is generated dynamically based on many attributes such as border mode, interpolation order, and the number of dimensions, which makes it very flexible but quite hard to follow. I don't see a way to easily adapt those cases to this approach.
Indeed this was my hope and thinking. Glad to hear you agree :)
I'm not sure. That probably also depends on whether this kernel is useful in CUDA itself. I think, as you hint, it is probably not worth it for the trivial cases, but I could be wrong. For more context, I picked this particular kernel since it was an easy one to try this idea out on, but still with enough meat on it to learn a few things along the way. It turned out not to be that difficult, so I am encouraged that we may be able to explore this strategy more.
Thanks for the suggestion! Can take a closer look.
Interesting. Maybe the final portion could be a second templated CUDA function called by the first? (A rough sketch of that idea is at the end of this comment.)
Indeed. It probably can be done. Though may need to resort to
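To illustrate the templated-function idea above, a rough sketch (the function names, the 2x2 eigenvalue case, and the parameterization are purely illustrative, not code from this repo):

```python
import cupy as cp

# Shared portion plus a templated helper for the order-dependent final step.
_preamble = r"""
template <bool ascending, bool use_abs>
__device__ void sort_eigenvalues_2(double* evals)
{
    double a = use_abs ? fabs(evals[0]) : evals[0];
    double b = use_abs ? fabs(evals[1]) : evals[1];
    if (ascending ? (a > b) : (a < b)) {
        double tmp = evals[0];
        evals[0] = evals[1];
        evals[1] = tmp;
    }
}

__device__ void eigenvalues_sym_2x2(const double* m, double* evals)
{
    // Closed-form eigenvalues of a symmetric 2x2 matrix (shared portion).
    double tr = m[0] + m[3];
    double det = m[0] * m[3] - m[1] * m[2];
    double disc = sqrt(fmax(tr * tr / 4.0 - det, 0.0));
    evals[0] = tr / 2.0 + disc;
    evals[1] = tr / 2.0 - disc;
}
"""


@cp.memoize(for_each_device=True)
def _get_eigvals_2x2_kernel(ascending=True, use_abs=False):
    # Only the template arguments of this one call differ between variants;
    # everything else in the generated source stays identical.
    sort_call = "sort_eigenvalues_2<%s, %s>(&evals[0]);" % (
        "true" if ascending else "false",
        "true" if use_abs else "false",
    )
    return cp.ElementwiseKernel(
        "raw float64 m",
        "raw float64 evals",
        "eigenvalues_sym_2x2(&m[0], &evals[0]);\n" + sort_call,
        name="eigvals_sym_2x2",
        preamble=_preamble,
    )
```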
rerun tests (one of the CI jobs hung)
Thanks, this looks good. I agree the `#include` approach is preferable to parsing the file in Python.
Thanks for giving this another look, Greg 🙏 Given we have settled on a working pattern, I may try to do this in a few more places.
(Tries to) address #375
This is a very rough attempt at starting to refactor CUDA code out of Python `str`s and into CUDA files. Starting with just one function to see how this goes and work through any issues that come up. If it is reasonable and proves viable, this could be expanded upon.