Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP/RFC]: Refactor CUDA code out of cuCIM #464

Draft
wants to merge 3 commits into
base: branch-23.02
Choose a base branch
from

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Dec 6, 2022

(Tries to) address #375

This is a very rough attempt at starting to refactor CUDA code out of Python strs and into CUDA files. Starting with just one function to see how this goes and work through any issues that come up. If it is reasonable and proves viable, this could be expanded upon.

@jakirkham jakirkham added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Dec 6, 2022
result[0] = myy / mu0;
result[1] = result[2] = -mxy / mu0;
result[3] = mxx / mu0;
inertia_tensor_2x2(&mu[0], &result[0]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWICT CuPy actually passes CArray objects into the kernel. These can be indexed though, which in the C-contiguous case indexes directly into the data_ pointer. We leverage this here to back out the pointer to underlying array data, which the kernel then takes for normal operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we make sure to check that the inputs are C-contiguous, this seems okay

Copy link
Member Author

@jakirkham jakirkham Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a fallback for the non-contiguous case

That said, the C code effectively assumes the data is contiguous somehow (really C-contiguous). We could require that before calling the kernel. Alternatively we could rewrite the kernel so it is less reliant on this requirement. Though not sure how often more complex cases are needed given this seems to be an internal function. Please feel free to correct me if I'm missing something

@@ -519,20 +520,19 @@ def centroid(image, *, spacing=None):


def _get_inertia_tensor_2x2_kernel():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_inertia_tensor_2x2_kernel():
@cp.memoize(for_each_device=True)
def _get_inertia_tensor_2x2_kernel():

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we still want to do this if we get #include to work (thus not needing to read the file)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are now using #include. Do we still want this or should we skip it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be okay with removing it, given that it should now be <1 microsecond benefit to having it. No strong opinion either way in this case.

@grlee77
Copy link
Contributor

grlee77 commented Dec 6, 2022

This seems reasonable for quite a few of the kernels we have. Would you want to use this approach for even more trivial cases like the following two?:

@cp.memoize(for_each_device=True)
def _get_real_symmetric_2x2_det_kernel():
return cp.ElementwiseKernel(
in_params="F M00, F M01, F M11",
out_params="F det",
operation="det = M00 * M11 - M01 * M01;",
name="cucim_skimage_symmetric_det22_kernel")
@cp.memoize(for_each_device=True)
def _get_real_symmetric_3x3_det_kernel():
operation = """
det = M00 * (M11 * M22 - M12 * M12);
det -= M01 * (M01 * M22 - M12 * M02);
det += M02 * (M01 * M12 - M11 * M02);
"""
return cp.ElementwiseKernel(
in_params="F M00, F M01, F M02, F M11, F M12, F M22",
out_params="F det",
operation=operation,
name="cucim_skimage_symmetric_det33_kernel")

A slightly more involved example would be the following where there are two parameters that affect the generated code. The first part of the kernel is always the same, but the final portion can sort the eigenvalues in ascending or descending value, optionally via absolute values.

@cp.memoize(for_each_device=True)
def _get_real_symmetric_3x3_eigvals_kernel(sort='ascending', abs_sort=False):
operation = """
double x1, x2, phi;
double a = static_cast<double>(aa);
double b = static_cast<double>(bb);
double c = static_cast<double>(cc);
double d = static_cast<double>(dd);
double e = static_cast<double>(ee);
double f = static_cast<double>(ff);
double d_sq = d * d;
double e_sq = e * e;
double f_sq = f * f;
double tmpa = (2*a - b - c);
double tmpb = (2*b - a - c);
double tmpc = (2*c - a - b);
x2 = - tmpa * tmpb * tmpc;
x2 += 9 * (tmpc*d_sq + tmpb*f_sq + tmpa*e_sq);
x2 -= 54 * (d * e * f);
x1 = a*a + b*b + c*c - a*b - a*c - b*c + 3 * (d_sq + e_sq + f_sq);
// grlee77: added max() here for numerical stability
// (avoid NaN values in ridge filter test cases)
x1 = max(x1, 0.0);
if (x2 == 0.0) {
phi = M_PI / 2.0;
} else {
// grlee77: added max() here for numerical stability
// (avoid NaN values in test_hessian_matrix_eigvals_3d)
double arg = max(4*x1*x1*x1 - x2*x2, 0.0);
phi = atan(sqrt(arg)/x2);
if (x2 < 0) {
phi += M_PI;
}
}
double x1_term = (2.0 / 3.0) * sqrt(x1);
double abc = (a + b + c) / 3.0;
lam1 = abc - x1_term * cos(phi / 3.0);
lam2 = abc + x1_term * cos((phi - M_PI) / 3.0);
lam3 = abc + x1_term * cos((phi + M_PI) / 3.0);
"""
sort_template = """
F stmp;
if ({prefix}{var1} > {prefix}{var2}) {{
stmp = {var2};
{var2} = {var1};
{var1} = stmp;
}} if ({prefix}{var1} > {prefix}{var3}) {{
stmp = {var3};
{var3} = {var1};
{var1} = stmp;
}} if ({prefix}{var2} > {prefix}{var3}) {{
stmp = {var3};
{var3} = {var2};
{var2} = stmp;
}}
"""
if abs_sort:
operation += """
F abs_lam1 = abs(lam1);
F abs_lam2 = abs(lam2);
F abs_lam3 = abs(lam3);
"""
prefix = "abs_"
else:
prefix = ""
if sort == 'ascending':
var1 = "lam1"
var3 = "lam3"
elif sort == 'descending':
var1 = "lam3"
var3 = "lam1"
operation += sort_template.format(
prefix=prefix, var1=var1, var2="lam2", var3=var3
)
return cp.ElementwiseKernel(
in_params="F aa, F bb, F cc, F dd, F ee, F ff",
out_params="F lam1, F lam2, F lam3",
operation=operation,
name="cucim_skimage_symmetric_eig33_kernel")

The most complicated are in the skimage/_vendored folder such as this one used by transforms like rotate, warp, resize, etc.:

def _generate_interp_custom(coord_func, ndim, large_int, yshape, mode, cval,
order, name='', integer_output=False, nprepad=0,
omit_in_coord=False):
"""
Args:
coord_func (function): generates code to do the coordinate
transformation. See for example, `_get_coord_shift`.
ndim (int): The number of dimensions.
large_int (bool): If true use Py_ssize_t instead of int for indexing.
yshape (tuple): Shape of the output array.
mode (str): Signal extension mode to use at the array boundaries
cval (float): constant value used when `mode == 'constant'`.
name (str): base name for the interpolation kernel
integer_output (bool): boolean indicating whether the output has an
integer type.
nprepad (int): integer indicating the amount of prepadding at the
boundaries.
Returns:
operation (str): code body for the ElementwiseKernel
name (str): name for the ElementwiseKernel
"""
ops = []
internal_dtype = 'double' if integer_output else 'Y'
ops.append(f'{internal_dtype} out = 0.0;')
if large_int:
uint_t = 'size_t'
int_t = 'ptrdiff_t'
else:
uint_t = 'unsigned int'
int_t = 'int'
# determine strides for x along each axis
for j in range(ndim):
ops.append(f'const {int_t} xsize_{j} = x.shape()[{j}];')
ops.append(f'const {uint_t} sx_{ndim - 1} = 1;')
for j in range(ndim - 1, 0, -1):
ops.append(f'const {uint_t} sx_{j - 1} = sx_{j} * xsize_{j};')
if not omit_in_coord:
# create in_coords array to store the unraveled indices
ops.append(_unravel_loop_index(yshape, uint_t))
# compute the transformed (target) coordinates, c_j
ops = ops + coord_func(ndim, nprepad)
if cval is numpy.nan:
cval = '(Y)CUDART_NAN'
elif cval == numpy.inf:
cval = '(Y)CUDART_INF'
elif cval == -numpy.inf:
cval = '(Y)(-CUDART_INF)'
else:
cval = f'({internal_dtype}){cval}'
if mode == 'constant':
# use cval if coordinate is outside the bounds of x
_cond = ' || '.join(
[f'(c_{j} < 0) || (c_{j} > xsize_{j} - 1)' for j in range(ndim)])
ops.append(f'''
if ({_cond})
{{
out = {cval};
}}
else
{{''')
if order == 0:
if mode == 'wrap':
ops.append('double dcoord;') # mode 'wrap' requires this to work
for j in range(ndim):
# determine nearest neighbor
if mode == 'wrap':
ops.append(f'''
dcoord = c_{j};''')
else:
ops.append(f'''
{int_t} cf_{j} = ({int_t})floor((double)c_{j} + 0.5);''')
# handle boundary
if mode != 'constant':
if mode == 'wrap':
ixvar = 'dcoord'
float_ix = True
else:
ixvar = f'cf_{j}'
float_ix = False
ops.append(
_util._generate_boundary_condition_ops(
mode, ixvar, f'xsize_{j}', int_t, float_ix))
if mode == 'wrap':
ops.append(f'''
{int_t} cf_{j} = ({int_t})floor(dcoord + 0.5);''')
# sum over ic_j will give the raveled coordinate in the input
ops.append(f'''
{int_t} ic_{j} = cf_{j} * sx_{j};''')
_coord_idx = ' + '.join([f'ic_{j}' for j in range(ndim)])
if mode == 'grid-constant':
_cond = ' || '.join([f'(ic_{j} < 0)' for j in range(ndim)])
ops.append(f'''
if ({_cond}) {{
out = {cval};
}} else {{
out = ({internal_dtype})x[{_coord_idx}];
}}''')
else:
ops.append(f'''
out = ({internal_dtype})x[{_coord_idx}];''')
elif order == 1:
for j in range(ndim):
# get coordinates for linear interpolation along axis j
ops.append(f'''
{int_t} cf_{j} = ({int_t})floor((double)c_{j});
{int_t} cc_{j} = cf_{j} + 1;
{int_t} n_{j} = (c_{j} == cf_{j}) ? 1 : 2; // points needed
''')
if mode == 'wrap':
ops.append(f'''
double dcoordf = c_{j};
double dcoordc = c_{j} + 1;''')
else:
# handle boundaries for extension modes.
ops.append(f'''
{int_t} cf_bounded_{j} = cf_{j};
{int_t} cc_bounded_{j} = cc_{j};''')
if mode != 'constant':
if mode == 'wrap':
ixvar = 'dcoordf'
float_ix = True
else:
ixvar = f'cf_bounded_{j}'
float_ix = False
ops.append(
_util._generate_boundary_condition_ops(
mode, ixvar, f'xsize_{j}', int_t, float_ix))
ixvar = 'dcoordc' if mode == 'wrap' else f'cc_bounded_{j}'
ops.append(
_util._generate_boundary_condition_ops(
mode, ixvar, f'xsize_{j}', int_t, float_ix))
if mode == 'wrap':
ops.append(
f'''
{int_t} cf_bounded_{j} = ({int_t})floor(dcoordf);;
{int_t} cc_bounded_{j} = ({int_t})floor(dcoordf + 1);;
'''
)
ops.append(f'''
for (int s_{j} = 0; s_{j} < n_{j}; s_{j}++)
{{
W w_{j};
{int_t} ic_{j};
if (s_{j} == 0)
{{
w_{j} = (W)cc_{j} - c_{j};
ic_{j} = cf_bounded_{j} * sx_{j};
}} else
{{
w_{j} = c_{j} - (W)cf_{j};
ic_{j} = cc_bounded_{j} * sx_{j};
}}''')
elif order > 1:
if mode == 'grid-constant':
spline_mode = 'constant'
elif mode == 'nearest':
spline_mode = 'nearest'
else:
spline_mode = _spline_prefilter_core._get_spline_mode(mode)
# wx, wy are temporary variables used during spline weight computation
ops.append(f'''
W wx, wy;
{int_t} start;''')
for j in range(ndim):
# determine weights along the current axis
ops.append(f'''
W weights_{j}[{order + 1}];''')
ops.append(spline_weights_inline[order].format(j=j, order=order))
# get starting coordinate for spline interpolation along axis j
if mode in ['wrap']:
ops.append(f'double dcoord = c_{j};')
coord_var = 'dcoord'
ops.append(
_util._generate_boundary_condition_ops(
mode, coord_var, f'xsize_{j}', int_t, True))
else:
coord_var = f'(double)c_{j}'
if order & 1:
op_str = '''
start = ({int_t})floor({coord_var}) - {order_2};'''
else:
op_str = '''
start = ({int_t})floor({coord_var} + 0.5) - {order_2};'''
ops.append(
op_str.format(
int_t=int_t, coord_var=coord_var, order_2=order // 2
))
# set of coordinate values within spline footprint along axis j
ops.append(f'''{int_t} ci_{j}[{order + 1}];''')
for k in range(order + 1):
ixvar = f'ci_{j}[{k}]'
ops.append(f'''
{ixvar} = start + {k};''')
ops.append(
_util._generate_boundary_condition_ops(
spline_mode, ixvar, f'xsize_{j}', int_t))
# loop over the order + 1 values in the spline filter
ops.append(f'''
W w_{j};
{int_t} ic_{j};
for (int k_{j} = 0; k_{j} <= {order}; k_{j}++)
{{
w_{j} = weights_{j}[k_{j}];
ic_{j} = ci_{j}[k_{j}] * sx_{j};
''')
if order > 0:
_weight = ' * '.join([f'w_{j}' for j in range(ndim)])
_coord_idx = ' + '.join([f'ic_{j}' for j in range(ndim)])
if mode == 'grid-constant' or (order > 1 and mode == 'constant'):
_cond = ' || '.join([f'(ic_{j} < 0)' for j in range(ndim)])
ops.append(f'''
if ({_cond}) {{
out += {cval} * ({internal_dtype})({_weight});
}} else {{
{internal_dtype} val = ({internal_dtype})x[{_coord_idx}];
out += val * ({internal_dtype})({_weight});
}}''')
else:
ops.append(f'''
{internal_dtype} val = ({internal_dtype})x[{_coord_idx}];
out += val * ({internal_dtype})({_weight});''')
ops.append('}' * ndim)
if mode == 'constant':
ops.append('}')
if integer_output:
ops.append('y = (Y)rint((double)out);')
else:
ops.append('y = (Y)out;')
operation = '\n'.join(ops)
mode_str = mode.replace('-', '_') # avoid hyphen in kernel name
name = 'cupyx_scipy_ndimage_interpolate_{}_order{}_{}_{}d_y{}'.format(
name, order, mode_str, ndim, '_'.join([f'{j}' for j in yshape]),
)
if uint_t == 'size_t':
name += '_i64'
return operation, name
@cupy._util.memoize(for_each_device=True)
def _get_map_kernel(ndim, large_int, yshape, mode, cval=0.0, order=1,
integer_output=False, nprepad=0):
in_params = 'raw X x, raw W coords'
out_params = 'Y y'
operation, name = _generate_interp_custom(
coord_func=_get_coord_map,
ndim=ndim,
large_int=large_int,
yshape=yshape,
mode=mode,
cval=cval,
order=order,
name='map',
integer_output=integer_output,
nprepad=nprepad,
omit_in_coord=True, # input image coordinates are not needed
)
return cupy.ElementwiseKernel(in_params, out_params, operation, name,
preamble=math_constants_preamble)

There, the code is generated dynamically based on many attributes such as border mode, interpolation order and the number of dimensions which makes it very flexible, but quite hard to follow. I don't see a way to easily adapt those cases to this approach.

@jakirkham
Copy link
Member Author

This seems reasonable for quite a few of the kernels we have.

Indeed this was my hope and thinking. Glad to hear you agree :)

Would you want to use this approach for even more trivial cases like the following two?

I'm not sure. That probably also depends whether this kernel is useful in CUDA itself. Think, as you hint, it is probably not worth it for the trivial cases, but could be wrong.

For more context picked this particular kernel since it was an easy one to try this idea out on, but still with enough meat on it to learn a few things along the way. It turned out to not be that difficult. So am encouraged we may be able to explore this strategy more.

A slightly more involved example would be the following where there are two parameters that affect the generated code.

Thanks for the suggestion! Can take a closer look.

The first part of the kernel is always the same, but the final portion can sort the eigenvalues in ascending or descending value, optionally via absolute values.

Interesting. Maybe the final portion could be a second templated CUDA function called by the first?

There, the code is generated dynamically based on many attributes such as border mode, interpolation order and the number of dimensions which makes it very flexible, but quite hard to follow. I don't see a way to easily adapt those cases to this approach.

Indeed. It probably can be done. Though may need to resort to #define macros. That being said, as it is not low hanging fruit, it is probably best to save until the end or handle separately. Unless there is a compelling need to move it over to CUDA sooner.

@jakirkham jakirkham requested a review from grlee77 December 7, 2022 23:43
@jakirkham
Copy link
Member Author

rerun tests

(one of the CI jobs hung)

Copy link
Contributor

@grlee77 grlee77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good. I agree the #include approach is preferable to parsing the file in Python.

@jakirkham
Copy link
Member Author

Thanks for giving this another look Greg 🙏

Given we have settled on a working pattern, may try to do this in a few more places

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants