C integer types: the missing manual

Here be dragons.

Throughout the scikit are bits and pieces written in Cython, and these commonly use C integers to index into arrays. The Python code also regularly creates arrays of C integers with np.where and other means. Confusion arises every so often over what the correct integer type is, especially where we use integers as indices.

There are various types of C integers. The main ones for our purposes are:

int: the "native" integer type. This once meant the size of a register, but on x86-64 this no longer seems to be true, as int is 32 bits wide (for compatibility with i386?) while the registers and pointers are now 64 bits.
size_t: standard C89 type, defined in <stddef.h> and a variety of other standard C headers. Always unsigned. Large enough to hold the size of any object. This is the type of a C sizeof expression and of the return value of strlen, and it's what functions like malloc, memcpy, strncpy etc. expect for their size arguments. Use when dealing with these functions.
Py_ssize_t: type defined in <Python.h> and declared implicitly in Cython, that can hold the size (in bytes) of the largest object the Python interpreter ever creates. Index type for list. 63 bits + sign on x86-64; in general, the signed counterpart of size_t, with the sign used for negative indices so l[-1] works in C as well. Use when dealing with the CPython API.
np.npy_intp: type defined by NumPy that is always large enough to hold the value of a pointer, like intptr_t in C99. 63 bits + sign on x86-64, and probably always the size of Py_ssize_t, although there's no guarantee. Use when handing pointers to the NumPy C API.
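
To see what these types look like on a given machine, the widths can be inspected from Python. The following is only a rough sketch: it uses ctypes as a stand-in for the C compiler's view, the names sizes, int_max and py_ssize_t_max are made up for illustration, and the NumPy lookup is skipped if NumPy isn't installed.

```python
import ctypes
import sys

# Sizes in bytes of the C types discussed above, as seen by this interpreter.
sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "size_t": ctypes.sizeof(ctypes.c_size_t),
    "Py_ssize_t": ctypes.sizeof(ctypes.c_ssize_t),
    "pointer": ctypes.sizeof(ctypes.c_void_p),
}

# np.intp is the Python-level name for npy_intp; optional here.
try:
    import numpy as np
    sizes["npy_intp"] = np.dtype(np.intp).itemsize
except ImportError:
    pass

for name, nbytes in sizes.items():
    print(f"{name:>10}: {nbytes * 8} bits")

# Largest value a C int can hold (2**31 - 1 where int is 32 bits wide).
int_max = 2 ** (8 * sizes["int"] - 1) - 1

# sys.maxsize is the largest value a Py_ssize_t can hold.
py_ssize_t_max = sys.maxsize
print("INT_MAX        =", int_max)
print("PY_SSIZE_T_MAX =", py_ssize_t_max)
```

On a typical x86-64 build this should report a 32-bit int next to 64-bit size_t, Py_ssize_t and pointers, which is exactly the mismatch described above.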

Now to confound matters:

BLAS uses int for all its integers, except (in ATLAS) the return value from certain functions, which is size_t. It follows that when you call BLAS, you shouldn't expect to be able to handle array dimensions ≥2³¹, i.e. larger than INT_MAX (from <limits.h>). If this doesn't sound like much of a problem, consider the fast way to compute the squared Frobenius norm of a matrix: np.dot(X.ravel(), X.ravel()). The ravel'd array has only one dimension, which may be ≥2³¹. This is no problem for NumPy's array data structure, but dot may call cblas_ddot, which cannot handle the large array. Most likely, it'll go into an infinite loop, but this depends on the implementation.
scipy.sparse uses index arrays of type int to represent matrices in COO, CSC and CSR formats, so it has much the same limitation as BLAS. n_samples, n_features and the number of non-zero entries are all three limited to 2³¹-1. There's a PR to fix this by making sparse matrices switch to a larger index type when needed, but it hasn't been merged yet; when it is, we'll probably need to use fused types in all the sparse matrix-handling Cython code.
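
One defensive pattern, given the int limits above, is to check a length against INT_MAX before handing it to int-based code, and to split the work into int-sized pieces when it doesn't fit. What follows is a minimal sketch, not something the scikit does: fits_c_int and int_sized_chunks are hypothetical helper names, and INT_MAX is derived through ctypes rather than read from <limits.h>.

```python
import ctypes

# Largest value of a C int on this platform (2**31 - 1 where int is 32 bits).
INT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1


def fits_c_int(n):
    """Return True if the non-negative count n can be passed as a C int."""
    return 0 <= n <= INT_MAX


def int_sized_chunks(n, max_len=INT_MAX):
    """Yield (start, stop) pairs covering range(n), each at most max_len long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, n, max_len):
        yield start, min(start + max_len, n)


# A length one past INT_MAX does not fit, but splits into two pieces that do.
n = INT_MAX + 1
print(fits_c_int(n))
print(list(int_sized_chunks(n)))
```

With NumPy, the squared Frobenius norm could then be accumulated as a sum of np.dot(x[a:b], x[a:b]) over the (a, b) pairs, where x = X.ravel(), so that no single call sees a dimension larger than INT_MAX.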

-- larsmans
