C integer types: the missing manual
Here be dragons.
Throughout the scikit are bits and pieces written in Cython, and these commonly use C integers to index into arrays. Also, the Python code regularly creates arrays of C integers via np.where and other means. Confusion arises every so often over what the correct integer type is, especially where we use integers as indices.
There are various C integer types; for our purposes, the main ones are:
- int: the "native" integer type. This once meant the size of a register, but on x86-64 that no longer holds: int is 32 bits wide, while registers and pointers are 64 bits.
- size_t: standard C89 type, defined in <stddef.h> and several other standard C headers. Always unsigned. Large enough to hold the size of any object. This is the type of a C sizeof expression and of the return value of strlen, and it's what functions like malloc, memcpy and strncpy expect for sizes. Use when dealing with these functions.
- Py_ssize_t: type defined in <Python.h> and declared implicitly in Cython, that can hold the size (in bytes) of the largest object the Python interpreter ever creates. Index type for list. 63 bits + sign on x86-64; in general, the signed counterpart of size_t, with the sign used for negative indices so l[-1] works in C as well. Use when dealing with the CPython API.
- np.npy_intp: type defined by the NumPy Cython module that is always large enough to hold the value of a pointer, like intptr_t in C99. 63 bits + sign on x86-64, and probably always the size of Py_ssize_t, although there's no guarantee. Use for indices into NumPy arrays; the NumPy C API expects this type.
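To see how these types size up on a given platform, a minimal sketch along the following lines may help. It assumes NumPy and the standard ctypes module are available; np.intc and np.intp are taken as NumPy's counterparts of C int and npy_intp, and ctypes.c_size_t and ctypes.c_ssize_t as stand-ins for size_t and Py_ssize_t.

```python
import ctypes
import sys

import numpy as np

# Sizes in bytes of the integer types discussed above, as seen from Python.
sizes = {
    "int": np.dtype(np.intc).itemsize,              # C int
    "size_t": ctypes.sizeof(ctypes.c_size_t),       # unsigned, result of sizeof
    "Py_ssize_t": ctypes.sizeof(ctypes.c_ssize_t),  # signed counterpart of size_t
    "npy_intp": np.dtype(np.intp).itemsize,         # NumPy's index type
    "void *": ctypes.sizeof(ctypes.c_void_p),       # pointer size, for comparison
}

for name, nbytes in sizes.items():
    print(f"{name:>10}: {8 * nbytes} bits")

# sys.maxsize is the largest value a Py_ssize_t can hold.
assert sys.maxsize == 2 ** (8 * sizes["Py_ssize_t"] - 1) - 1
```

The printed widths depend on the platform, which is the point: code should not hardcode any of them.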
Now to confound matters:
- BLAS uses int for all its integers, except (in ATLAS) the return value from certain functions, which is size_t. It follows that when you call BLAS, you shouldn't expect to be able to handle array dimensions ≥ 2³¹, i.e. anything above INT_MAX (from <limits.h>), which is 2³¹-1 for a 32-bit int. If this doesn't sound like much of a problem, consider the fast way to compute the squared Frobenius norm of a matrix: np.dot(X.ravel(), X.ravel()). The ravel'd array has only one dimension, which may be ≥ 2³¹. This is no problem for NumPy's array data structure, but dot may call cblas_ddot, which cannot handle an array that large. Most likely, it'll process only part of the array, but this depends on the implementation.
- scipy.sparse uses index arrays of type int to represent matrices in COO, CSC and CSR formats, so it has much the same limitation as BLAS: n_samples, n_features and the number of non-zero entries are each limited to 2³¹-1. There's a PR to fix this and make sparse matrices switch to a larger index type when needed, but it hasn't been merged yet, and when it is, we'll probably need to use fused types in all the sparse matrix-handling Cython code.
- Since npy_intp is an alias at the C level, NumPy has no way of showing that a variable is of this type in Python. Instead, it shows the actual type, so on x86-64 (but not on i386, and probably not on ARM), you'll get results like

      >>> type(np.intp(1))  # corresponds to npy_intp
      <type 'numpy.int64'>
      >>> type(np.intc(1))  # corresponds to a C "int"
      <type 'numpy.int32'>
      >>> np.where([True])[0].dtype
      dtype('int64')  # actually an npy_intp
- np.random.randint returns a Python int (variable-size integer) when asked for one number. When asked for an array, it returns either 32-bit or 64-bit integers depending on sizeof(long); this is hardcoded in the C implementation. On most platforms, this matches the size of npy_intp, but again there's no guarantee, so getting random indices can be tricky.
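The caveats above suggest two defensive habits: checking that a dimension fits in a C int before handing data to code that indexes with int, and converting index arrays to np.intp explicitly rather than relying on whatever dtype comes back. What follows is a minimal sketch, assuming NumPy only. The helper names fits_in_c_int and squared_norm_chunked are hypothetical (they are not part of NumPy or the scikit), and the chunk size is an arbitrary choice.

```python
import numpy as np

# Largest value a C int can hold on this platform (INT_MAX in <limits.h>).
INT_MAX = np.iinfo(np.intc).max


def fits_in_c_int(n):
    """Return True if n can be passed as a C int, e.g. to a BLAS routine."""
    return 0 <= n <= INT_MAX


def squared_norm_chunked(X, chunk=2 ** 20):
    """Sum of squares of X, computed in pieces no longer than `chunk`.

    Each np.dot call sees at most `chunk` elements, so the length passed
    to an underlying BLAS ddot (if one is used) always fits in an int.
    """
    flat = np.asarray(X, dtype=np.float64).ravel()
    total = 0.0
    for start in range(0, flat.shape[0], chunk):
        piece = flat[start:start + chunk]
        total += np.dot(piece, piece)
    return total


X = np.arange(12, dtype=np.float64).reshape(3, 4)
assert fits_in_c_int(X.size)
assert np.isclose(squared_norm_chunked(X, chunk=5), np.dot(X.ravel(), X.ravel()))

# Index arrays: convert to np.intp explicitly instead of assuming the dtype.
rng = np.random.RandomState(0)
indices = rng.randint(0, X.shape[0], size=5).astype(np.intp)
assert indices.dtype == np.intp
rows = X[indices]
```

Chunking trades a Python-level loop for the guarantee that no single call exceeds the int range; for arrays that comfortably fit, the plain np.dot call remains the simpler choice.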
-- larsmans