C integer types: the missing manual
Lars Buitinck edited this page Oct 22, 2013
·
9 revisions
Here be dragons.
Throughout the scikit are bits and pieces written in Cython, and these commonly use C integers to index into arrays. Also, the Python code regularly creates arrays of C integers by `np.where` and other means. Confusion arises every so often over what the correct type for integers is, especially where we use them as indexes.
There are various types of C integers. For our purposes, the main ones are:
- `int`: the "native" integer type. This once meant the size of a register, but on x86-64 this no longer seems to be true, as `int` is 32 bits wide (for compatibility with i386?) while the registers and pointers are now 64 bits.
- `size_t`: standard C89 type, defined in `<stddef.h>` and a variety of other standard C headers. Always unsigned. Large enough to hold the size of any object. This is the type of a C `sizeof` expression and of the return value of `strlen`, and it's what functions like `malloc`, `memcpy`, `strncpy` etc. expect for their size arguments. Use when dealing with these functions.
- `Py_ssize_t`: type defined in `<Python.h>` and declared implicitly in Cython, that can hold the size (in bytes) of the largest object the Python interpreter ever creates. Index type for `list`. 63 bits + sign on x86-64; in general, the signed counterpart of `size_t`, with the sign used for negative indices so `l[-1]` works in C as well. Use when dealing with the CPython API.
- `np.npy_intp`: type defined by NumPy that is always large enough to hold the value of a pointer, like `intptr_t` in C99. 63 bits + sign on x86-64, and probably always the size of `Py_ssize_t`, although there's no guarantee. Use when handing pointers to the NumPy C API.
Now to confound matters:
- BLAS uses `int` for all its integers, except (in ATLAS) the return value from certain functions, which is `size_t`. It follows that when you call BLAS, you shouldn't expect to be able to handle array dimensions ≥2³¹, i.e. anything above `INT_MAX` (from `<limits.h>`), which is 2³¹-1 for a 32-bit `int`. If this doesn't sound like much of a problem, consider the fast way to compute the squared Frobenius norm of a matrix: `np.dot(X.ravel(), X.ravel())`. The ravel'd array has only one dimension, which may be ≥2³¹. This is no problem for NumPy's array data structure, but `dot` may call `cblas_ddot`, which cannot handle the large array. Most likely, it'll go into an infinite loop, but this depends on the implementation.
- `scipy.sparse` uses index arrays of type `int` to represent matrices in COO, CSC and CSR formats, so it has much the same limitation as BLAS. `n_samples`, `n_features` and the number of non-zero entries are all three limited to 2³¹-1. There's a PR to fix this and make sparse matrices switch to a larger index type when needed, but that's not been merged yet and when it is, we'll probably need to use fused types in all the sparse matrix-handling Cython code.
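
One way to stay clear of the `int` limit is to check lengths against `INT_MAX` before handing an array to BLAS, and to split longer arrays into pieces that each fit. The following is a minimal sketch of that idea in plain Python, not something the scikit currently does. The names `fits_c_int` and `blas_safe_chunks` are hypothetical, and `INT_MAX` is derived from the platform's `int` width via `ctypes`.

```python
import ctypes

# Largest value of a C int on this platform (INT_MAX in <limits.h>).
INT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1


def fits_c_int(n):
    """Return True if the non-negative length n can be passed as a C int."""
    return 0 <= n <= INT_MAX


def blas_safe_chunks(n, max_len=INT_MAX):
    """Yield (start, stop) pairs covering range(n), each at most max_len long.

    Hypothetical helper: every slice it describes has a length that fits
    in a C int, so it could be handed to an int-indexed routine.
    """
    for start in range(0, n, max_len):
        yield start, min(start + max_len, n)


# Small demonstration with an artificially low limit of 4 elements.
print(list(blas_safe_chunks(10, max_len=4)))
```

With NumPy, the squared norm of a long 1-d array `x` could then be accumulated as the sum of `np.dot(x[a:b], x[a:b])` over these `(a, b)` pairs, at the price of one BLAS call per chunk.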
-- larsmans