Skip to content

Race condition in list.remove under free-threading #148259

@KowalskiThomas

Description

@KowalskiThomas

Bug report

Bug

I found a race condition on the free-threaded build at 72eca2af59043c78647b0e6be3777a947ea9ef0f happening in list.remove.

The long story short is if whatever PyObject_RichCompareBool does releases the GIL, the critical section no longer protects the list's state from other threads, which may then mutate the list and result in the wrong item being removed later.

This is the current code for list.remove with comments around the problematic logic.

static PyObject *
list_remove_impl(PyListObject *self, PyObject *value)
{
    Py_ssize_t i;
    for (i = 0; i < Py_SIZE(self); i++) {
        // read the contents under lock
        PyObject *obj = self->ob_item[i];
        Py_INCREF(obj);
        // this may release the GIL, suspending the critical section
        int cmp = PyObject_RichCompareBool(obj, value, Py_EQ);
        Py_DECREF(obj);
        // it is possible that obj (which we want to remove) isn't at
        // position i anymore as other threads may have mutated 
        // the list...
        if (cmp > 0) {
            // ... but we still remove it!
            if (list_ass_slice_lock_held(self, i, i+1, NULL) == 0)
                Py_RETURN_NONE;
            return NULL;
        }
    }
    ...
}

I'm not sure whether a crash can occur from that race condition (e.g. i is the position of the last element in the list and another thread pops the last element), but regardless I think there's a correctness issue with that race condition.

It's a bit hard to come up with a reproducer due to the race condition happening under pretty specific circumstances but I can try to make (or generate) one if needed.

Possible fixes

The simplest thing I can think of is re-comparing ob_item[i] with obj after PyObject_RichCompareBool has returned to make sure the state of the list, at least as far as this specific item's position is concerned, hasn't changed, and raising a RuntimeError if not.

Alternatively, I looked at other containers and it seems like deque has a state field which is incremented every time a mutating operation is done on it; if the state value changes in an unexpected manner during mutations, it raises a RuntimeError (I don't think list has anything like that currently).

I already have a branch in my fork that implements the former "fix", but I'm also open to working on the latter if there's interest in it.

CPython versions tested on:

CPython main branch

Operating systems tested on:

macOS

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions