
GzipFile.readinto reads full file before copying into the provided buffer #128646

Closed
effigies opened this issue Jan 8, 2025 · 3 comments
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

effigies (Contributor) commented Jan 8, 2025

Bug report

Bug description:

gzip.GzipFile uses the BufferedIOBase implementation of .readinto(), which simply calls .read() and copies the result into the provided buffer. This defeats the purpose of using .readinto() at all.
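That fallback amounts to a read-then-copy, which can be sketched as follows (a minimal sketch, simplified from Lib/_pyio.py; the standalone function name is illustrative, not a stdlib API):

```python
# Sketch of the inherited BufferedIOBase.readinto fallback behavior.
import io

def readinto_fallback(stream, b):
    # The full len(b) bytes are materialized as a temporary bytes object
    # by read(), then copied into the caller's buffer: roughly twice the
    # memory for one read.
    data = stream.read(len(b))
    n = len(data)
    b[:n] = data
    return n

buf = bytearray(8)
n = readinto_fallback(io.BytesIO(b"abcdefgh"), buf)
assert n == 8 and bytes(buf) == b"abcdefgh"
```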

This may be considered a missed optimization rather than a bug, but it has surfaced in downstream tools and I've traced it back to CPython.

import os
from gzip import GzipFile

n_mbs = 50

with GzipFile('test.gz', mode='wb') as fobj:
    for _ in range(n_mbs):
        fobj.write(os.urandom(2**20))

buffer = bytearray(n_mbs * 2**20)

with GzipFile('test.gz', mode='rb') as fobj:
    fobj.readinto(buffer)
Profiled with:

memray run load_file.py
memray flamegraph memray-*.bin && rm memray-*.bin

Current memory profile


Duration: 0:00:01.821000
Total number of allocations: 5064
Total number of frames seen: 85
Peak memory usage: 116.3 MiB
Python allocator: pymalloc

Patched memory profile


Duration: 0:00:01.828000
Total number of allocations: 3317
Total number of frames seen: 79
Peak memory usage: 66.2 MiB
Python allocator: pymalloc
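Without memray, tracemalloc gives a rough view of the same temporary allocation (a sketch; the exact peak depends on the Python version and allocator):

```python
import gzip
import io
import os
import tracemalloc

payload = os.urandom(4 << 20)  # 4 MiB of incompressible data
blob = gzip.compress(payload)
dest = bytearray(len(payload))  # allocated before tracing starts

tracemalloc.start()
with gzip.GzipFile(fileobj=io.BytesIO(blob), mode='rb') as fobj:
    fobj.readinto(dest)
peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# On unpatched versions the peak includes a full-size temporary bytes
# object created by read(); with the patch it stays near the chunk size.
print(f"peak traced during readinto: {peak / 2**20:.1f} MiB")
assert bytes(dest) == payload
```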

Patch

diff --git a/Lib/gzip.py b/Lib/gzip.py
index 1a3c82ce7e0..21bb4b085fd 100644
--- a/Lib/gzip.py
+++ b/Lib/gzip.py
@@ -338,6 +338,20 @@ def read1(self, size=-1):
             size = io.DEFAULT_BUFFER_SIZE
         return self._buffer.read1(size)
 
+    def readinto(self, b):
+        self._check_not_closed()
+        if self.mode != READ:
+            import errno
+            raise OSError(errno.EBADF, "readinto() on write-only GzipFile object")
+        return self._buffer.readinto(b)
+
+    def readinto1(self, b):
+        self._check_not_closed()
+        if self.mode != READ:
+            import errno
+            raise OSError(errno.EBADF, "readinto1() on write-only GzipFile object")
+        return self._buffer.readinto1(b)
+
     def peek(self, n):
         self._check_not_closed()
         if self.mode != READ:

I believe this should be an uncontroversial patch, so I will open a PR immediately.

cc @psadil

CPython versions tested on:

3.9, 3.10, 3.11, 3.12, 3.13, CPython main branch

Operating systems tested on:

Linux

@effigies effigies added the type-bug An unexpected behavior, bug, or error label Jan 8, 2025
effigies added a commit to effigies/cpython that referenced this issue Jan 8, 2025

effigies added a commit to effigies/cpython that referenced this issue Jan 8, 2025
@ZeroIntensity ZeroIntensity added the stdlib Python modules in the Lib dir label Jan 8, 2025
effigies (Contributor, Author) commented:

gzip.GzipFile uses the BufferedIOBase implementation of .readinto(), which simply calls .read() and copies the result into the provided buffer. This defeats the purpose of using .readinto() at all.

To expand on the above in hopes of saving some effort for a reviewer, here is my trace of calls:

gzip.GzipFile inherits from _compression.BaseStream, which inherits from io.BufferedIOBase.
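The chain can be checked directly (note that _compression is a private stdlib module):

```python
# Verify the inheritance chain described above.
import _compression
import gzip
import io

assert issubclass(gzip.GzipFile, _compression.BaseStream)
assert issubclass(_compression.BaseStream, io.BufferedIOBase)
```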

These are the _pyio.py and bufferedio.c implementations of io.BufferedIOBase.readinto()/readinto1():

cpython/Lib/_pyio.py

Lines 694 to 732 in ec91e1c

    def readinto(self, b):
        """Read bytes into a pre-allocated bytes-like object b.

        Like read(), this may issue multiple reads to the underlying raw
        stream, unless the latter is 'interactive'.

        Returns an int representing the number of bytes read (0 for EOF).

        Raises BlockingIOError if the underlying raw stream has no
        data at the moment.
        """
        return self._readinto(b, read1=False)

    def readinto1(self, b):
        """Read bytes into buffer *b*, using at most one system call

        Returns an int representing the number of bytes read (0 for EOF).

        Raises BlockingIOError if the underlying raw stream has no
        data at the moment.
        """
        return self._readinto(b, read1=True)

    def _readinto(self, b, read1):
        if not isinstance(b, memoryview):
            b = memoryview(b)
        b = b.cast('B')

        if read1:
            data = self.read1(len(b))
        else:
            data = self.read(len(b))
        n = len(data)

        b[:n] = data

        return n

static PyObject *
_bufferediobase_readinto_generic(PyObject *self, Py_buffer *buffer, char readinto1)
{
    Py_ssize_t len;
    PyObject *data;

    PyObject *attr = readinto1
        ? &_Py_ID(read1)
        : &_Py_ID(read);
    data = _PyObject_CallMethod(self, attr, "n", buffer->len);
    if (data == NULL)
        return NULL;

    if (!PyBytes_Check(data)) {
        Py_DECREF(data);
        PyErr_SetString(PyExc_TypeError, "read() should return bytes");
        return NULL;
    }

    len = PyBytes_GET_SIZE(data);
    if (len > buffer->len) {
        PyErr_Format(PyExc_ValueError,
                     "read() returned too much data: "
                     "%zd bytes requested, %zd returned",
                     buffer->len, len);
        Py_DECREF(data);
        return NULL;
    }

    memcpy(buffer->buf, PyBytes_AS_STRING(data), len);
    Py_DECREF(data);
    return PyLong_FromSsize_t(len);
}

/*[clinic input]
@critical_section
_io._BufferedIOBase.readinto
    buffer: Py_buffer(accept={rwbuffer})
    /
[clinic start generated code]*/

static PyObject *
_io__BufferedIOBase_readinto_impl(PyObject *self, Py_buffer *buffer)
/*[clinic end generated code: output=8c8cda6684af8038 input=5273d20db7f56e1a]*/
{
    return _bufferediobase_readinto_generic(self, buffer, 0);
}

/*[clinic input]
@critical_section
_io._BufferedIOBase.readinto1
    buffer: Py_buffer(accept={rwbuffer})
    /
[clinic start generated code]*/

static PyObject *
_io__BufferedIOBase_readinto1_impl(PyObject *self, Py_buffer *buffer)
/*[clinic end generated code: output=358623e4fd2b69d3 input=d6eb723dedcee654]*/
{
    return _bufferediobase_readinto_generic(self, buffer, 1);
}

These call self.read()/self.read1(), which gzip.GzipFile proxies to an internal self._buffer, an io.BufferedReader:

cpython/Lib/gzip.py

Lines 206 to 210 in ec797d1

        if mode.startswith('r'):
            self.mode = READ
            raw = _GzipReader(fileobj)
            self._buffer = io.BufferedReader(raw)
            self.name = filename

cpython/Lib/gzip.py

Lines 321 to 339 in ec797d1

    def read(self, size=-1):
        self._check_not_closed()
        if self.mode != READ:
            import errno
            raise OSError(errno.EBADF, "read() on write-only GzipFile object")
        return self._buffer.read(size)

    def read1(self, size=-1):
        """Implements BufferedIOBase.read1()

        Reads up to a buffer's worth of data if size is negative."""
        self._check_not_closed()
        if self.mode != READ:
            import errno
            raise OSError(errno.EBADF, "read1() on write-only GzipFile object")

        if size < 0:
            size = io.DEFAULT_BUFFER_SIZE
        return self._buffer.read1(size)

Therefore, we are forcing self._buffer to grow its internal buffer to hold the full contents that we intend to copy into the caller's buffer. Instead, we should directly call self._buffer.readinto()/readinto1(), which is implemented here:

cpython/Lib/_pyio.py

Lines 1144 to 1190 in ec91e1c

    def _readinto(self, buf, read1):
        """Read data into *buf* with at most one system call."""

        self._checkClosed("readinto of closed file")

        # Need to create a memoryview object of type 'b', otherwise
        # we may not be able to assign bytes to it, and slicing it
        # would create a new object.
        if not isinstance(buf, memoryview):
            buf = memoryview(buf)
        if buf.nbytes == 0:
            return 0
        buf = buf.cast('B')

        written = 0
        with self._read_lock:
            while written < len(buf):

                # First try to read from internal buffer
                avail = min(len(self._read_buf) - self._read_pos, len(buf))
                if avail:
                    buf[written:written+avail] = \
                        self._read_buf[self._read_pos:self._read_pos+avail]
                    self._read_pos += avail
                    written += avail
                    if written == len(buf):
                        break

                # If remaining space in callers buffer is larger than
                # internal buffer, read directly into callers buffer
                if len(buf) - written > self.buffer_size:
                    n = self.raw.readinto(buf[written:])
                    if not n:
                        break  # eof
                    written += n

                # Otherwise refill internal buffer - unless we're
                # in read1 mode and already got some data
                elif not (read1 and written):
                    if not self._peek_unlocked(1):
                        break  # eof

                # In readinto1 mode, return as soon as we have some data
                if read1 and written:
                    break

        return written

static PyObject *
_buffered_readinto_generic(buffered *self, Py_buffer *buffer, char readinto1)
{
    Py_ssize_t n, written = 0, remaining;
    PyObject *res = NULL;

    CHECK_INITIALIZED(self)
    CHECK_CLOSED(self, "readinto of closed file")

    n = Py_SAFE_DOWNCAST(READAHEAD(self), Py_off_t, Py_ssize_t);
    if (n > 0) {
        if (n >= buffer->len) {
            memcpy(buffer->buf, self->buffer + self->pos, buffer->len);
            self->pos += buffer->len;
            return PyLong_FromSsize_t(buffer->len);
        }
        memcpy(buffer->buf, self->buffer + self->pos, n);
        self->pos += n;
        written = n;
    }

    if (!ENTER_BUFFERED(self))
        return NULL;

    if (self->writable) {
        res = buffered_flush_and_rewind_unlocked(self);
        if (res == NULL)
            goto end;
        Py_CLEAR(res);
    }

    _bufferedreader_reset_buf(self);
    self->pos = 0;

    for (remaining = buffer->len - written;
         remaining > 0;
         written += n, remaining -= n) {
        /* If remaining bytes is larger than internal buffer size, copy
         * directly into caller's buffer. */
        if (remaining > self->buffer_size) {
            n = _bufferedreader_raw_read(self, (char *) buffer->buf + written,
                                         remaining);
        }

        /* In readinto1 mode, we do not want to fill the internal
           buffer if we already have some data to return */
        else if (!(readinto1 && written)) {
            n = _bufferedreader_fill_buffer(self);
            if (n > 0) {
                if (n > remaining)
                    n = remaining;
                memcpy((char *) buffer->buf + written,
                       self->buffer + self->pos, n);
                self->pos += n;
                continue; /* short circuit */
            }
        }
        else
            n = 0;

        if (n == 0 || (n == -2 && written > 0))
            break;
        if (n < 0) {
            if (n == -2) {
                res = Py_NewRef(Py_None);
            }
            goto end;
        }

        /* At most one read in readinto1 mode */
        if (readinto1) {
            written += n;
            break;
        }
    }

    res = PyLong_FromSsize_t(written);

end:
    LEAVE_BUFFERED(self);
    return res;
}

/*[clinic input]
@critical_section
_io._Buffered.readinto
    buffer: Py_buffer(accept={rwbuffer})
    /
[clinic start generated code]*/

static PyObject *
_io__Buffered_readinto_impl(buffered *self, Py_buffer *buffer)
/*[clinic end generated code: output=bcb376580b1d8170 input=777c33e7adaa2bcd]*/
{
    return _buffered_readinto_generic(self, buffer, 0);
}

/*[clinic input]
@critical_section
_io._Buffered.readinto1
    buffer: Py_buffer(accept={rwbuffer})
    /
[clinic start generated code]*/

static PyObject *
_io__Buffered_readinto1_impl(buffered *self, Py_buffer *buffer)
/*[clinic end generated code: output=6e5c6ac5868205d6 input=ef03cc5fc92a6895]*/
{
    return _buffered_readinto_generic(self, buffer, 1);
}

This implementation avoids growing the internal buffer beyond an individual chunk.
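Delegating to that implementation directly can be demonstrated as follows (a sketch; `_buffer` is a private GzipFile attribute holding the io.BufferedReader in read mode, used here only for illustration):

```python
import gzip
import io
import os

payload = os.urandom(1 << 20)  # 1 MiB of incompressible data
blob = gzip.compress(payload)

dest = bytearray(len(payload))
with gzip.GzipFile(fileobj=io.BytesIO(blob), mode='rb') as fobj:
    # Because len(dest) exceeds the reader's internal buffer size,
    # BufferedReader.readinto decompresses in chunks straight into `dest`
    # rather than building a full-size intermediate bytes object.
    n = fobj._buffer.readinto(dest)

assert n == len(payload)
assert bytes(dest) == payload
```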

effigies (Contributor, Author) commented Feb 3, 2025

Following the advice in https://devguide.python.org/getting-started/pull-request-lifecycle/, I'm pinging this issue after ~1mo to see if any reviewers have time to look at this.

I'm not sure if this is enough of a bug to get backported, so I'd like to make sure that it gets resolved in 3.14.

zware pushed a commit that referenced this issue Mar 8, 2025
The new methods simply delegate to the underlying buffer, much like the existing GzipFile.read[1] methods.  This avoids extra allocations caused by the BufferedIOBase.readinto implementation previously used.

This commit also factors out a common readability check rather than copying it an additional two times.
effigies (Contributor, Author) commented:

Closed by #128647.
