-
-
Notifications
You must be signed in to change notification settings - Fork 31.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-128646: Implement GzipFile.readinto() functions #128647
Conversation
e94363a
to
4eea459
Compare
4eea459
to
f272a9c
Compare
f272a9c
to
db7f971
Compare
Disclaimer: not a subject matter expert. The change looks fine to me. However, I can't shake the feeling that this seems like it ought to be fixable somewhere in I would like to see some basic tests added for |
Tests added. I adapted I couldn't see how to make either of them a regression test for this issue.
Here's a possible diff for `_pyio.py`:diff --git a/Lib/_pyio.py b/Lib/_pyio.py
index 023478aa78c..9197c8628d5 100644
--- a/Lib/_pyio.py
+++ b/Lib/_pyio.py
@@ -721,15 +721,17 @@ def _readinto(self, b, read1):
b = memoryview(b)
b = b.cast('B')
- if read1:
+ nread = 0
+ while True:
data = self.read1(len(b))
- else:
- data = self.read(len(b))
- n = len(data)
-
- b[:n] = data
+ n = len(data)
+ b[:n] = data
+ nread += n
+ if read1:
+ break
+ b = b[n:]
- return n
+ return nread
def write(self, b):
"""Write the given bytes buffer to the IO stream. I could try updating the C implementation along those lines as well... |
Not a Python core dev, but have been spending a lot of time around Currently the C has a generic For testing, the ideal would be something like Go's This adds two more copies of the snippet:
That now has at least 4 copies, probably should be turned into a helper function |
I adapted my above Python code to the C implementation:diff --git a/Modules/_io/bufferedio.c b/Modules/_io/bufferedio.c
index e45323c93a1..7cdd8c28283 100644
--- a/Modules/_io/bufferedio.c
+++ b/Modules/_io/bufferedio.c
@@ -49,35 +49,46 @@ static PyObject *
_bufferediobase_readinto_generic(PyObject *self, Py_buffer *buffer, char readinto1)
{
Py_ssize_t len;
+ Py_ssize_t nread;
PyObject *data;
- PyObject *attr = readinto1
- ? &_Py_ID(read1)
- : &_Py_ID(read);
- data = _PyObject_CallMethod(self, attr, "n", buffer->len);
- if (data == NULL)
- return NULL;
+ PyObject *attr = &_Py_ID(read1);
- if (!PyBytes_Check(data)) {
- Py_DECREF(data);
- PyErr_SetString(PyExc_TypeError, "read() should return bytes");
- return NULL;
- }
+ nread = 0;
+ do {
+ data = _PyObject_CallMethod(self, attr, "n", buffer->len - nread);
+ if (data == NULL)
+ return NULL;
- len = PyBytes_GET_SIZE(data);
- if (len > buffer->len) {
- PyErr_Format(PyExc_ValueError,
- "read() returned too much data: "
- "%zd bytes requested, %zd returned",
- buffer->len, len);
- Py_DECREF(data);
- return NULL;
- }
- memcpy(buffer->buf, PyBytes_AS_STRING(data), len);
+ if (!PyBytes_Check(data)) {
+ Py_DECREF(data);
+ PyErr_SetString(PyExc_TypeError, "read() should return bytes");
+ return NULL;
+ }
+
+ len = PyBytes_GET_SIZE(data);
+ if (len == 0) {
+ break;
+ }
+ if (nread + len > buffer->len) {
+ PyErr_Format(PyExc_ValueError,
+ "read1() returned too much data: "
+ "%zd bytes requested, %zd returned",
+ buffer->len, nread + len);
+ Py_DECREF(data);
+ return NULL;
+ }
+ memcpy(buffer->buf + nread, PyBytes_AS_STRING(data), len);
+ nread += len;
+
+ if (readinto1) {
+ break;
+ }
+ } while (nread < buffer->len);
Py_DECREF(data);
- return PyLong_FromSsize_t(len);
+ return PyLong_FromSsize_t(nread);
}
/*[clinic input] Using the memray test in #128646, it doesn't achieve the desired result. I suspect this is because it still has to go through That does seem like a bigger problem, and I'd be happy to tackle it in another PR, but that could be a rabbit hole, while this is an achievable win right now. |
Okay, I got nerd-sniped and have now convinced myself that it is absolutely not worth the trouble of resolving this in Even if we can resolve this in
We could just use Lines 9 to 18 in 1b27f36
That exception subclasses |
👍 to not doing |
@zware Just checking: Are you waiting on something from me, or are you leaving approval to a subject matter expert? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good step for now. I looked at doing the BufferedIOBase changes, and it's complicated as the current code says "readinto is always implemented on top of read", and that would need to be changed to "read is implemented on top of readinto, if readinto is provided". In both cases, readall
(size=-1) could be implemented on top of readinto, which likely would be a win, but not the problem this issue is looking at / trying to solve.
🤖 New build scheduled with the buildbot fleet by @zware for commit 9ded50d 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F128647%2Fmerge If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again. |
I was hoping an actual expert would materialize, but at this point I'm ready to just go ahead with this as long as the buildbots are happy with it. I expect to merge this later today, but if I haven't made it back by Wednesday1 please ping me to get this in before 3.14.0a6. Footnotes
|
|
The bot failure looks like an unrelated failure; the |
@Yhg1s I just want to check if this can be considered for backporting to 3.12 and 3.13. It resolves what I consider a bug, but you may consider it an optimization. |
IMO it's an optimization and it should not be backported. |
This PR adds implementations of
.readinto()
and.readinto1()
for GzipFile which pass the received buffer to the underlyingBufferedReader
. Without this, the baseBufferedIOReader
implementation is used, which simply calls.read()
and copies into the target buffer.I don't know a good way to write a test to check memory usage, but see #128646 for a profile that demonstrates the effectiveness of the patch.