Skip to content

RFC: Expand documentation of seek, position and related functions #58024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jakobnissen
Copy link
Member

This commit intentionally omits expanding the documentation of mark etc., which although related, is a somewhat different mechanism.

I'm interested in discussion about what the right behaviour is:

  • Here, I documented seek is able to seek past the end of the IO, if the IO is seekable, in which case writing to it will fill in zero bytes. But I don't actually like this. It feels like that API is mixing up the behaviour of truncate and seek. It really feels like this should be an error. It also makes the behaviour of skip more awkward, because implementing this by reading a bunch of bytes no longer corresponds to seeking. It also creates a weird mismatch between reading and writing IOs. IMO, the congruence with libc's lseek is not enough to warrant this. IMO this should throw an exception when the seeking goes out of bounds in either direction.

  • This PR kind of chickens out and doesn't go ahead and say that position should return a zero-based Int. Having it be more generic is nice for e.g. BGZF's virtualoffset, but it also makes it harder to use. Probably, IOs will not document what their position returns, so in practice, users will just rely on it returning a zero-based Int without it being documented.

  • A comment from Jameson suggests position's return value should support arithmetic. That makes sense, since it then allows skip and relative seeking. However, the only non-integer position type I know of, BGZF's virtual offset, does not support arithmetic. So it feels like documenting this gets the worst of both worlds: It neither enables custom position types, nor do you get the convenience of knowing the return value is a zero-indexed Int

TODO

  • Discuss the points above and reach a conclusion
  • Make sure Base's types actually follow this behaviour. They currently behave inconsistently.

Closes #57618
Closes #57639

This commit intentionally omits expanding the documentation of `mark` etc.,
which although related, is a somewhat different mechanism.
@jakobnissen jakobnissen added needs decision A decision on this change is needed io Involving the I/O subsystem: libuv, read, write, etc. labels Apr 6, 2025
@jakobnissen jakobnissen changed the title RFC: Expand documentation of seek, position` and related functions RFC: Expand documentation of seek, position and related functions Apr 6, 2025
@vtjnash
Copy link
Member

vtjnash commented Apr 10, 2025

LGTM. My understanding is that BGZF implementation does not provide position (since it is annoying to implement), but if it did, it would not be based on virtualoffset but rather would follow this documentation

The `pos` argument should be a value that can be returned by `position(s)`.
Seeking before the first position should throw an error. Seeking after the
final position should throw an error if the stream is not writable.
If the stream is writable, reading will throw an error, and writing will fill
Copy link
Member

@nhz2 nhz2 Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be okay for read to return an empty vector instead of throwing an error. That is how IOStream, and GZipStream currently behave after seeking past the end.

@@ -190,6 +203,14 @@ end

Seek a stream relative to the current position.

This is equivalent to `seek(s, position(s) + offset)`, except where seeking would
seek beyond the end of the stream, where `skip` will throw an `EOFError`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to leave the case of skip beyond the end of the file as implementation-defined behavior.
Allowing skip to go beyond the end of the stream to match the behavior of IOStream in TranscodingStreams is possible, but I think it would require keeping track of a bunch of additional states like zlib does in https://github.com/madler/zlib/blob/5a82f71ed1dfc0bec044d9702463dbdf84ea3b71/gzguts.h#L188-L196, basically making skip lazy and then running any requested skips on every read. https://github.com/madler/zlib/blob/5a82f71ed1dfc0bec044d9702463dbdf84ea3b71/gzread.c#L276-L281

Co-authored-by: Jameson Nash <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. needs decision A decision on this change is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Semantics of position Semantics of seek
3 participants