Improve Spill Performance: mmap the spill files #15321
Comments
take |
BTW I am not sure the code is really in a great position to do this one yet -- it might help to wait for @2010YOUY01 (or help him) to pull some of the spilling code into its own structure -- see #15355 |
Hi @alamb, thanks for sharing your great example here. Following your approach, I noticed that when an Arrow buffer backed by mmap is used as a field in the Decoder, running the tests produces the following output:
It seems that the drop method defined in memmap2 isn't being executed:
However, after switching to using mmap directly as a field, the related tests passed successfully. |
This error is raised by DataFusion's internal memory tracking component rather than by the OS, so there is likely something wrong with the spill logic. Could you share the draft code? |
In this issue, I've been experimenting with replacing StreamWriter/StreamReader with FileWriter/FileReader to evaluate potential performance improvements using mmap. However, direct replacement seems to cause some other problems; see #14868. The errors above were encountered with my own FileWriter and FileReader implementation, so I don't think it's an issue with our current spill implementation :) |
Hi @2010YOUY01, I created a draft here: zebsme#3. The draft reproduces the errors, and applying the code changes in the comment resolves all the test failures. I'm not sure whether munmap is responsible for the memory allocation failures. Could you please help review this? |
I took a look, but the reason for those test failures isn’t obvious to me. However, I think the implementation requires two changes:
|
Thanks for your reply, @2010YOUY01. The benchmark shows that StreamReader with mmap gives no performance improvement over the current implementation:
So there is no need to enable mmap for now, since we need the stream format here. |
Maybe it's already using |
StreamReader with mmap enabled:
|
Is your feature request related to a problem or challenge?
Today, when DataFusion spills files to disk, it uses the Arrow IPC format.
Here is the code:
datafusion/datafusion/physical-plan/src/spill.rs
Lines 60 to 88 in 988a535
The IPC reader currently reads the spill files into memory using file IO.
It is possible to use mmap to zero copy the contents of the files into memory. Here is an example of how to do so: https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs
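To make the difference between the two read paths concrete, here is a minimal, dependency-free sketch of the mmap step alone. It is not the linked example and not DataFusion code: `MappedFile` and `mapped_matches_file_io` are hypothetical names, and the raw `mmap`/`munmap` declarations and constants are assumptions that hold on 64-bit Linux and macOS. Real code would more likely use the memmap2 crate and hand the mapping to an Arrow buffer.

```rust
use std::fs::File;
use std::io;
use std::os::raw::{c_int, c_void};
use std::os::unix::io::AsRawFd;
use std::path::Path;

// Hand-written declarations of the two POSIX calls, used only to keep this
// sketch free of external crates (std already links libc on Unix).
// `offset` is i64, which matches off_t on 64-bit Linux and macOS.
// `unsafe extern` needs Rust 1.82 or newer.
unsafe extern "C" {
    fn mmap(
        addr: *mut c_void,
        len: usize,
        prot: c_int,
        flags: c_int,
        fd: c_int,
        offset: i64,
    ) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

const PROT_READ: c_int = 1; // value on Linux and macOS
const MAP_PRIVATE: c_int = 2; // value on Linux and macOS

/// Hypothetical helper: a read-only mapping of a whole file, unmapped on drop.
struct MappedFile {
    ptr: *mut c_void,
    len: usize,
}

impl MappedFile {
    fn open(path: &Path) -> io::Result<Self> {
        let file = File::open(path)?;
        let len = file.metadata()?.len() as usize;
        if len == 0 {
            // mmap rejects zero-length mappings, so report that up front.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "cannot map an empty file",
            ));
        }
        // SAFETY: the fd is open for the duration of the call and the
        // return value is checked against MAP_FAILED below.
        let ptr = unsafe {
            mmap(std::ptr::null_mut(), len, PROT_READ, MAP_PRIVATE, file.as_raw_fd(), 0)
        };
        if ptr as isize == -1 {
            // MAP_FAILED is (void *) -1.
            return Err(io::Error::last_os_error());
        }
        // The mapping stays valid after `file` is closed at the end of this scope.
        Ok(Self { ptr, len })
    }

    /// Borrow the file contents; pages are faulted in on access instead of
    /// being copied into a heap buffer first.
    fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` points at `len` readable bytes until `drop` runs.
        // This assumes nobody truncates the file while it is mapped.
        unsafe { std::slice::from_raw_parts(self.ptr as *const u8, self.len) }
    }
}

impl Drop for MappedFile {
    fn drop(&mut self) {
        // SAFETY: `ptr` and `len` describe a mapping created in `open`.
        unsafe {
            munmap(self.ptr, self.len);
        }
    }
}

/// Read the same file through plain file IO (a copy) and through the mapping
/// (a borrowed view), and report whether both paths see identical bytes.
fn mapped_matches_file_io(path: &Path) -> io::Result<bool> {
    let copied = std::fs::read(path)?;
    let mapped = MappedFile::open(path)?;
    Ok(mapped.as_slice() == copied.as_slice())
}
```

The point of the sketch is the ownership shape: whatever holds the mapping must outlive every slice handed out from it, and the unmap happens in `Drop`.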
with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder (arrow-rs#7120) suggested mmap is 3x faster than file IO.
Describe the solution you'd like
I would like to see if using mmap to read the spill files back in is faster
Describe alternatives you've considered
Additional context