Improve Spill Performance: `mmap` the spill files #15321

alamb · 2025-03-19T20:54:55Z

part of [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts #15271

Is your feature request related to a problem or challenge?

Today, when DataFusion spills files to disk, it uses the Arrow IPC format.

Here is the code:

datafusion/datafusion/physical-plan/src/spill.rs

Lines 60 to 88 in 988a535

    
    pub(crate) fn spill_record_batches(
        batches: &[RecordBatch],
        path: PathBuf,
        schema: SchemaRef,
    ) -> Result<(usize, usize)> {
        let mut writer = IPCStreamWriter::new(path.as_ref(), schema.as_ref())?;
        for batch in batches {
            writer.write(batch)?;
        }
        writer.finish()?;
        debug!(
            "Spilled {} batches of total {} rows to disk, memory released {}",
            writer.num_batches,
            writer.num_rows,
            human_readable_size(writer.num_bytes),
        );
        Ok((writer.num_rows, writer.num_bytes))
    }

    fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
        let file = BufReader::new(File::open(path)?);
        let reader = StreamReader::try_new(file, None)?;
        for batch in reader {
            sender
                .blocking_send(batch.map_err(Into::into))
                .map_err(|e| exec_datafusion_err!("{e}"))?;
        }
        Ok(())
    }

The IPC reader currently reads the spill files into memory using file IO.

It is possible to use mmap to zero-copy the contents of the files into memory. Here is an example of how to do so:

https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs

My testing on arrow-rs#7120 (Add with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder) suggested mmap is 3x faster than file IO.

Describe the solution you'd like

I would like to see if using mmap to read the spill files back in is faster

Describe alternatives you've considered

Use mmap to read spill files
Add / use a benchmark showing the performance benefit of doing this
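
As a rough illustration of the benchmark shape, here is a minimal std-only sketch that times two ways of reading a scratch file. It does not use mmap or Arrow, and all helper names (`read_buffered`, `read_whole`, `time_it`, `write_scratch_file`) are hypothetical. A real benchmark would exercise DataFusion's spill reader under a proper harness.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read, Write};
use std::path::Path;
use std::time::{Duration, Instant};

/// Read the file through a `BufReader` in fixed-size chunks (plain file IO).
fn read_buffered(path: &Path) -> io::Result<u64> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut chunk = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok(total)
}

/// Read the whole file into one allocation with a single call.
fn read_whole(path: &Path) -> io::Result<u64> {
    Ok(std::fs::read(path)?.len() as u64)
}

/// Run `f` `iters` times; return the last result and the mean wall-clock time.
fn time_it<F>(iters: u32, mut f: F) -> io::Result<(u64, Duration)>
where
    F: FnMut() -> io::Result<u64>,
{
    assert!(iters > 0);
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        last = f()?;
    }
    Ok((last, start.elapsed() / iters))
}

/// Create a scratch file of `len` bytes to stand in for a spill file.
fn write_scratch_file(path: &Path, len: usize) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(&vec![0xABu8; len])?;
    file.sync_all()
}
```

Both strategies should report the same byte count. Only the returned durations differ, and they are dominated by the OS page cache unless the file is much larger than memory.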

Additional context

Improve spill performance: Disable re-validation of spilled files #15320

zebsme · 2025-03-22T16:04:55Z

take

alamb · 2025-03-22T16:24:46Z

BTW I am not sure the code is really in a great position to do this one yet -- it might help to wait for @2010YOUY01 (or help him) to pull some of the spilling code into its own structure -- see #15355

zebsme · 2025-03-26T11:50:24Z

Hi @alamb, thanks for sharing your great example here. Following your approach, I noticed that when using an Arrow buffer with mmap as a field in the Decoder, running the tests produces the following output:

thread 'memory_limit::test_stringview_external_sort' panicked at datafusion/core/tests/memory_limit/mod.rs:468:32:
Query execution failed: ResourcesExhausted("Failed to allocate additional 6643090 bytes for ExternalSorterMerge[0] with 27430266 bytes already allocated for this reservation - 1410938 bytes remain available for the total pool")

It seems that the drop method defined in memmap2 isn't being executed:

impl Drop for MmapInner {
    fn drop(&mut self) {
        let (ptr, len, _) = self.as_mmap_params();

        // Any errors during unmapping/closing are ignored as the only way
        // to report them would be through panicking which is highly discouraged
        // in Drop impls, c.f. https://github.com/rust-lang/lang-team/issues/97
        unsafe { libc::munmap(ptr, len as libc::size_t) };
    }
}

However, after switching to using mmap directly as a field, the related tests passed successfully.
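
One possible explanation, which this thread does not confirm, is shared ownership: if the mapping is handed to a reference-counted buffer, its `Drop` (and therefore `munmap`) runs only once the last reference is released. A minimal sketch with a hypothetical stand-in type (`FakeMapping` is not memmap2's type and maps nothing):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a memory mapping; counts how often it is dropped.
struct FakeMapping {
    drops: Arc<AtomicUsize>,
}

impl Drop for FakeMapping {
    fn drop(&mut self) {
        // A real mapping would call `munmap` here.
        self.drops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns the drop count observed (a) while a second handle is still alive
/// and (b) after the last handle has been released.
fn drop_counts_with_shared_owner() -> (usize, usize) {
    let drops = Arc::new(AtomicUsize::new(0));
    let owner = Arc::new(FakeMapping { drops: Arc::clone(&drops) });

    // A second handle, e.g. one held by a decoded batch that points into the bytes.
    let batch_handle = Arc::clone(&owner);

    drop(owner);
    let while_shared = drops.load(Ordering::SeqCst); // mapping still alive

    drop(batch_handle);
    let after_last = drops.load(Ordering::SeqCst); // mapping released now

    (while_shared, after_last)
}
```

Under this assumption, any batch that still points into the mapped bytes would keep the whole mapping alive, which would look like `drop` never running.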

2010YOUY01 · 2025-03-26T12:19:35Z

This error is triggered by DataFusion's internal memory tracking component rather than thrown by the OS, so there is likely something wrong with the spill logic. Could you share the draft code?
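
To make the distinction concrete, here is a simplified model of a bookkeeping pool. It is only loosely inspired by DataFusion's memory accounting; the type and method names below are assumptions for illustration, not the real API. The pool never allocates anything. It only counts bytes that callers claim to use, so a reservation that is never shrunk produces an error even when the OS has plenty of memory.

```rust
/// Simplified, hypothetical model of a bookkeeping memory pool.
/// It never allocates; it only counts bytes that callers claim to use.
#[derive(Debug)]
struct TrackedPool {
    limit: usize,
    reserved: usize,
}

#[derive(Debug, PartialEq)]
struct ResourcesExhausted {
    requested: usize,
    available: usize,
}

impl TrackedPool {
    fn new(limit: usize) -> Self {
        Self { limit, reserved: 0 }
    }

    /// Record `bytes` more usage, or fail if the accounted total would exceed the limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), ResourcesExhausted> {
        let available = self.limit - self.reserved;
        if bytes > available {
            return Err(ResourcesExhausted { requested: bytes, available });
        }
        self.reserved += bytes;
        Ok(())
    }

    /// Give back previously recorded bytes.
    fn shrink(&mut self, bytes: usize) {
        self.reserved = self.reserved.saturating_sub(bytes);
    }
}
```

With a limit of 28,841,204 bytes (chosen here for illustration) and 27,430,266 bytes already reserved, a request for 6,643,090 bytes fails with 1,410,938 bytes available, mirroring the numbers in the panic message. After `shrink` returns the earlier reservation, the same request succeeds.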

zebsme · 2025-03-26T12:51:08Z

In this issue, I've been experimenting with replacing StreamWriter/StreamReader with FileWriter/FileReader to evaluate potential performance improvements using mmap.

However, direct replacement seems to cause some other problems; see #14868.

The errors above were encountered with my own FileWriter and FileReader implementation, so I don't think it's an issue with our current spill implementation :)

zebsme · 2025-04-08T07:51:26Z

Hi @2010YOUY01, I created a draft here: zebsme#3.

This draft reproduces the errors, and applying the code changes in the comment resolves all the test failures.

I'm uncertain whether the munmap is responsible for the memory allocation failures. Could you please help review this?

2010YOUY01 · 2025-04-08T08:16:28Z

I took a look, but the reason for those test failures isn’t obvious to me. However, I think the implementation requires two changes:

1. Always use SpillManager's read utility to read spills, instead of using a lower-level utility function in SMJ (I forgot to update the spill read path in refactor: Use SpillManager for all spilling scenarios #15405 🤦‍♂, ~~will update it soon~~ it's hard to refactor the SMJ code, so maybe we can ignore this and use the existing hack).
2. We have to use IPCStreamWriter instead of IPCFileWriter; otherwise it will cause a regression, because spilling dictionary types would no longer be supported. See Use arrow IPC Stream format for spill files #14868. (Unless we can also support spilling dictionaries in IPCFileWriter.)

zebsme · 2025-04-08T09:14:27Z

Thanks for your reply @2010YOUY01. The benchmark shows that StreamReader with mmap has no performance improvement over the current implementation:

spill_io/StreamReader/read_100/
                        time:   [7.1020 ms 7.2847 ms 7.4983 ms]
                        change: [-2.7305% +1.2574% +5.1899%] (p = 0.55 > 0.05)
                        No change in performance detected.

So there is no need to enable mmap for now, since we need the stream format here.
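
For readers unfamiliar with this kind of report, the verdict can be modeled roughly as follows. This is a simplified sketch and an assumption about the decision rule, not the benchmark harness's actual code. The function name and the noise threshold are hypothetical; the 0.05 significance level comes from the output above.

```rust
/// Simplified model of how a benchmark comparison could be classified.
/// `lower`/`upper` bound the relative change in mean time (0.01 means +1%).
fn classify_change(lower: f64, upper: f64, p_value: f64, alpha: f64, noise: f64) -> &'static str {
    if p_value >= alpha {
        // The difference is not statistically significant.
        "no change detected"
    } else if upper < -noise {
        // The whole interval lies below the noise band: faster.
        "improved"
    } else if lower > noise {
        // The whole interval lies above the noise band: slower.
        "regressed"
    } else {
        "within noise threshold"
    }
}
```

Plugging in the reported interval of -2.7305% to +5.1899% with p = 0.55 takes the first branch: the interval straddles zero and the p-value is far above 0.05.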

2010YOUY01 · 2025-04-08T12:10:45Z

Maybe it's already using mmap 🤔 I'll double-check it when I start to look into the performance, thank you for the experiments.

zebsme · 2025-04-08T13:49:35Z

StreamReader with mmap enabled:

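use std::fs::File;
use std::io::Cursor;
use std::path::Path;

use arrow::ipc::reader::StreamReader;
use arrow::record_batch::RecordBatch;
use datafusion_common::{exec_datafusion_err, Result};
use tokio::sync::mpsc::Sender;

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = File::open(path)?;
    // SAFETY: assumes the spill file is not modified or truncated while it is mapped.
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    // SAFETY: DataFusion's spill writer strictly follows Arrow IPC specifications
    // with validated schemas and buffers. Skip redundant validation during read
    // to speed up the read operation. This is safe for DataFusion because the
    // input is guaranteed to be correct when written.
    let reader = unsafe {
        StreamReader::try_new(Cursor::new(mmap), None)?.with_skip_validation(true)
    };
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}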
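
One possible reason this variant shows no speedup (an assumption, not something confirmed in this thread): a reader driven through `std::io::Read` copies bytes out of the mapping into buffers it owns, so wrapping the mapping in a `Cursor` changes where the bytes come from but does not avoid the copy. A std-only sketch, where `FakeMapping` is a hypothetical stand-in for a mapped file:

```rust
use std::io::{Cursor, Read};

/// Hypothetical stand-in for a memory-mapped file: anything viewable as bytes.
struct FakeMapping(Vec<u8>);

impl AsRef<[u8]> for FakeMapping {
    fn as_ref(&self) -> &[u8] {
        &self.0
    }
}

/// Pull `len` bytes through the `Read` interface, as a streaming reader would.
/// Returns the bytes plus whether they still alias the mapping's memory.
fn read_via_cursor(mapping: FakeMapping, len: usize) -> std::io::Result<(Vec<u8>, bool)> {
    let source_ptr = mapping.as_ref().as_ptr();
    let mut cursor = Cursor::new(mapping);

    let mut owned = vec![0u8; len];
    cursor.read_exact(&mut owned)?; // copies out of the mapping

    let aliases_mapping = owned.as_ptr() == source_ptr;
    Ok((owned, aliases_mapping))
}
```

The returned bytes live in a separate allocation, so nothing here is zero-copy. Getting zero-copy reads would require decoding directly from a buffer that wraps the mapping, which as far as I understand is the approach taken in the zero_copy_ipc.rs example linked in the issue description.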

alamb added the enhancement New feature or request label Mar 19, 2025

alamb mentioned this issue Mar 19, 2025

[EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts #15271

Open

18 tasks

github-actions bot assigned zebsme Mar 22, 2025

This was referenced Mar 27, 2025

Improve spill performance: Disable re-validation of spilled files #15454

Merged

Improve spill performance: Disable re-validation of spilled files #15320

Closed

westhide mentioned this issue Mar 28, 2025

Consider using with_skip_validation for shuffle file reading apache/datafusion-ballista#1189

Open
