Further improve datafusion-cli memory usage when a huge value is set for maxrows #14810
Comments
Basically, the idea of this ticket is to print rows as they come in, batch by batch, rather than buffering them all up first. I think this will take some non-trivial work, as the formatter wants to know the width of all cells up front. I believe Postgres does something like "compute column widths based on the first 1000 cells" and then just has a crappy display if the rows after that happen to have wider columns.
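To make the "widths from the first N rows" idea concrete, here is a minimal, standard-library-only sketch. It is not the datafusion-cli implementation and does not use Arrow record batches: rows are plain `Vec<String>`, and the names `stream_table`, `sample_widths`, `format_row`, `border`, and the `sample_size` parameter are all hypothetical. Only the sampled rows are held in memory; every later row is written as soon as it arrives, and a cell wider than the sampled width simply overflows its column.

```rust
use std::io::{self, Write};

/// Column widths derived from the header plus a sample of rows (hypothetical helper).
fn sample_widths(header: &[&str], sample: &[Vec<String>]) -> Vec<usize> {
    let mut widths: Vec<usize> = header.iter().map(|h| h.chars().count()).collect();
    for row in sample {
        for (w, cell) in widths.iter_mut().zip(row) {
            *w = (*w).max(cell.chars().count());
        }
    }
    widths
}

/// Formats one row, left-padding each cell to its column width.
/// Cells wider than the width are not truncated; they just break alignment.
fn format_row<S: AsRef<str>>(cells: &[S], widths: &[usize]) -> String {
    let mut line = String::from("|");
    for (cell, w) in cells.iter().zip(widths) {
        line.push_str(&format!(" {:<width$} |", cell.as_ref(), width = *w));
    }
    line
}

/// Horizontal border such as `+----+------+`.
fn border(widths: &[usize]) -> String {
    let mut line = String::from("+");
    for w in widths {
        line.push_str(&"-".repeat(w + 2));
        line.push('+');
    }
    line
}

/// Buffers only the first `sample_size` rows to fix the column widths,
/// then writes every row (sampled and later ones) without further buffering.
/// Returns the number of rows written.
fn stream_table<W, I>(
    out: &mut W,
    header: &[&str],
    rows: I,
    sample_size: usize,
) -> io::Result<usize>
where
    W: Write,
    I: IntoIterator<Item = Vec<String>>,
{
    let mut rows = rows.into_iter();
    let sample: Vec<Vec<String>> = rows.by_ref().take(sample_size).collect();
    let widths = sample_widths(header, &sample);

    writeln!(out, "{}", border(&widths))?;
    writeln!(out, "{}", format_row(header, &widths))?;
    writeln!(out, "{}", border(&widths))?;

    let mut count = 0;
    for row in sample.into_iter().chain(rows) {
        writeln!(out, "{}", format_row(&row, &widths))?;
        count += 1;
    }
    writeln!(out, "{}", border(&widths))?;
    Ok(count)
}

fn main() -> io::Result<()> {
    // Illustrative data only; the second row is wider than the sampled width.
    let header = ["id", "name"];
    let rows = vec![
        vec!["1".to_string(), "ab".to_string()],
        vec!["2".to_string(), "a-much-wider-value".to_string()],
    ];
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let n = stream_table(&mut out, &header, rows, 1)?;
    writeln!(out, "{} row(s) fetched.", n)?;
    Ok(())
}
```

The trade-off is the one described above: memory stays bounded by the sample size, at the cost of misaligned output for rows whose cells exceed the sampled widths.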
Redesign the datafusion-cli execution and printing path so that output is printed in a fully streaming fashion, without the memory overhead of buffering.
I have submitted the PR and will continue testing the corner cases. The basic tests all pass.
/usr/bin/time -l cargo run --profile release-nonlto -- -m 3G --mem-pool-type fair --maxrows 1 --format table -f '/Users/zhuqi/arrow-datafusion/benchmarks/data/external_sort.sql'
Compiling datafusion-cli v45.0.0 (/Users/zhuqi/arrow-datafusion/datafusion-cli)
Finished `release-nonlto` profile [optimized] target(s) in 23.18s
Running `/Users/zhuqi/arrow-datafusion/target/release-nonlto/datafusion-cli -m 3G --mem-pool-type fair --maxrows 1 --format table -f /Users/zhuqi/arrow-datafusion/benchmarks/data/external_sort.sql`
DataFusion CLI v45.0.0
0 row(s) fetched.
Elapsed 0.005 seconds.
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+------------+--------------+---------------+
| l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_shipdate | l_commitdate | l_receiptdate |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+------------+--------------+---------------+
| 1 | 1551894 | 76910 | 1 | 17.00 | 33078.94 | 0.04 | 0.02 | 1996-03-13 | 1996-02-12 | 1996-03-22 |
| . | . | . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . | . | . |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+------------+--------------+---------------+
59986052 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 4.252 seconds.
32.07 real 12.73 user 6.46 sys
3856441344 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
994646 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
2279 voluntary context switches
98027 involuntary context switches
304329202988 instructions retired
72023941353 cycles elapsed
3847026176 peak memory footprint
/usr/bin/time -l cargo run --profile release-nonlto -- -m 3G --mem-pool-type fair --maxrows 1 --format csv -f '/Users/zhuqi/arrow-datafusion/benchmarks/data/external_sort.sql'
Finished `release-nonlto` profile [optimized] target(s) in 0.32s
Running `/Users/zhuqi/arrow-datafusion/target/release-nonlto/datafusion-cli -m 3G --mem-pool-type fair --maxrows 1 --format csv -f /Users/zhuqi/arrow-datafusion/benchmarks/data/external_sort.sql`
DataFusion CLI v45.0.0
0 row(s) fetched.
Elapsed 0.006 seconds.
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_shipdate,l_commitdate,l_receiptdate
1,1551894,76910,1,17.00,33078.94,0.04,0.02,1996-03-13,1996-02-12,1996-03-22
59986052 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 3.840 seconds.
8.47 real 12.51 user 6.20 sys
3736174592 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
954842 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
1528 voluntary context switches
104430 involuntary context switches
293439737861 instructions retired
69591166435 cycles elapsed
3726603512 peak memory footprint
Even when we stream-print the entire result, the memory usage stays about the same:
/usr/bin/time -l cargo run --profile release-nonlto -- -m 3G --mem-pool-type fair --maxrows inf --format table -f '/Users/zhuqi/arrow-datafusion/benchmarks/data/external_sort.sql'
| 60000000 | 1258565 | 33602 | 2 | 15.00 | 22852.50 | 0.03 | 0.08 | 1997-11-03 | 1997-11-18 | 1997-11-05 |
| 60000000 | 698651 | 48664 | 3 | 46.00 | 75882.52 | 0.00 | 0.06 | 1997-09-04 | 1997-11-12 | 1997-09-05 |
| 60000000 | 224200 | 24201 | 4 | 37.00 | 41595.03 | 0.08 | 0.02 | 1997-11-17 | 1997-11-12 | 1997-12-14 |
| 60000000 | 118838 | 93842 | 5 | 28.00 | 51991.24 | 0.00 | 0.08 | 1997-09-29 | 1997-11-06 | 1997-09-30 |
| 60000000 | 1294851 | 19864 | 6 | 48.00 | 88597.92 | 0.03 | 0.07 | 1997-11-28 | 1997-10-05 | 1997-12-06 |
| 60000000 | 558286 | 33302 | 7 | 12.00 | 16131.12 | 0.02 | 0.05 | 1997-10-09 | 1997-10-27 | 1997-10-21 |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+------------+--------------+---------------+
59986052 row(s) fetched.
Elapsed 309.217 seconds.
336.02 real 150.53 user 58.27 sys
3714342912 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
960345 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8604616 voluntary context switches
104476 involuntary context switches
2973463503658 instructions retired
712748119117 cycles elapsed
3704357512 peak memory footprint
We had to revert this change temporarily to get the 46 release out, so reopening the PR.
Is your feature request related to a problem or challenge?
This is a follow-up to the comment below:
#14766 (comment)
Describe the solution you'd like
Make datafusion-cli print record batches in a streaming fashion as they arrive.
Describe alternatives you've considered
No response
Additional context
No response