Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109

cube2222 · 2022-03-28T22:01:03Z

Describe the bug
I'm running benchmarks for OctoSQL and datafusion-cli is one of the tools I compare against. The previous version I used (0.6.0 I think) did the benchmark in 1.5 second. The new version takes 100 (!!!) seconds. It also prints "0 rows in set", which makes me think this is a CSV decoder regression.

This is based on the nyc yellow taxi dataset.

To Reproduce

curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-04.csv -o taxi.csv

echo "CREATE EXTERNAL TABLE taxi
STORED AS CSV
WITH HEADER ROW
LOCATION './taxi.csv';

SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count" > datafusion_commands.txt

datafusion-cli -f datafusion_commands.txt

Expected behavior
Datafusion is supposed to be much faster.

Additional context
Add any other context about the problem here.

matthewmturner · 2022-03-29T00:53:00Z

Hi @cube2222 - thanks for report. A few quick things that come to mind (I haven't had chance to try replicating or digging into it though so sorry if these are obvious or you already tried):

Can you confirm you are running with the --release flag?
The 0 rows in set likely comes from the CREATE EXTERNAL TABLE command.
I don't see a closing semi colon for the actual query which may prevent it from being run.

cube2222 · 2022-03-29T06:39:59Z

Hey @matthewmturner, sure:

I've installed datafusion-cli through homebrew, I'm assuming that's compiled with --release.
Yes, it does.
Running interactively it's already the CREATE EXTERNAL TABLE statement that takes ages. But you're right, the final query didn't run. With the semicolon, the final query takes 0.4 seconds (so as expected). So it's basically 100 second CREATE EXTERNAL TABLE + 0.4 second query.

Here's the full output:

datafusion-cli
DataFusion CLI v7.0.0
❯ CREATE EXTERNAL TABLE taxi
STORED AS CSV
WITH HEADER ROW
LOCATION './taxi.csv';
0 rows in set. Query took 106.102 seconds.
❯ SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count;
+-----------------+-----------------+------------------------+
| passenger_count | COUNT(UInt8(1)) | AVG(taxi.total_amount) |
+-----------------+-----------------+------------------------+
| 4               | 25510           | 18.452774990199917     |
| 9               | 1               | 113.6                  |
| 0               | 42228           | 17.021401676612687     |
| 5               | 50291           | 17.27092481756182      |
| 8               | 2               | 95.705                 |
|                 | 128020          | 32.237151148258164     |
| 2               | 286461          | 18.097587071189274     |
| 3               | 72852           | 17.915395871081138     |
| 1               | 1533197         | 17.6418833065818       |
| 6               | 32623           | 17.600296416638567     |
| 7               | 2               | 87.17                  |
+-----------------+-----------------+------------------------+
11 rows in set. Query took 0.385 seconds.
❯

matthewmturner · 2022-03-30T02:36:30Z

Indeed, I was able to reproduce the performance regression building from source:

Master (maybe a few commits behind, i havent pulled latest in a few days)

DataFusion CLI v7.0.0
❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
0 rows in set. Query took 84.245 seconds.

7.0.0

DataFusion CLI v7.0.0
❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
0 rows in set. Query took 112.486 seconds.

6.0.0 (I think the version shown when launching is wrong)

DataFusion CLI v5.1.0-SNAPSHOT

❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
0 rows in set. Query took 2.645 seconds.

My guess is that this could be coming from an arrow-rs related change (which handles IO) - but i havent been tracking all the changes in detail there lately. I likely wont have time to dig into this more for a few days.

@alamb does anything come to mind?

alamb · 2022-03-30T17:55:12Z

@alamb does anything come to mind?

Sometimes I have seen this kind of regression when there is no BufReader when doing IO. Let me poke at it a bit

alamb · 2022-03-30T18:30:29Z

I think something in the DF 7.0 line made the number of lines used to infer the schema configurable, and the default changed to use "the whole file".

Thus, in 7.0 the datafusion-cli appears to be parsing the entire CSV file to do schema inference.

When I applied the following diff, the time went from 131.012 seconds locally to 0.076 seconds.

(arrow_dev) alamb@MacBook-Pro-2:~/Software/arrow-datafusion$ git diff
diff --git a/datafusion/core/src/datasource/file_format/csv.rs b/datafusion/core/src/datasource/file_format/csv.rs
index 29ca84a12..c0a6307e8 100644
--- a/datafusion/core/src/datasource/file_format/csv.rs
+++ b/datafusion/core/src/datasource/file_format/csv.rs
@@ -95,7 +95,7 @@ impl FileFormat for CsvFormat {
     async fn infer_schema(&self, mut readers: ObjectReaderStream) -> Result<SchemaRef> {
         let mut schemas = vec![];
 
-        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
+        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(1000);
 
         while let Some(obj_reader) = readers.next().await {
             let mut reader = obj_reader?.sync_reader()?;
(arrow_dev) alamb@MacBook-Pro-2:~/Software/arrow-datafusion$

I suggest we change the default value of schema_infer_max_rec to something sensible like 100 or 1000. I think it is exceedingly rare to need to use more than this.

FYI @jychen7 if you are looking for good candidates for changes to backport for a 7.1 type release, this would be one :)

alamb · 2022-03-30T18:30:40Z

Thanks for the report @cube2222

cube2222 · 2022-03-30T20:25:50Z

Glad it's useful!

jychen7 · 2022-03-31T00:49:51Z

interesting, from the master branch as of now, the default 1000 is set in CsvReadOptions and should be passed when converted CsvReadOptions to ListingOptions
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L56-L66
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/options.rs#L106-L114

However, during DDL, the context use CsvFormat::default() directly, without CsvReadOptions to generate ListingOptions.
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L220-L223
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/execution/context.rs#L239-L245

Haven't read 6.0.0 about why it is fast

jychen7 · 2022-03-31T01:16:29Z

I agree we can set default const 1000 at CsvFormat and use it across the code base, to align with other file format, we probably should set it for JSON as well? (avro and parquet is not applicable)

https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/json.rs#L48-L50

jychen7 · 2022-04-02T14:46:48Z

I agree we can set default const 1000 at CsvFormat and use it across the code base, to align with other file format, we probably should set it for JSON as well? (avro and parquet is not applicable)

I will try create a PR this weekend and see if we can do our first minor release next weekend

Dandandan · 2022-04-03T08:07:41Z

Great find!👍

Another thing might be useful in the future is to optimize inferring inferring the types.

It makes sense it is slower than parsing the CSV, given that we don't know the types, but it sounds it shouldn't be ~100x as slow.

* set default schema infer max record * fix unrelated issue "error: format argument must be a string literal" during `cargo test`

jychen7 · 2022-04-05T02:10:51Z

if you are looking for good candidates for changes to backport for a 7.1 type release, this would be one :)

yes, #2159 is created

* set default schema infer max record * fix unrelated issue "error: format argument must be a string literal" during `cargo test` * fix clippy same as https://github.com/apache/arrow-datafusion/pull/1885/files which already in master

cube2222 added the bug Something isn't working label Mar 28, 2022

alamb changed the title ~~Almost 100x slowdown on 0.7.0 with CSV file~~ Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema Mar 31, 2022

jychen7 mentioned this issue Apr 3, 2022

#2109 By default, use only 1000 rows to infer the schema #2139

Merged

alamb closed this as completed in #2139 Apr 4, 2022

alamb pushed a commit that referenced this issue Apr 4, 2022

#2109 schema infer max (#2139)

69ba713

* set default schema infer max record * fix unrelated issue "error: format argument must be a string literal" during `cargo test`

jychen7 mentioned this issue Apr 5, 2022

#2109 schema infer max #2159

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109

Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109

cube2222 commented Mar 28, 2022 •

edited

Loading

matthewmturner commented Mar 29, 2022

cube2222 commented Mar 29, 2022 •

edited

Loading

matthewmturner commented Mar 30, 2022

alamb commented Mar 30, 2022

alamb commented Mar 30, 2022

alamb commented Mar 30, 2022

cube2222 commented Mar 30, 2022

jychen7 commented Mar 31, 2022 •

edited

Loading

jychen7 commented Mar 31, 2022 •

edited

Loading

jychen7 commented Apr 2, 2022

Dandandan commented Apr 3, 2022

jychen7 commented Apr 5, 2022 •

edited

Loading

Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109

Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109

Comments

cube2222 commented Mar 28, 2022 • edited Loading

matthewmturner commented Mar 29, 2022

cube2222 commented Mar 29, 2022 • edited Loading

matthewmturner commented Mar 30, 2022

alamb commented Mar 30, 2022

alamb commented Mar 30, 2022

alamb commented Mar 30, 2022

cube2222 commented Mar 30, 2022

jychen7 commented Mar 31, 2022 • edited Loading

jychen7 commented Mar 31, 2022 • edited Loading

jychen7 commented Apr 2, 2022

Dandandan commented Apr 3, 2022

jychen7 commented Apr 5, 2022 • edited Loading

cube2222 commented Mar 28, 2022 •

edited

Loading

cube2222 commented Mar 29, 2022 •

edited

Loading

jychen7 commented Mar 31, 2022 •

edited

Loading

jychen7 commented Mar 31, 2022 •

edited

Loading

jychen7 commented Apr 5, 2022 •

edited

Loading