Skip to content

Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cube2222 opened this issue Mar 28, 2022 · 12 comments · Fixed by #2139
Closed
Labels
bug Something isn't working

Comments

@cube2222
Copy link

cube2222 commented Mar 28, 2022

Describe the bug
I'm running benchmarks for OctoSQL and datafusion-cli is one of the tools I compare against. The previous version I used (0.6.0 I think) did the benchmark in 1.5 second. The new version takes 100 (!!!) seconds. It also prints "0 rows in set", which makes me think this is a CSV decoder regression.

This is based on the nyc yellow taxi dataset.

To Reproduce

curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-04.csv -o taxi.csv

echo "CREATE EXTERNAL TABLE taxi
STORED AS CSV
WITH HEADER ROW
LOCATION './taxi.csv';

SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count" > datafusion_commands.txt

datafusion-cli -f datafusion_commands.txt

Expected behavior
Datafusion is supposed to be much faster.

Additional context
Add any other context about the problem here.

@cube2222 cube2222 added the bug Something isn't working label Mar 28, 2022
@matthewmturner
Copy link
Contributor

Hi @cube2222 - thanks for report. A few quick things that come to mind (I haven't had chance to try replicating or digging into it though so sorry if these are obvious or you already tried):

  1. Can you confirm you are running with the --release flag?
  2. The 0 rows in set likely comes from the CREATE EXTERNAL TABLE command.
  3. I don't see a closing semi colon for the actual query which may prevent it from being run.

@cube2222
Copy link
Author

cube2222 commented Mar 29, 2022

Hey @matthewmturner, sure:

  1. I've installed datafusion-cli through homebrew, I'm assuming that's compiled with --release.
  2. Yes, it does.
  3. Running interactively it's already the CREATE EXTERNAL TABLE statement that takes ages. But you're right, the final query didn't run. With the semicolon, the final query takes 0.4 seconds (so as expected). So it's basically 100 second CREATE EXTERNAL TABLE + 0.4 second query.

Here's the full output:

datafusion-cli
DataFusion CLI v7.0.0
❯ CREATE EXTERNAL TABLE taxi
STORED AS CSV
WITH HEADER ROW
LOCATION './taxi.csv';
0 rows in set. Query took 106.102 seconds.
❯ SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count;
+-----------------+-----------------+------------------------+
| passenger_count | COUNT(UInt8(1)) | AVG(taxi.total_amount) |
+-----------------+-----------------+------------------------+
| 4               | 25510           | 18.452774990199917     |
| 9               | 1               | 113.6                  |
| 0               | 42228           | 17.021401676612687     |
| 5               | 50291           | 17.27092481756182      |
| 8               | 2               | 95.705                 |
|                 | 128020          | 32.237151148258164     |
| 2               | 286461          | 18.097587071189274     |
| 3               | 72852           | 17.915395871081138     |
| 1               | 1533197         | 17.6418833065818       |
| 6               | 32623           | 17.600296416638567     |
| 7               | 2               | 87.17                  |
+-----------------+-----------------+------------------------+
11 rows in set. Query took 0.385 seconds.
❯

@matthewmturner
Copy link
Contributor

Indeed, I was able to reproduce the performance regression building from source:

Master (maybe a few commits behind, i havent pulled latest in a few days)

DataFusion CLI v7.0.0
❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
0 rows in set. Query took 84.245 seconds.

7.0.0

DataFusion CLI v7.0.0
❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
0 rows in set. Query took 112.486 seconds.

6.0.0 (I think the version shown when launching is wrong)

DataFusion CLI v5.1.0-SNAPSHOT

❯ CREATE EXTERNAL TABLE taxi STORED AS CSV WITH HEADER ROW LOCATION './taxi.csv';
0 rows in set. Query took 2.645 seconds.

My guess is that this could be coming from an arrow-rs related change (which handles IO) - but i havent been tracking all the changes in detail there lately. I likely wont have time to dig into this more for a few days.

@alamb does anything come to mind?

@alamb
Copy link
Contributor

alamb commented Mar 30, 2022

@alamb does anything come to mind?

Sometimes I have seen this kind of regression when there is no BufReader when doing IO. Let me poke at it a bit

@alamb
Copy link
Contributor

alamb commented Mar 30, 2022

I think something in the DF 7.0 line made the number of lines used to infer the schema configurable, and the default changed to use "the whole file".

Thus, in 7.0 the datafusion-cli appears to be parsing the entire CSV file to do schema inference.

When I applied the following diff, the time went from 131.012 seconds locally to 0.076 seconds.

(arrow_dev) alamb@MacBook-Pro-2:~/Software/arrow-datafusion$ git diff
diff --git a/datafusion/core/src/datasource/file_format/csv.rs b/datafusion/core/src/datasource/file_format/csv.rs
index 29ca84a12..c0a6307e8 100644
--- a/datafusion/core/src/datasource/file_format/csv.rs
+++ b/datafusion/core/src/datasource/file_format/csv.rs
@@ -95,7 +95,7 @@ impl FileFormat for CsvFormat {
     async fn infer_schema(&self, mut readers: ObjectReaderStream) -> Result<SchemaRef> {
         let mut schemas = vec![];
 
-        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
+        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(1000);
 
         while let Some(obj_reader) = readers.next().await {
             let mut reader = obj_reader?.sync_reader()?;
(arrow_dev) alamb@MacBook-Pro-2:~/Software/arrow-datafusion$ 

I suggest we change the default value of schema_infer_max_rec to something sensible like 100 or 1000. I think it is exceedingly rare to need to use more than this.

FYI @jychen7 if you are looking for good candidates for changes to backport for a 7.1 type release, this would be one :)

@alamb
Copy link
Contributor

alamb commented Mar 30, 2022

Thanks for the report @cube2222

@cube2222
Copy link
Author

Glad it's useful!

@jychen7
Copy link
Contributor

jychen7 commented Mar 31, 2022

@jychen7
Copy link
Contributor

jychen7 commented Mar 31, 2022

I agree we can set default const 1000 at CsvFormat and use it across the code base, to align with other file format, we probably should set it for JSON as well? (avro and parquet is not applicable)

https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/json.rs#L48-L50

@alamb alamb changed the title Almost 100x slowdown on 0.7.0 with CSV file Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema Mar 31, 2022
@jychen7
Copy link
Contributor

jychen7 commented Apr 2, 2022

I agree we can set default const 1000 at CsvFormat and use it across the code base, to align with other file format, we probably should set it for JSON as well? (avro and parquet is not applicable)

I will try create a PR this weekend and see if we can do our first minor release next weekend

@Dandandan
Copy link
Contributor

Great find!👍

Another thing might be useful in the future is to optimize inferring inferring the types.

It makes sense it is slower than parsing the CSV, given that we don't know the types, but it sounds it shouldn't be ~100x as slow.

alamb pushed a commit that referenced this issue Apr 4, 2022
* set default schema infer max record

* fix unrelated issue "error: format argument must be a string literal" during `cargo test`
@jychen7
Copy link
Contributor

jychen7 commented Apr 5, 2022

if you are looking for good candidates for changes to backport for a 7.1 type release, this would be one :)

yes, #2159 is created

alamb pushed a commit that referenced this issue Apr 5, 2022
* set default schema infer max record

* fix unrelated issue "error: format argument must be a string literal" during `cargo test`

* fix clippy

same as https://github.com/apache/arrow-datafusion/pull/1885/files which already in master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants