Almost 100x slowdown on 0.7.0 with CSV file due to parsing entire file to infer schema #2109
Comments
Hi @cube2222 - thanks for the report. A few quick things that come to mind (I haven't had a chance to try replicating or digging into it, so sorry if these are obvious or you already tried them):
Hey @matthewmturner, sure:
Here's the full output:
Indeed, I was able to reproduce the performance regression building from source:
* Master (maybe a few commits behind, I haven't pulled the latest in a few days)
* 7.0.0
* 6.0.0 (I think the version shown when launching is wrong)
My guess is that this could be coming from an arrow-rs related change (which handles IO), but I haven't been tracking all the changes there in detail lately. I likely won't have time to dig into this more for a few days. @alamb, does anything come to mind?
Sometimes I have seen this kind of regression when there is no [...]
I think something in the DF 7.0 line made the number of rows used to infer the schema configurable, and the default changed to "the whole file". Thus, in 7.0 datafusion-cli appears to be parsing the entire CSV file to do schema inference. When I applied the following diff, the time went from 131.012 seconds locally to 0.076 seconds:

```diff
diff --git a/datafusion/core/src/datasource/file_format/csv.rs b/datafusion/core/src/datasource/file_format/csv.rs
index 29ca84a12..c0a6307e8 100644
--- a/datafusion/core/src/datasource/file_format/csv.rs
+++ b/datafusion/core/src/datasource/file_format/csv.rs
@@ -95,7 +95,7 @@ impl FileFormat for CsvFormat {
     async fn infer_schema(&self, mut readers: ObjectReaderStream) -> Result<SchemaRef> {
         let mut schemas = vec![];
-        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
+        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(1000);
         while let Some(obj_reader) = readers.next().await {
             let mut reader = obj_reader?.sync_reader()?;
```

I suggest we change the default value of `schema_infer_max_rec`.

FYI @jychen7, if you are looking for good candidates for changes to backport for a 7.1-type release, this would be one :)
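The diff above caps how many records are scanned before inference stops. As a self-contained sketch of the same idea (toy code with hypothetical names, using only the standard library; this is not DataFusion's actual reader), the key point is that the `Option<usize>` limit defaults to a bounded value instead of `usize::MAX`:

```rust
/// Candidate column types, from most to least specific.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColType {
    Int64,
    Float64,
    Utf8,
}

/// Guess the narrowest type that can hold one field.
fn infer_field(s: &str) -> ColType {
    if s.trim().parse::<i64>().is_ok() {
        ColType::Int64
    } else if s.trim().parse::<f64>().is_ok() {
        ColType::Float64
    } else {
        ColType::Utf8
    }
}

/// Widen `a` so it also covers `b` (Int64 < Float64 < Utf8).
fn unify(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Utf8, _) | (_, Utf8) => Utf8,
        (Float64, _) | (_, Float64) => Float64,
        _ => Int64,
    }
}

/// Infer column types from at most `max_records` CSV rows.
/// `max_records.unwrap_or(1000)` mirrors the diff's new default;
/// the old behavior was effectively `unwrap_or(usize::MAX)`.
fn infer_schema(rows: &[&str], max_records: Option<usize>) -> Vec<ColType> {
    let limit = max_records.unwrap_or(1000);
    let mut cols: Vec<ColType> = Vec::new();
    for row in rows.iter().take(limit) {
        for (i, field) in row.split(',').enumerate() {
            let t = infer_field(field);
            match cols.get_mut(i) {
                Some(c) => *c = unify(*c, t),
                None => cols.push(t),
            }
        }
    }
    cols
}

fn main() {
    let rows = ["1,2.5,hello", "2,3,world", "3,4.0,x"];
    // → [Int64, Float64, Utf8]
    println!("{:?}", infer_schema(&rows, None));
}
```

Because the work is proportional to the number of rows scanned, bounding the default turns inference from O(file size) into O(limit), which is exactly the 131 s → 0.076 s change reported above.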
Thanks for the report @cube2222
Glad it's useful!
Interesting, from the [...]. However, during DDL, the context uses [...]. I haven't read the 6.0.0 code to see why it is fast.
I agree we can set a default const:
https://github.com/apache/arrow-datafusion/blob/c43b9ab9922ccbaaf6fe6f27e3d31201989edb1e/datafusion/core/src/datasource/file_format/csv.rs#L46-L54
I will try to create a PR this weekend and see if we can do our first minor release next weekend.
Great find! 👍 Another thing that might be useful in the future is to optimize inferring the types. It makes sense that it is slower than parsing the CSV, given that we don't know the types, but it sounds like it shouldn't be ~100x as slow.
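To get a rough feel for the per-row overhead of inference, here is a toy micro-benchmark (synthetic single-column float data; this is not arrow-rs's actual inference code). With a known schema each field is parsed once; inference-style probing tries `i64` first and falls back to `f64`, so the extra cost is a small constant factor per field, which is consistent with the point that a ~100x gap must come from somewhere else (here, scanning the whole file):

```rust
use std::time::Instant;

fn main() {
    // Synthetic float fields: every value fails the i64 probe.
    let fields: Vec<String> = (0..1_000_000).map(|i| format!("{i}.25")).collect();

    // Known schema: one f64 parse per field.
    let t = Instant::now();
    let mut sum = 0.0f64;
    for f in &fields {
        sum += f.parse::<f64>().unwrap();
    }
    let known = t.elapsed();

    // Inference-style probing: attempt i64, then fall back to f64.
    let t = Instant::now();
    let mut floats = 0usize;
    for f in &fields {
        if f.parse::<i64>().is_ok() {
            // would record Int64
        } else if f.parse::<f64>().is_ok() {
            floats += 1; // records Float64
        }
    }
    let probed = t.elapsed();

    println!("sum={sum:.2} floats={floats}");
    println!("known-schema parse: {known:?}, probing: {probed:?}");
}
```

On this worst case (every `i64` probe fails), probing is roughly 2x the known-schema parse, not 100x; the big win in this issue comes from capping how many rows are scanned at all.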
* set default schema infer max record
* fix unrelated issue "error: format argument must be a string literal" during `cargo test`
yes, #2159 is created |
* set default schema infer max record
* fix unrelated issue "error: format argument must be a string literal" during `cargo test`
* fix clippy, same as https://github.com/apache/arrow-datafusion/pull/1885/files which is already in master
Describe the bug
I'm running benchmarks for OctoSQL, and datafusion-cli is one of the tools I compare against. The previous version I used (0.6.0, I think) did the benchmark in 1.5 seconds. The new version takes 100 (!!!) seconds. It also prints "0 rows in set", which makes me think this is a CSV decoder regression.
This is based on the NYC yellow taxi dataset.
To Reproduce
Expected behavior
DataFusion is supposed to be much faster.