@getChan
Contributor

@getChan getChan commented Oct 1, 2025

Which issue does this PR close?

todo list

  1. Review whether we can remove public APIs from our own implementations.
  2. Apply the arrow-avro 56.0.0 release.
  3. Add more tests.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the labels common (Related to common crate) and datasource (Changes to the datasource crate) on Oct 1, 2025
@alamb
Contributor

alamb commented Oct 2, 2025

❤️ amazing! Thank you @getChan
FYI @jecsand838 and @nathaniel-d-ef

@alamb
Contributor

alamb commented Oct 19, 2025

Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion?

@getChan
Contributor Author

getChan commented Oct 19, 2025

> Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion?

No, not yet. Thanks for the release.

@nathaniel-d-ef

Thanks for jumping on this @getChan; let me know if I can help!

@github-actions bot removed the label common (Related to common crate) on Oct 27, 2025
@alamb
Contributor

alamb commented Oct 29, 2025

FYI I merged the arrow 57 upgrade to DataFusion -- so if you rebase this PR against main you'll have access to the new arrow-avro crate

@github-actions bot added the labels core (Core DataFusion crate), common (Related to common crate), and proto (Related to proto crate) on Oct 29, 2025
@getChan
Contributor Author

getChan commented Nov 10, 2025

> Hey @getChan, I came across this same issue on Friday while working on an implementation of the writer. The arrow-avro reader can absolutely handle a schema with a custom name; there are thorough tests in the crate that demonstrate this. What I think is going on here is that the name is lost in the Original Avro -> DataFusion SchemaRef -> Projected Avro process. The AvroSchema::try_from() in AvroSource generates a projected schema without the name. We need that optimized schema in order to provide the ReaderBuilder with the correct projection. In other words, it works fine when passing a named schema directly to the ReaderBuilder, but not one that has been funneled through the optimizations of DataFusion via Arrow, where some contextual information is lost.
>
> @jecsand838 any thoughts on this?

Thanks for the help.
As you said, my implementation injects a schema into arrow-avro to perform projection, but doing so causes the metadata to be lost.
As a result, if I provide a schema for projection, I cannot read the custom record name; conversely, if I don't provide a schema so that I can read the custom record name, I can't perform projection.
I'll try to find another solution.

# Conflicts:
#	datafusion/datasource-avro/src/source.rs
@jecsand838

jecsand838 commented Nov 10, 2025

> Thanks for the help.
> As you said, my implementation injects a schema into arrow-avro to perform projection, but doing so causes the metadata to be lost.
> As a result, if I provide a schema for projection, I cannot read the custom record name; conversely, if I don't provide a schema so that I can read the custom record name, I can't perform projection.
> I'll try to find another solution.

Coming in a little bit late to this. @getChan Basically the way we treat "projection" in arrow-avro is as a function of schema resolution. To utilize it (for now), you'll need to use a reader schema as your projection schema and a valid writer schema (which 100% matches the underlying Avro data as written) as your base schema.
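
To make "projection as schema resolution" concrete, here is a minimal sketch in plain Rust with no arrow-avro dependency. It follows the record rule from the Avro specification: fields are matched by name, and writer fields that the reader schema does not mention are decoded and discarded. `Step` and `plan_projection` are hypothetical names used only for illustration, not arrow-avro APIs, and default values for reader-only fields are deliberately left out.

```rust
/// What a decoder does with one writer field, in writer (on-disk) order.
#[derive(Debug, PartialEq)]
enum Step {
    /// Decode the value into the reader column at this index.
    Read(usize),
    /// Decode past the value and discard it.
    Skip,
}

/// Hypothetical helper: plan a projection by matching field names.
/// Returns an error if the reader asks for a field the writer never wrote
/// (a real resolver would fall back to the reader field's default, if any).
fn plan_projection(writer: &[&str], reader: &[&str]) -> Result<Vec<Step>, String> {
    for name in reader {
        if !writer.contains(name) {
            return Err(format!("reader field `{name}` missing from writer schema"));
        }
    }
    Ok(writer
        .iter()
        .map(|w| match reader.iter().position(|r| r == w) {
            Some(idx) => Step::Read(idx),
            None => Step::Skip,
        })
        .collect())
}

fn main() {
    // Writer schema as stored in the file; reader schema as the projection.
    let writer = ["id", "bool_col", "int_col"];
    let reader = ["int_col", "id"];
    println!("{:?}", plan_projection(&writer, &reader));
}
```

The plan has one entry per writer field because every value in the file still has to be walked, even when it is skipped. This is why the writer schema must match the data exactly.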

I think to get this working we'll need to extend arrow-avro with a new with_projection() method that's more compatible with DataFusion and builds the correct decoder path under the hood.

Also have you tried setting the AVRO_NAME_METADATA_KEY at the root level to force the rootRecordName? Refer to this link. If that doesn't work then for a short term solution, is it possible to clone the writer schema and ad hoc prune the fields? Then use that as the reader schema?

What I mean by that is you can craft the correct Reader Schema json string and then use AvroSchema::new(). AvroSchema is fundamentally just a wrapper around a valid Avro Schema JSON string.

There are two ways I can see this working:

  1. Create the Reader JSON string directly from the Writer AvroSchema (the field for the json is pub).
  2. Use the AvroSchema::try_from() and then post process the correct metadata in.
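
A rough sketch of the first option, under simplifying assumptions: the writer schema is modelled as a record name plus (field name, type JSON) pairs instead of being parsed from the JSON string, and `RecordSchema`, `project`, and `to_json` are hypothetical helpers rather than arrow-avro types. A real implementation would parse the writer JSON with a JSON library. The sketch only illustrates that pruning fields from the writer schema keeps the root record name; the resulting string could then be passed to AvroSchema::new() as the reader schema.

```rust
/// Hypothetical, simplified model of a top-level Avro record schema.
#[derive(Clone, Debug)]
struct RecordSchema {
    name: String,
    /// (field name, field type as raw Avro JSON)
    fields: Vec<(String, String)>,
}

impl RecordSchema {
    /// Keep only the requested fields, in the requested order.
    /// The record name is carried over unchanged; unknown names are ignored.
    fn project(&self, keep: &[&str]) -> RecordSchema {
        let fields = keep
            .iter()
            .filter_map(|k| self.fields.iter().find(|(n, _)| n.as_str() == *k).cloned())
            .collect();
        RecordSchema { name: self.name.clone(), fields }
    }

    /// Render as an Avro schema JSON string.
    /// Names are assumed to need no JSON escaping.
    fn to_json(&self) -> String {
        let fields: Vec<String> = self
            .fields
            .iter()
            .map(|(n, t)| format!(r#"{{"name":"{n}","type":{t}}}"#))
            .collect();
        format!(
            r#"{{"type":"record","name":"{}","fields":[{}]}}"#,
            self.name,
            fields.join(",")
        )
    }
}

fn main() {
    let writer = RecordSchema {
        name: "topLevelRecord".to_string(),
        fields: vec![
            ("id".to_string(), r#"["int","null"]"#.to_string()),
            ("bool_col".to_string(), r#"["boolean","null"]"#.to_string()),
            ("bigint_col".to_string(), r#"["long","null"]"#.to_string()),
        ],
    };
    // Reader schema: a subset of the writer fields under the same record name.
    let reader = writer.project(&["bigint_col", "id"]);
    println!("{}", reader.to_json());
}
```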

> The AvroSchema::try_from() in AvroSource generates a projected schema without the name. We need that optimized schema in order to provide the ReaderBuilder with the correct projection. In other words, it works fine when passing a named schema directly to the ReaderBuilder, but not one that has been funneled through the optimizations of DataFusion via Arrow, where some contextual information is lost.

I'll create an arrow-rs ticket to add an AvroSchemaBuilder to arrow-avro and improve this. The reason the public API here is so tight is that I planned to refactor most of this logic and didn't want to cause breaking changes.

CC: @nathaniel-d-ef

@nathaniel-d-ef

nathaniel-d-ef commented Nov 18, 2025

@jecsand838

I had a chance to get back to this today to try to find a workaround. Unless I'm missing something, the writer schema from the ReaderBuilder inference (from the Avro file, with the correct top-level name) isn't exposed in a way that we can use in DataFusion. I'm curious if you've had success, @getChan?

I think this effort is blocked until we can make the arrow-avro modifications.

@getChan
Contributor Author

getChan commented Nov 19, 2025

@nathaniel-d-ef
No, not yet. I'm still looking for a solution.
Whether or not projection is applied, I couldn't retrieve the schema metadata when reading the file with ReaderBuilder.
I don't know the arrow-avro internals very well, so I'm still investigating. I'll share a solution once I find one.

        let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
        println!("AVRO READER METADATA : {:?}", avro_reader.schema().metadata); // prints {}

@jecsand838

jecsand838 commented Nov 25, 2025

> @nathaniel-d-ef
> No, not yet. I'm still looking for a solution.
> Whether or not projection is applied, I couldn't retrieve the schema metadata when reading the file with ReaderBuilder.
> I don't know the arrow-avro internals very well, so I'm still investigating. I'll share a solution once I find one.
>
>         let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
>         println!("AVRO READER METADATA : {:?}", avro_reader.schema().metadata); // prints {}

@getChan Try using avro_header instead to get the OCF Header:

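        // Import paths below are assumed from the arrow-avro crate layout.
        use arrow_avro::reader::ReaderBuilder;
        use arrow_avro::schema::{AvroSchema, SCHEMA_METADATA_KEY};
        use std::io::BufReader;

        let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
        // The OCF header holds the raw file metadata, including the writer schema.
        let header = avro_reader.avro_header();
        println!("\nAVRO HEADER METADATA BYTES: {:?}", header.metadata().collect::<Vec<_>>());
        // The writer schema JSON is stored under the "avro.schema" key.
        let schema_bytes = header
            .get(SCHEMA_METADATA_KEY.as_bytes())
            .expect("missing avro.schema metadata");
        let writer_avro = AvroSchema::new(
            std::str::from_utf8(schema_bytes)
                .expect("avro.schema is not valid UTF-8")
                .to_string(),
        );
        println!("AVRO HEADER SCHEMA METADATA : {:?}", writer_avro);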

You should see an output like this:

AVRO HEADER METADATA BYTES: [([97, 118, 114, 111, 46, 115, 99, 104, 101, 109, 97], [123, 34, 116, 121, 112, 101, 34, 58, 34, 114, 101, 99, 111, 114, 100, 34, 44, 34, 110, 97, 109, 101, 34, 58, 34, 116, 111, 112, 76, 101, 118, 101, 108, 82, 101, 99, 111, 114, 100, 34, 44, 34, 102, 105, 101, 108, 100, 115, 34, 58, 91, 123, 34, 110, 97, 109, 101, 34, 58, 34, 105, 100, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 98, 111, 111, 108, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 98, 111, 111, 108, 101, 97, 110, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 116, 105, 110, 121, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 115, 109, 97, 108, 108, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 98, 105, 103, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 108, 111, 110, 103, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 102, 108, 111, 97, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 102, 108, 111, 97, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 100, 111, 117, 98, 108, 101, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 100, 111, 117, 98, 108, 101, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 100, 97, 116, 101, 95, 115, 116, 114, 105, 110, 103, 
95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 98, 121, 116, 101, 115, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 115, 116, 114, 105, 110, 103, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 98, 121, 116, 101, 115, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 116, 105, 109, 101, 115, 116, 97, 109, 112, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 123, 34, 116, 121, 112, 101, 34, 58, 34, 108, 111, 110, 103, 34, 44, 34, 108, 111, 103, 105, 99, 97, 108, 84, 121, 112, 101, 34, 58, 34, 116, 105, 109, 101, 115, 116, 97, 109, 112, 45, 109, 105, 99, 114, 111, 115, 34, 125, 44, 34, 110, 117, 108, 108, 34, 93, 125, 93, 125]), ([111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 118, 101, 114, 115, 105, 111, 110], [51, 46, 49, 46, 50]), ([97, 118, 114, 111, 46, 99, 111, 100, 101, 99], [115, 110, 97, 112, 112, 121])]


AVRO HEADER SCHEMA METADATA : AvroSchema { json_string: "{\"type\":\"record\",\"name\":\"topLevelRecord\",\"fields\":[{\"name\":\"id\",\"type\":[\"int\",\"null\"]},{\"name\":\"bool_col\",\"type\":[\"boolean\",\"null\"]},{\"name\":\"tinyint_col\",\"type\":[\"int\",\"null\"]},{\"name\":\"smallint_col\",\"type\":[\"int\",\"null\"]},{\"name\":\"int_col\",\"type\":[\"int\",\"null\"]},{\"name\":\"bigint_col\",\"type\":[\"long\",\"null\"]},{\"name\":\"float_col\",\"type\":[\"float\",\"null\"]},{\"name\":\"double_col\",\"type\":[\"double\",\"null\"]},{\"name\":\"date_string_col\",\"type\":[\"bytes\",\"null\"]},{\"name\":\"string_col\",\"type\":[\"bytes\",\"null\"]},{\"name\":\"timestamp_col\",\"type\":[{\"type\":\"long\",\"logicalType\":\"timestamp-micros\"},\"null\"]}]}" }
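
The byte arrays in the first output line are UTF-8 key/value pairs from the OCF header. A small std-only sketch for printing them in readable form follows. `decode_pair` is a hypothetical helper, not part of arrow-avro, and the sample bytes are the last two pairs copied from the output above.

```rust
/// Hypothetical helper: turn one (key, value) byte pair into strings.
/// Invalid UTF-8 is replaced rather than treated as an error.
fn decode_pair(key: &[u8], value: &[u8]) -> (String, String) {
    (
        String::from_utf8_lossy(key).into_owned(),
        String::from_utf8_lossy(value).into_owned(),
    )
}

fn main() {
    // The last two metadata pairs from the header dump above.
    let version_key: &[u8] = &[
        111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 118,
        101, 114, 115, 105, 111, 110,
    ];
    let version_value: &[u8] = &[51, 46, 49, 46, 50];
    let codec_key: &[u8] = &[97, 118, 114, 111, 46, 99, 111, 100, 101, 99];
    let codec_value: &[u8] = &[115, 110, 97, 112, 112, 121];

    for (k, v) in [(version_key, version_value), (codec_key, codec_value)] {
        let (k, v) = decode_pair(k, v);
        println!("{k} = {v}");
    }
    // Prints:
    // org.apache.spark.version = 3.1.2
    // avro.codec = snappy
}
```

The first pair in the dump decodes the same way: its key is avro.schema and its value is the writer schema JSON shown in the second output line.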


@jecsand838 jecsand838 left a comment

@getChan @nathaniel-d-ef

I think I found an approach to get this working. I added comments detailing the suggested changes to make and it all seems to work for me locally. With that said, I'm still fairly new to this codebase, so I apologize in advance if I'm missing something.

Let me know what you think and if this solves the projection issue.

----
logical_plan TableScan: avro_table projection=[f1, f2, f3]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/avro/simple_enum.avro]]}, projection=[f1, f2, f3], file_type=avro
physical_plan DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:0..103], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:103..206], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:206..309], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:309..411]]}, projection=[f1, f2, f3], file_type=avro
Contributor

this is pretty neat

# Conflicts:
#	Cargo.lock
#	Cargo.toml
#	datafusion/datasource-avro/src/avro_to_arrow/arrow_array_reader.rs
#	datafusion/datasource-avro/src/avro_to_arrow/schema.rs
#	datafusion/datasource-avro/src/source.rs
@getChan
Contributor Author

getChan commented Dec 20, 2025

There is a compatibility issue with projection. I'm waiting for a release of arrow-avro that includes the necessary projection features.

  • DataFusion does not infer the schema from the file when the table schema is explicitly defined.
  • arrow-avro requires reading the Avro file metadata (avro.schema) to perform projection.
  • Consequently, projection is problematic when reading Avro tables with explicitly defined schemas.

Please let me know if my understanding is incorrect or if there is a workaround.

@alamb
Contributor

alamb commented Dec 20, 2025

Thanks for the update @getChan

If the fix is already included in arrow-avro (and you are waiting on a release), you could rebase this PR against this branch #19355 to get access to the pre-release code.

We would have to wait for the arrow release to actually merge it but it could potentially help unblock your work

I actually would love to get some validation that we can cut over to the new arrow-avro reader before we make the next arrow release (so we can fix any issue that might be found)

@jecsand838

> Thanks for the update @getChan
>
> If the fix is already included in arrow-avro (and you are waiting on a release), you could rebase this PR against this branch #19355 to get access to the pre-release code.
>
> We would have to wait for the arrow release to actually merge it but it could potentially help unblock your work
>
> I actually would love to get some validation that we can cut over to the new arrow-avro reader before we make the next arrow release (so we can fix any issue that might be found)

@alamb I'm going to start working on apache/arrow-rs#8923 early next week and should have a PR up before Jan 1st.

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Use arrow-avro for performance and improved type support

4 participants