Support VARIANT type for unstructured data #16116

Open
3 tasks
alamb opened this issue May 20, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented May 20, 2025

Is your feature request related to a problem or challenge?

Processing semi-structured data (basically think anything that can be represented in JSON) efficiently is becoming more and more important.

As @wjones127 says in https://github.com/apache/datafusion/issues/10987:

This would be a high-performance data type for semi-structured data, designed for better OLAP performance than JSON or BSON (discussed in #7845).

While it is certainly possible to implement semi-structured, JSON, and even Variant support today using the DataFusion extension APIs (e.g. https://github.com/datafusion-contrib/datafusion-functions-json), this ticket tracks adding such support to DataFusion itself.

Parquet recently adopted the Variant type: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

Other systems, such as Iceberg and Spark, are adopting Variant as well.

I think Databricks did a good job describing its rationale:

Without Variant, customers had to choose between flexibility and performance. To maintain flexibility, customers would store JSON in single columns as strings. To see better performance, customers would apply strict schematizing approaches with structs, which requires separate processes to maintain and update with schema changes. With Variant, customers can retain flexibility (there's no need to define an explicit schema) and receive vastly improved performance compared to querying the JSON as a string.
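To make the flexibility half of that trade-off concrete, here is a minimal sketch of a self-describing value type in Rust. Everything in it is an assumption made for illustration: the `Variant` enum, `get_path`, `object`, and `example_rows` are hypothetical names, and this in-memory model is not the Parquet Variant binary encoding or any existing DataFusion or arrow-rs API. It only shows how rows of different shapes can share one column and still be queried by path without re-parsing a JSON string on every access.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory value model, for illustration only.
/// This is NOT the Parquet Variant binary encoding and NOT a DataFusion API.
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Null,
    Bool(bool),
    Int(i64),
    Str(String),
    List(Vec<Variant>),
    Object(BTreeMap<String, Variant>),
}

impl Variant {
    /// Follow a sequence of object field names. Returns `None` if a field is
    /// missing or if a step is applied to a non-object value.
    fn get_path(&self, path: &[&str]) -> Option<&Variant> {
        let mut current = self;
        for key in path {
            match current {
                Variant::Object(fields) => current = fields.get(*key)?,
                _ => return None,
            }
        }
        Some(current)
    }
}

/// Build an object value from (name, value) pairs.
fn object(fields: Vec<(&str, Variant)>) -> Variant {
    Variant::Object(
        fields
            .into_iter()
            .map(|(name, value)| (name.to_string(), value))
            .collect(),
    )
}

/// Two rows with different shapes that could live in the same column,
/// with no schema declared up front.
fn example_rows() -> Vec<Variant> {
    vec![
        object(vec![("user", object(vec![("id", Variant::Int(1))]))]),
        object(vec![
            (
                "user",
                object(vec![
                    ("id", Variant::Int(2)),
                    ("name", Variant::Str("a".to_string())),
                ]),
            ),
            ("tags", Variant::List(vec![Variant::Bool(true), Variant::Null])),
        ]),
    ]
}
```

With these rows, `get_path(&["user", "id"])` resolves on both, while `get_path(&["user", "name"])` yields `None` for the first row instead of failing, which is the behavior a strict struct schema would not give without a schema change. The performance half of the quote depends on a compact binary layout and is not modeled here.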

Describe the solution you'd like

No response

Describe alternatives you've considered

This will be a big project. Here are some of the related pre-requisites

It is not clear to me whether Variant should be "built in" or an add-on (for example, a variant feature and a datafusion-variant crate).

Additional context

Related tickets

@alamb alamb added the enhancement New feature or request label May 20, 2025