Support `VARIANT` type for unstructured data #16116

alamb · 2025-05-20T15:22:46Z

Is your feature request related to a problem or challenge?

Processing semi-structured data (basically think anything that can be represented in JSON) efficiently is becoming more and more important.

As @wjones127 says in https://github.com/apache/datafusion/issues/10987:

This would be a high-performance data type for semi-structured data, designed for better OLAP performance than JSON or BSON (discussed in #7845).

While it is certainly possible to implement semi-structured, JSON and even Variant support today using the DataFusion extension APIs (e.g. https://github.com/datafusion-contrib/datafusion-functions-json), this ticket tracks adding such support to DataFusion itself.

Parquet recently adopted the Variant type: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

We see adoption of this in other systems as well, such as Iceberg and Spark.

Variant Data Type Support iceberg#10392

I think Databricks did a good job describing its rationale:

https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark

Without Variant, customers had to choose between flexibility and performance. To maintain flexibility, customers would store JSON in single columns as strings. To see better performance, customers would apply strict schematizing approaches with structs, which requires separate processes to maintain and update with schema changes. With Variant, customers can retain flexibility (there's no need to define an explicit schema) and receive vastly improved performance compared to querying the JSON as a string.
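To make the trade-off in that quote concrete, here is a minimal sketch of the idea, not the Parquet Variant binary encoding and not an existing DataFusion API. Every name in it (`ToyVariant`, `get_path`, `sum_int_at`, `sample_column`) is hypothetical and invented for illustration. It assumes a column whose rows are self-describing values, so rows can differ in shape without a declared schema, while a query still reads typed fields directly instead of re-parsing JSON text for each row.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value (hypothetical, for illustration only).
/// The real Variant format is a compact binary encoding; this enum only
/// models the "self-describing, schema-free" behavior.
#[derive(Debug, Clone, PartialEq)]
enum ToyVariant {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, ToyVariant>),
}

impl ToyVariant {
    /// Follow a sequence of object keys; returns None if any step is
    /// missing or if a non-object value is reached before the path ends.
    fn get_path(&self, path: &[&str]) -> Option<&ToyVariant> {
        let mut cur = self;
        for key in path {
            match cur {
                ToyVariant::Object(fields) => cur = fields.get(*key)?,
                _ => return None,
            }
        }
        Some(cur)
    }
}

/// Convenience constructor for object values.
fn obj(fields: Vec<(&str, ToyVariant)>) -> ToyVariant {
    ToyVariant::Object(
        fields
            .into_iter()
            .map(|(k, v)| (k.to_string(), v))
            .collect(),
    )
}

/// Sum the integer found at `path` in each row, skipping rows where the
/// path is absent or holds a non-integer value.
fn sum_int_at(column: &[ToyVariant], path: &[&str]) -> i64 {
    column
        .iter()
        .filter_map(|row| match row.get_path(path) {
            Some(ToyVariant::Int(n)) => Some(*n),
            _ => None,
        })
        .sum()
}

/// A small column whose rows deliberately have different shapes.
fn sample_column() -> Vec<ToyVariant> {
    vec![
        obj(vec![("user", obj(vec![("age", ToyVariant::Int(30))]))]),
        // Same path, but the value is a string rather than an integer.
        obj(vec![(
            "user",
            obj(vec![("age", ToyVariant::Str("unknown".to_string()))]),
        )]),
        // A row with no object at all.
        ToyVariant::Null,
        obj(vec![("user", obj(vec![("age", ToyVariant::Int(12))]))]),
    ]
}
```

Calling `sum_int_at(&sample_column(), &["user", "age"])` adds up only the two rows that hold an integer at that path. A real implementation would presumably operate on Arrow arrays and the binary encoding rather than a Rust enum per row.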

Describe the solution you'd like

No response

Describe alternatives you've considered

This will be a big project. Here are some of the related prerequisites

It is not clear to me whether variant should be "built in" or an add-on (for example, a variant feature and a datafusion-variant crate).

Additional context

Related tickets

alamb added the enhancement New feature or request label May 20, 2025
