
Partition values are not URL-decoded when extracted from Hive-style paths #19650

@greedAuguria

Describe the bug

When using Hive-style partitioned tables where partition values contain URL-encoded characters (like / encoded as %2F or spaces as %20), DataFusion returns the literal encoded string instead of the decoded value.

For example, given a file at:
s3://bucket/table/category=foo%2Fbar/file.parquet

The partition column category returns the literal value foo%2Fbar instead of the expected decoded value foo/bar.

Related Issues

This is a follow-up to #7877, which was partially addressed by #8012.
While #8012 fixed URL decoding for the table URL (ListingTableUrl::parse()), it did not decode the partition values that parse_partitions_for_path() extracts from the individual file paths.

To Reproduce

use datafusion::datasource::listing::helpers::parse_partitions_for_path;
use datafusion::datasource::listing::ListingTableUrl;
use object_store::path::Path;

#[test]
fn test_reproduce_partition_decoding_issue() {
    let table_url = ListingTableUrl::parse("s3://bucket/table").unwrap();
    // Path contains URL encoded slash %2F
    let file_path = Path::from("bucket/table/category=foo%2Fbar/file.parquet");

    let partitions = parse_partitions_for_path(&table_url, &file_path, vec!["category"]);

    // Current behavior: Some(["foo%2Fbar"])
    // Expected behavior: Some(["foo/bar"])
    assert_eq!(partitions, Some(vec!["foo/bar".to_string()]));
}

Expected behavior

Partition values should be URL-decoded, consistent with how ListingTableUrl handles URL-encoded paths. This matches the behavior of Apache Spark and Apache Hive.

Additional context

The fix involves updating parse_partitions_for_path in datafusion/catalog-listing/src/helpers.rs to decode each extracted partition value using percent-encoding.

Because decoding produces a new, owned string rather than a slice of the original path, the function's return type needs to change from Option<Vec<&str>> to Option<Vec<String>>.
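To illustrate why the return type has to become owned, here is a minimal, std-only sketch. It is not the DataFusion implementation: `partition_values` and `decode` are hypothetical names, paths are plain `&str` instead of `object_store::path::Path`, and the hand-rolled decoder merely stands in for whatever the percent-encoding crate would do in the real fix.

```rust
use std::borrow::Cow;

/// Hypothetical decoder: borrows when there is nothing to decode,
/// allocates a new String when at least one `%XX` escape may be present.
fn decode(raw: &str) -> Cow<'_, str> {
    if !raw.contains('%') {
        return Cow::Borrowed(raw);
    }
    let bytes = raw.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A valid escape is '%' followed by exactly two hex digits.
        let escape = raw
            .get(i + 1..i + 3)
            .filter(|hex| bytes[i] == b'%' && hex.bytes().all(|b| b.is_ascii_hexdigit()))
            .and_then(|hex| u8::from_str_radix(hex, 16).ok());
        match escape {
            Some(byte) => {
                out.push(byte);
                i += 3;
            }
            None => {
                // Not an escape: copy the byte through unchanged.
                out.push(bytes[i]);
                i += 1;
            }
        }
    }
    match String::from_utf8(out) {
        Ok(decoded) => Cow::Owned(decoded),
        // Assumption: keep the raw value if the decoded bytes are not UTF-8.
        Err(_) => Cow::Borrowed(raw),
    }
}

/// Hypothetical, simplified stand-in for partition extraction.
/// Returns owned Strings because a decoded value such as "foo/bar"
/// does not exist anywhere inside the original path.
fn partition_values(
    table_prefix: &str,
    file_path: &str,
    partition_cols: &[&str],
) -> Option<Vec<String>> {
    let relative = file_path.strip_prefix(table_prefix)?.strip_prefix('/')?;
    let mut segments = relative.split('/');
    let mut values = Vec::with_capacity(partition_cols.len());
    for col in partition_cols {
        let (name, raw) = segments.next()?.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(decode(raw).into_owned());
    }
    Some(values)
}
```

The `Cow` return in `decode` shows the trade-off: values without escapes could still be borrowed, but once any value needs decoding the caller can no longer be handed a `&str` into the path, which is why a `Vec<String>` (or a `Vec<Cow<str>>`, as an alternative design) is needed.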

This affects users storing data in Hive-partitioned layouts on object stores (S3/GCS/Azure), where partition values containing special characters are commonly percent-encoded in the object path.

Common examples:

  • category=Electronics%2FComputers → Electronics/Computers
  • city=San%20Francisco → San Francisco
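For context on how such path segments can arise on the writer side, the sketch below percent-encodes a partition value before placing it in a `column=value` segment. Both function names are hypothetical, and the set of characters left unescaped (ASCII letters, digits, `-`, `_`, `.`, `~`) is an assumption chosen for illustration; real writers such as Spark or Hive use their own escape sets.

```rust
/// Hypothetical helper: percent-encode every byte outside a small
/// "unreserved" set (an assumption, not any particular writer's rules).
fn encode_partition_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for &byte in value.as_bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            // Everything else becomes an uppercase %XX escape.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical helper: build a Hive-style `column=value` path segment.
fn partition_segment(column: &str, value: &str) -> String {
    format!("{}={}", column, encode_partition_value(value))
}
```

A reader that returns the segment value verbatim cannot round-trip these values, which is the mismatch this issue describes.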

Labels

bug