Add configurable truncation for string columns #19146

Open
jaykanakiya wants to merge 3 commits into apache:master from jaykanakiya:string-truncation

jaykanakiya commented Mar 12, 2026

Summary

Adds a configurable maximum string length for string dimension columns. Strings exceeding the limit are truncated during ingestion.

  • Global config: druid.indexing.formats.maxStringLength
  • Per-dimension override: maxStringLength field in the dimension spec
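As a rough illustration of the behavior described above, the following is a minimal sketch, not the code from this PR. The class and method names (`MaxStringLengthSketch`, `effectiveLimit`, `truncate`) are hypothetical. It also assumes that the limit counts Java `char` units and that a missing value at both levels means "no truncation"; the PR description does not state either point.

```java
public final class MaxStringLengthSketch {

    private MaxStringLengthSketch() {
    }

    // Hypothetical helper: the per-dimension value wins when present,
    // otherwise the global value applies. Either argument may be null.
    static Integer effectiveLimit(Integer perDimension, Integer global) {
        return perDimension != null ? perDimension : global;
    }

    // Truncates value to at most `limit` chars. A null limit is treated
    // as "no truncation" (an assumption). Assumes limit >= 0.
    // Note: cutting on char units can split a surrogate pair.
    static String truncate(String value, Integer limit) {
        if (value == null || limit == null || value.length() <= limit) {
            return value;
        }
        return value.substring(0, limit);
    }
}
```

For example, with a global limit of 5 and a per-dimension override of 3, `truncate("abcdef", effectiveLimit(3, 5))` keeps only the first three characters.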

Release note

Added a new maxStringLength configuration for string dimensions that truncates values exceeding the specified length during ingestion. Can be set globally via druid.indexing.formats.maxStringLength or per-dimension in the ingestion spec.


Key changed/added classes in this PR
  • DefaultColumnFormatConfig
  • StringDimensionSchema
  • StringDimensionHandler
  • StringDimensionIndexer

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added or updated version, license, or notice information in licenses.yaml
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment on lines +54 to +55
@JsonProperty("createBitmapIndex") Boolean createBitmapIndex,
@JsonProperty("maxStringLength") @Nullable Integer maxStringLength
Member

drive by comment (i'll have a closer look at rest of PR later)

Instead of adding additional arguments here, I was hoping to deprecate these arguments in favor of adding a column format spec, similar to what was done for auto/json columns in #17762, which could serve as a reference for how this should be wired up. I was planning to move the existing createBitmapIndex and multiValueHandling into such a spec, but just haven't gotten to it yet. I think this would be much cleaner and less disruptive to call sites going forward. It also allows wiring this up to IndexSpec, so that job-level defaults can be defined as a middle layer between per-column and system-wide settings.
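To make the layering idea concrete, here is a minimal sketch of how per-column, job-level, and system-wide settings could be resolved. It is an assumption-laden illustration, not the design from #17762 or from this PR: the class `StringColumnFormatSpecSketch` and its methods are hypothetical, and only two of the settings mentioned in this thread are modeled.

```java
public final class StringColumnFormatSpecSketch {
    // A null field means "not set at this level".
    private final Boolean createBitmapIndex;
    private final Integer maxStringLength;

    public StringColumnFormatSpecSketch(Boolean createBitmapIndex, Integer maxStringLength) {
        this.createBitmapIndex = createBitmapIndex;
        this.maxStringLength = maxStringLength;
    }

    public Boolean getCreateBitmapIndex() {
        return createBitmapIndex;
    }

    public Integer getMaxStringLength() {
        return maxStringLength;
    }

    // Fills fields that are unset here from the fallback; fallback may be null.
    public StringColumnFormatSpecSketch withFallback(StringColumnFormatSpecSketch fallback) {
        if (fallback == null) {
            return this;
        }
        return new StringColumnFormatSpecSketch(
            createBitmapIndex != null ? createBitmapIndex : fallback.createBitmapIndex,
            maxStringLength != null ? maxStringLength : fallback.maxStringLength
        );
    }

    // Hypothetical resolution order: per-column, then job-level, then system-wide.
    public static StringColumnFormatSpecSketch resolve(
        StringColumnFormatSpecSketch column,
        StringColumnFormatSpecSketch job,
        StringColumnFormatSpecSketch system
    ) {
        StringColumnFormatSpecSketch start =
            column != null ? column : new StringColumnFormatSpecSketch(null, null);
        return start.withFallback(job).withFallback(system);
    }
}
```

Under this sketch, adding a new string-column setting means adding one field to the spec rather than one more constructor argument at every call site.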

Contributor Author

Thanks for taking a look @clintropolis. Adding something like StringCommonFormatColumnFormatSpec would make this cleaner, and it makes sense to consolidate the configs there. Since it seems like a bigger refactor, does it make sense to do it in a follow-up? Let me know what you think.
