1 change: 1 addition & 0 deletions docs/configuration/index.md
Original file line number Diff line number Diff line change
@@ -1431,6 +1431,7 @@ Additional Peon configs include:
|`druid.indexer.task.tmpStorageBytesPerTask`|Maximum number of bytes per task to be used to store temporary files on disk. This config is generally intended for internal usage. Attempts to set it are very likely to be overwritten by the TaskRunner that executes the task, so be sure of what you expect to happen before directly adjusting this configuration parameter. The config is documented here primarily to provide an understanding of what it means if/when someone sees that it has been set. A value of -1 disables this limit. |-1|
|`druid.indexer.task.allowHadoopTaskExecution`|Conditional dictating if the cluster allows `index_hadoop` tasks to be executed. `index_hadoop` is deprecated, and defaulting to false will force cluster operators to acknowledge the deprecation and consciously opt in to using index_hadoop with the understanding that it will be removed in the future.|false|
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests served by a task's chat handler. Set to 0 to disable limiting.|0|
|`druid.indexing.formats.maxStringLength`|Maximum number of characters to store per string dimension value. Longer values are truncated during ingestion. Set to 0 to disable. Can be overridden per-dimension using `maxStringLength` in the [dimension object](../ingestion/ingestion-spec.md#dimension-objects).|0 (no truncation)|

If the Peon is running in remote mode, there must be an Overlord up and running. Peons in remote mode can set the following configurations:

1 change: 1 addition & 0 deletions docs/ingestion/ingestion-spec.md
Original file line number Diff line number Diff line change
@@ -247,6 +247,7 @@ Dimension objects can have the following components:
| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br /><br />Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
| multiValueHandling | For `string` typed dimensions, specifies the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `array` (ingest string arrays as-is), `sorted_array` (sort string arrays during ingestion), and `sorted_set` (sort and de-duplicate string arrays during ingestion). This parameter is ignored for types other than `string`. | `sorted_array` |
| maxStringLength | For `string` typed dimensions, the maximum number of characters to store per value. Longer values are truncated during ingestion. Set to 0 to disable. Overrides the global [`druid.indexing.formats.maxStringLength`](../configuration/index.md#additional-peon-configuration) property. | `0` (no truncation) |

#### Inclusions and exclusions

Original file line number Diff line number Diff line change
@@ -21,15 +21,26 @@

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.guice.BuiltInTypesModule;
import org.apache.druid.segment.DimensionHandler;
import org.apache.druid.segment.StringDimensionHandler;
import org.apache.druid.segment.column.ColumnType;

import javax.annotation.Nullable;

public class StringDimensionSchema extends DimensionSchema
{
private static final boolean DEFAULT_CREATE_BITMAP_INDEX = true;

public static int getDefaultMaxStringLength()
{
return BuiltInTypesModule.getMaxStringLength();
}

private final int maxStringLength;

@JsonCreator
public static StringDimensionSchema create(String name)
{
@@ -40,15 +51,33 @@ public static StringDimensionSchema create(String name)
public StringDimensionSchema(
@JsonProperty("name") String name,
@JsonProperty("multiValueHandling") MultiValueHandling multiValueHandling,
@JsonProperty("createBitmapIndex") Boolean createBitmapIndex
@JsonProperty("createBitmapIndex") Boolean createBitmapIndex,
@JsonProperty("maxStringLength") @Nullable Integer maxStringLength
Comment on lines +54 to +55
Member

Drive-by comment (I'll have a closer look at the rest of the PR later).

Instead of adding additional arguments here, I was hoping to deprecate these arguments in favor of a column format spec, similar to what was done for auto/json columns in #17762, which could serve as a reference for how this should be wired up. I was planning to move the existing createBitmapIndex and multiValueHandling into such a spec, but just haven't gotten to it yet. I think this would be much cleaner and less disruptive to call sites going forward. It would also allow wiring this up to IndexSpec to define job-level defaults, as a middle ground between per-column and system-wide settings.

Contributor Author

Thanks for taking a look, @clintropolis. Adding something like StringCommonFormatColumnFormatSpec would make it cleaner, and it makes sense to consolidate the configs there. Since it seems like a bigger refactor, does it make sense to do it in a follow-up? Let me know what you think.

)
{
super(name, multiValueHandling, createBitmapIndex == null ? DEFAULT_CREATE_BITMAP_INDEX : createBitmapIndex);
this.maxStringLength = maxStringLength != null && maxStringLength > 0 ? maxStringLength : getDefaultMaxStringLength();
}

public StringDimensionSchema(
String name,
MultiValueHandling multiValueHandling,
Boolean createBitmapIndex
)
{
this(name, multiValueHandling, createBitmapIndex, getDefaultMaxStringLength());
}

public StringDimensionSchema(String name)
{
this(name, null, DEFAULT_CREATE_BITMAP_INDEX);
this(name, null, DEFAULT_CREATE_BITMAP_INDEX, getDefaultMaxStringLength());
}

@JsonProperty
@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public int getMaxStringLength()
{
return maxStringLength;
}

@Override
@@ -73,6 +102,6 @@ public boolean canBeMultiValued()
@Override
public DimensionHandler getDimensionHandler()
{
return new StringDimensionHandler(getName(), getMultiValueHandling(), hasBitmapIndex(), false);
return new StringDimensionHandler(getName(), getMultiValueHandling(), hasBitmapIndex(), false, maxStringLength);
}
}
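
The precedence between the per-dimension value and the global default can be sketched in isolation. This is a minimal illustration, assuming the constructor expression shown above; `MaxStringLengthResolution` and `resolve` are hypothetical names, and the global default is passed in as a plain parameter instead of being read from `BuiltInTypesModule`.

```java
// Hypothetical stand-in for the precedence expression in the StringDimensionSchema constructor.
final class MaxStringLengthResolution
{
  /**
   * Mirrors the expression in the diff: a positive per-dimension value wins;
   * null, 0, or a negative value falls back to the global default.
   */
  static int resolve(Integer perDimension, int globalDefault)
  {
    return perDimension != null && perDimension > 0 ? perDimension : globalDefault;
  }

  public static void main(String[] args)
  {
    System.out.println(resolve(200, 50));  // per-dimension override is used
    System.out.println(resolve(null, 50)); // unset: the global default applies
    System.out.println(resolve(0, 50));    // 0 also falls back to the global default
    System.out.println(resolve(null, 0));  // nothing configured: 0, meaning no truncation
  }
}
```

One consequence worth noting: with this expression, a per-dimension value of 0 does not switch truncation off when a global limit is configured; it inherits the global limit. The "Set to 0 to disable" wording in the dimension object docs only holds when the global property is unset.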
Original file line number Diff line number Diff line change
@@ -53,6 +53,7 @@ public class BuiltInTypesModule implements DruidModule
*/
private static DimensionSchema.MultiValueHandling STRING_MV_MODE = DimensionSchema.MultiValueHandling.SORTED_ARRAY;
private static IndexSpec DEFAULT_INDEX_SPEC = IndexSpec.builder().build();
private static int MAX_STRING_LENGTH = 0;
Contributor

Would it make sense to set this to Integer.MAX_VALUE? If this is used elsewhere in the future, there wouldn't need to be explicit handling for 0 like you have in truncateIfNeeded.

Contributor Author

We are using NON_DEFAULT so that the default value is not serialized, and I think Jackson's default for an integer is 0. If we set the MAX_STRING_LENGTH default to Integer.MAX_VALUE, this value would be serialized for each dimension.
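
The two sentinel choices discussed in this thread can be compared side by side. This is a rough sketch, not code from the PR: `SentinelComparison` and both method names are hypothetical, and the first method only mirrors the shape of `truncateIfNeeded` from the diff.

```java
// Hypothetical comparison of the two "no limit" sentinels discussed in the review thread.
final class SentinelComparison
{
  // Shape used in the PR: 0 means "no limit", so the limit needs an explicit > 0 guard.
  static String truncateWithZeroSentinel(String value, int maxLength)
  {
    if (maxLength > 0 && value != null && value.length() > maxLength) {
      return value.substring(0, maxLength);
    }
    return value;
  }

  // Reviewer's alternative: Integer.MAX_VALUE means "no limit". No special case is needed,
  // because String.length() can never exceed Integer.MAX_VALUE.
  static String truncateWithMaxValueSentinel(String value, int maxLength)
  {
    if (value != null && value.length() > maxLength) {
      return value.substring(0, maxLength);
    }
    return value;
  }

  public static void main(String[] args)
  {
    System.out.println(truncateWithZeroSentinel("abcdef", 0));                     // unchanged
    System.out.println(truncateWithMaxValueSentinel("abcdef", Integer.MAX_VALUE)); // unchanged
    System.out.println(truncateWithZeroSentinel("abcdef", 3));                     // truncated
  }
}
```

Both variants behave the same for positive limits. The trade-off raised by the author is about serialization rather than truncation: according to the reply above, a default of 0 lets `JsonInclude.Include.NON_DEFAULT` omit the field, while a default of `Integer.MAX_VALUE` would be written out for every dimension.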


/**
* @return the configured string multi value handling mode from the system config if set; otherwise, returns
@@ -89,6 +90,7 @@ public void configure(Binder binder)
public SideEffectRegisterer initDimensionHandlerAndMvHandlingMode(DefaultColumnFormatConfig formatsConfig)
{
setStringMultiValueHandlingModeIfConfigured(formatsConfig.getStringMultiValueHandlingMode());
setMaxStringLengthIfConfigured(formatsConfig.getMaxStringLength());
Contributor

You can take a look at druid.indexing.formats.stringMultiValueHandlingMode in BuiltInTypesModuleTest. It would be good to have some test coverage for the new property.

Contributor Author

Added tests for this property.

setIndexSpecDefaults(formatsConfig.getIndexSpec());
setNestedColumnDefaults(formatsConfig);

@@ -128,6 +130,24 @@ private static void registerSerde()
}
}

private static void setMaxStringLengthIfConfigured(@Nullable Integer maxStringLength)
{
if (maxStringLength != null) {
MAX_STRING_LENGTH = maxStringLength;
}
}

@VisibleForTesting
public static void setMaxStringLength(int maxStringLength)
{
MAX_STRING_LENGTH = maxStringLength;
}

public static int getMaxStringLength()
{
return MAX_STRING_LENGTH;
}

private static void setStringMultiValueHandlingModeIfConfigured(@Nullable String stringMultiValueHandlingMode)
{
if (stringMultiValueHandlingMode != null) {
Original file line number Diff line number Diff line change
@@ -68,6 +68,21 @@ private static String validateMultiValueHandlingMode(
return stringMultiValueHandlingMode;
}

@Nullable
private static Integer validateMaxStringLength(@Nullable Integer maxStringLength)
{
if (maxStringLength != null && maxStringLength <= 0) {
throw DruidException.forPersona(DruidException.Persona.OPERATOR)
.ofCategory(DruidException.Category.INVALID_INPUT)
.build(
"Invalid value[%s] specified for 'druid.indexing.formats.maxStringLength'."
+ " Value must be a positive integer.",
maxStringLength
);
}
return maxStringLength;
}

@JsonProperty("stringMultiValueHandlingMode")
@Nullable
private final Integer nestedColumnFormatVersion;
@@ -80,11 +95,16 @@ private static String validateMultiValueHandlingMode(
@Nullable
private final IndexSpec indexSpec;

@JsonProperty("maxStringLength")
@Nullable
private final Integer maxStringLength;

@JsonCreator
public DefaultColumnFormatConfig(
@JsonProperty("stringMultiValueHandlingMode") @Nullable String stringMultiValueHandlingMode,
@JsonProperty("nestedColumnFormatVersion") @Nullable Integer nestedColumnFormatVersion,
@JsonProperty("indexSpec") @Nullable IndexSpec indexSpec
@JsonProperty("indexSpec") @Nullable IndexSpec indexSpec,
@JsonProperty("maxStringLength") @Nullable Integer maxStringLength
)
{
validateMultiValueHandlingMode(stringMultiValueHandlingMode);
@@ -93,6 +113,7 @@ public DefaultColumnFormatConfig(
this.stringMultiValueHandlingMode = validateMultiValueHandlingMode(stringMultiValueHandlingMode);
this.nestedColumnFormatVersion = nestedColumnFormatVersion;
this.indexSpec = indexSpec;
this.maxStringLength = validateMaxStringLength(maxStringLength);
}

@Nullable
@@ -116,6 +137,13 @@ public IndexSpec getIndexSpec()
return indexSpec;
}

@Nullable
@JsonProperty("maxStringLength")
public Integer getMaxStringLength()
{
return maxStringLength;
}

@Override
public boolean equals(Object o)
{
@@ -128,13 +156,14 @@ public boolean equals(Object o)
DefaultColumnFormatConfig that = (DefaultColumnFormatConfig) o;
return Objects.equals(nestedColumnFormatVersion, that.nestedColumnFormatVersion)
&& Objects.equals(stringMultiValueHandlingMode, that.stringMultiValueHandlingMode)
&& Objects.equals(indexSpec, that.indexSpec);
&& Objects.equals(indexSpec, that.indexSpec)
&& Objects.equals(maxStringLength, that.maxStringLength);
}

@Override
public int hashCode()
{
return Objects.hash(nestedColumnFormatVersion, stringMultiValueHandlingMode, indexSpec);
return Objects.hash(nestedColumnFormatVersion, stringMultiValueHandlingMode, indexSpec, maxStringLength);
}

@Override
@@ -144,6 +173,7 @@ public String toString()
"stringMultiValueHandlingMode=" + stringMultiValueHandlingMode +
", nestedColumnFormatVersion=" + nestedColumnFormatVersion +
", indexSpec=" + indexSpec +
", maxStringLength=" + maxStringLength +
'}';
}
}
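
The validation rule for the global property can be exercised on its own. This is a minimal sketch that follows the condition in `validateMaxStringLength` above; `MaxStringLengthValidation` is a hypothetical name, and `IllegalArgumentException` stands in for `DruidException` so the example needs no Druid classes.

```java
// Hypothetical stand-in for validateMaxStringLength; IllegalArgumentException replaces DruidException.
final class MaxStringLengthValidation
{
  static Integer validate(Integer maxStringLength)
  {
    // Same condition as the diff: an explicitly configured value must be strictly positive.
    if (maxStringLength != null && maxStringLength <= 0) {
      throw new IllegalArgumentException(
          String.format(
              "Invalid value[%s] specified for 'druid.indexing.formats.maxStringLength'."
              + " Value must be a positive integer.",
              maxStringLength
          )
      );
    }
    // null means "not configured" and passes through untouched.
    return maxStringLength;
  }

  public static void main(String[] args)
  {
    System.out.println(validate(null)); // not configured
    System.out.println(validate(100));  // accepted
    try {
      validate(0);
    }
    catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // an explicit 0 is rejected
    }
  }
}
```

Under this rule, explicitly setting the global property to 0 is rejected rather than treated as "disabled"; truncation is off only when the property is left unset. That differs from the "Set to 0 to disable" wording in the configuration table for this property.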
Original file line number Diff line number Diff line change
@@ -104,18 +104,31 @@ private static IndexedInts getRow(ColumnValueSelector s)
private final MultiValueHandling multiValueHandling;
private final boolean hasBitmapIndexes;
private final boolean hasSpatialIndexes;
private final int maxStringLength;

public StringDimensionHandler(
String dimensionName,
MultiValueHandling multiValueHandling,
boolean hasBitmapIndexes,
boolean hasSpatialIndexes
)
{
this(dimensionName, multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, StringDimensionSchema.getDefaultMaxStringLength());
}

public StringDimensionHandler(
String dimensionName,
MultiValueHandling multiValueHandling,
boolean hasBitmapIndexes,
boolean hasSpatialIndexes,
int maxStringLength
)
{
this.dimensionName = dimensionName;
this.multiValueHandling = multiValueHandling;
this.hasBitmapIndexes = hasBitmapIndexes;
this.hasSpatialIndexes = hasSpatialIndexes;
this.maxStringLength = maxStringLength;
}

@Override
@@ -160,7 +173,7 @@ public SettableColumnValueSelector makeNewSettableEncodedValueSelector()
@Override
public DimensionIndexer<Integer, int[], String> makeIndexer()
{
return new StringDimensionIndexer(multiValueHandling, hasBitmapIndexes, hasSpatialIndexes);
return new StringDimensionIndexer(multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, maxStringLength);
}

@Override
Original file line number Diff line number Diff line change
@@ -24,6 +24,7 @@
import org.apache.druid.collections.bitmap.BitmapFactory;
import org.apache.druid.collections.bitmap.MutableBitmap;
import org.apache.druid.data.input.impl.DimensionSchema.MultiValueHandling;
import org.apache.druid.data.input.impl.StringDimensionSchema;
import org.apache.druid.error.DruidException;
import org.apache.druid.java.util.common.ISE;
import org.apache.druid.java.util.common.StringUtils;
@@ -57,18 +58,38 @@ public class StringDimensionIndexer extends DictionaryEncodedColumnIndexer<int[]
private final MultiValueHandling multiValueHandling;
private final boolean hasBitmapIndexes;
private final boolean hasSpatialIndexes;
private final int maxStringLength;
private volatile boolean hasMultipleValues = false;

public StringDimensionIndexer(
@Nullable MultiValueHandling multiValueHandling,
boolean hasBitmapIndexes,
boolean hasSpatialIndexes
)
{
this(multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, StringDimensionSchema.getDefaultMaxStringLength());
}

public StringDimensionIndexer(
@Nullable MultiValueHandling multiValueHandling,
boolean hasBitmapIndexes,
boolean hasSpatialIndexes,
int maxStringLength
)
{
super(new StringDimensionDictionary());
this.multiValueHandling = multiValueHandling == null ? MultiValueHandling.ofDefault() : multiValueHandling;
this.hasBitmapIndexes = hasBitmapIndexes;
this.hasSpatialIndexes = hasSpatialIndexes;
this.maxStringLength = maxStringLength;
}

private String truncateIfNeeded(String value)
{
if (maxStringLength > 0 && value != null && value.length() > maxStringLength) {
return value.substring(0, maxStringLength);
}
return value;
}

@Override
@@ -92,7 +113,7 @@ public EncodedKeyComponent<int[]> processRowValsToUnsortedEncodedKeyComponent(@N
dimLookup.add(null);
encodedDimensionValues = IntArrays.EMPTY_ARRAY;
} else if (dimValuesList.size() == 1) {
encodedDimensionValues = new int[]{dimLookup.add(Evals.asString(dimValuesList.get(0)))};
encodedDimensionValues = new int[]{dimLookup.add(truncateIfNeeded(Evals.asString(dimValuesList.get(0))))};
} else {
hasMultipleValues = true;
final String[] dimensionValues = new String[dimValuesList.size()];
@@ -125,7 +146,7 @@ public EncodedKeyComponent<int[]> processRowValsToUnsortedEncodedKeyComponent(@N
encodedDimensionValues =
new int[]{dimLookup.add(Evals.asString(StringUtils.encodeBase64String((byte[]) dimValues)))};
} else {
encodedDimensionValues = new int[]{dimLookup.add(Evals.asString(dimValues))};
encodedDimensionValues = new int[]{dimLookup.add(truncateIfNeeded(Evals.asString(dimValues)))};
}

// If dictionary size has changed, the sorted lookup is no longer valid.
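
The truncation step in `truncateIfNeeded` relies on `String.length()` and `String.substring()`, which count UTF-16 code units rather than user-visible characters. The sketch below reproduces that method as a static helper and adds a variant that avoids cutting a surrogate pair in half. `TruncationSketch` and `truncateAtCodePointBoundary` are hypothetical and not part of the PR.

```java
// Hypothetical helper class; the first method copies the logic of truncateIfNeeded from the diff.
final class TruncationSketch
{
  // Truncates to at most maxStringLength UTF-16 code units; 0 or less means no limit.
  static String truncateIfNeeded(String value, int maxStringLength)
  {
    if (maxStringLength > 0 && value != null && value.length() > maxStringLength) {
      return value.substring(0, maxStringLength);
    }
    return value;
  }

  // Illustrative variant: backs off by one code unit if the cut would split a surrogate pair.
  static String truncateAtCodePointBoundary(String value, int maxStringLength)
  {
    if (maxStringLength > 0 && value != null && value.length() > maxStringLength) {
      int end = maxStringLength;
      // end < value.length() here, so both charAt calls are in range.
      if (Character.isHighSurrogate(value.charAt(end - 1)) && Character.isLowSurrogate(value.charAt(end))) {
        end--;
      }
      return value.substring(0, end);
    }
    return value;
  }

  public static void main(String[] args)
  {
    String s = "ab\uD83D\uDE00c"; // "ab", one emoji encoded as a surrogate pair, then "c"
    System.out.println(truncateIfNeeded(s, 3).length());            // 3, ends in a lone high surrogate
    System.out.println(truncateAtCodePointBoundary(s, 3).length()); // 2, the pair is dropped whole
  }
}
```

Whether the extra boundary check is worth it depends on how much non-BMP text the ingested data contains. The documented limit is phrased in "characters", so a value made mostly of supplementary characters would hit the limit after roughly half as many visible characters.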
Original file line number Diff line number Diff line change
@@ -54,9 +54,11 @@ public void testDeserializeFromJson() throws JsonProcessingException
final String json = "{\n"
+ " \"name\" : \"dim\",\n"
+ " \"multiValueHandling\" : \"SORTED_SET\",\n"
+ " \"createBitmapIndex\" : false\n"
+ " \"createBitmapIndex\" : false,\n"
+ " \"maxStringLength\" : 200\n"
+ "}";
final StringDimensionSchema schema = (StringDimensionSchema) jsonMapper.readValue(json, DimensionSchema.class);
Assert.assertEquals(new StringDimensionSchema("dim", MultiValueHandling.SORTED_SET, false), schema);
Assert.assertEquals(200, schema.getMaxStringLength());
}
}