-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Enable dictionary inference for index-required indexes and test raw forward index paths #17269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Enable dictionary inference for index-required indexes and test raw forward index paths #17269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enables dictionary inference for indexes that require dictionaries (inverted, FST, range) and enhances raw forward index handling throughout the codebase. It introduces automatic dictionary creation when required by certain index types, plumbs raw-encoding awareness through forward index creator/reader factories, and extends the forward index handler to support raw-forward columns in various operations.
Key changes:
- Automatic dictionary inference when indexes requiring dictionaries are configured
- Raw-encoding flag in ForwardIndexConfig to distinguish raw vs. dictionary-encoded forward indexes
- Enhanced ForwardIndexHandler to support raw forward indexes with dictionary add/remove operations
- New integration test verifying raw forward index + inverted index functionality
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| githubComplexTypeEvents_offline_table_config.json | Adds raw-encoded forward index with inverted index configuration example |
| ForwardIndexCreator.java | Refactors raw value handling into separate addRaw method |
| IndexType.java | Adds requiresDictionary method to indicate dictionary requirement |
| ForwardIndexUtils.java | New utility to identify raw forward index columns from table config |
| ForwardIndexConfig.java | Adds rawEncoding flag and getter/builder methods |
| FieldIndexConfigsUtil.java | Implements automatic dictionary enablement for required indexes |
| DictionaryIndexConfig.java | Adds utility methods to check if dictionary is required by any index |
| ForwardIndexTypeTest.java | Updates tests to include rawEncoding flag in expected configs |
| ColumnMinMaxValueGenerator.java | Passes ForwardIndexConfig when creating readers |
| InvertedIndexAndDictionaryBasedForwardIndexCreator.java | Respects rawEncoding flag when determining dictionary usage |
| ForwardIndexHandler.java | Adds ENABLE_DICTIONARY_FOR_RAW_FORWARD_INDEX operation and handling logic |
| InvertedIndexType.java | Implements requiresDictionary to return true |
| IFSTIndexType.java | Implements requiresDictionary to return true |
| FstIndexType.java | Implements requiresDictionary to return true |
| ForwardIndexType.java | Updates getFileExtension to return multiple possible extensions |
| ForwardIndexReaderFactory.java | Uses rawEncoding flag to determine reader type |
| ForwardIndexCreatorFactory.java | Uses rawEncoding flag to determine creator type |
| SingleValueVarByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| SingleValueFixedByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| MultiValueVarByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| MultiValueFixedByteRawIndexCreator.java | Overrides add method to delegate to addRaw |
| CLPForwardIndexCreatorV2.java | Overrides add method for CLP encoding |
| CLPForwardIndexCreatorV1.java | Overrides add method for CLP encoding |
| SegmentIndexCreationDriverImpl.java | Fixes inverted logic bug (isDisabled → isEnabled) |
| SegmentColumnarIndexCreator.java | Adds warning when dictionary required but explicitly disabled |
| SameValueForwardIndexCreator.java | Overrides add method to delegate to addRaw |
| RawForwardIndexInvertedIndexTest.java | New integration test for raw forward index with inverted index |
| ForwardIndexHandlerReloadQueriesTest.java | Updates test expectations and adds raw forward index test coverage |
| DataFetcher.java | Adds _useDictionary field for proper dictionary usage check |
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/ForwardIndexConfig.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/DictionaryIndexConfig.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/ForwardIndexHandlerReloadQueriesTest.java
Outdated
Show resolved
Hide resolved
880c72e to
e042d90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated no new comments.
74727cd to
d202d42
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17269 +/- ##
============================================
- Coverage 63.27% 63.25% -0.02%
- Complexity 1474 1490 +16
============================================
Files 3139 3140 +1
Lines 187189 187346 +157
Branches 28652 28687 +35
============================================
+ Hits 118437 118510 +73
- Misses 59586 59660 +74
- Partials 9166 9176 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c2dfdcc to
4b4bcd7
Compare
5fd7f53 to
9e63630
Compare
9e63630 to
677c60f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 7 comments.
| @Test(dataProvider = "bothV1AndV3", expectedExceptions = RuntimeException.class) | ||
| public void testEnableTextIndexOnNewColumnRaw(SegmentVersion segmentVersion) | ||
| throws Exception { | ||
| buildSegment(segmentVersion); | ||
| _fieldConfigMap.put(NEWLY_ADDED_STRING_COL_RAW, | ||
| new FieldConfig(NEWLY_ADDED_STRING_COL_RAW, FieldConfig.EncodingType.RAW, List.of(FieldConfig.IndexType.TEXT), | ||
| null, null)); | ||
| _fieldConfigMap.put(NEWLY_ADDED_STRING_MV_COL_RAW, | ||
| new FieldConfig(NEWLY_ADDED_STRING_MV_COL_RAW, FieldConfig.EncodingType.RAW, | ||
| List.of(FieldConfig.IndexType.TEXT), null, null)); | ||
| checkTextIndexCreation(NEWLY_ADDED_STRING_COL_RAW, 1, 1, _newColumnsSchemaWithText, true, true, true, 4); |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test expects RuntimeException but doesn't verify the exception message or root cause. The test should either validate the specific exception type/message or document why any RuntimeException is acceptable. Additionally, removed test coverage for NEWLY_ADDED_STRING_MV_COL_RAW should be restored or its removal justified.
| try { | ||
| checkForwardIndexCreation(COLUMN10_NAME, 3960, 12, _schema, false, false, false, 0, ChunkCompressionType.LZ4, | ||
| true, 0, DataType.INT, 100000); | ||
| validateIndex(StandardIndexes.range(), COLUMN10_NAME, 3960, 12, false, false, false, 0, true, 0, | ||
| ChunkCompressionType.LZ4, false, DataType.INT, 100000); | ||
| long newRangeIndexSize = new SegmentMetadataImpl(INDEX_DIR).getColumnMetadataFor(COLUMN10_NAME) | ||
| .getIndexSizeFor(StandardIndexes.range()); | ||
| assertNotEquals(oldRangeIndexSize, newRangeIndexSize); | ||
| } catch (RuntimeException e) { | ||
| // Forward-index-disabled columns cannot rebuild range index without an existing forward index; accept failure. | ||
| } |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty catch block swallows RuntimeException without logging or assertion. This makes test failures silent and hard to debug. Consider either asserting specific exception conditions, logging the caught exception, or restructuring the test to explicitly handle both success and failure paths with clear assertions.
| public List<String> getFileExtension(ColumnMetadata columnMetadata) { | ||
| if (columnMetadata.isSingleValue()) { | ||
| if (!columnMetadata.hasDictionary()) { | ||
| return V1Constants.Indexes.RAW_SV_FORWARD_INDEX_FILE_EXTENSION; | ||
| return List.of(V1Constants.Indexes.RAW_SV_FORWARD_INDEX_FILE_EXTENSION); | ||
| } else if (columnMetadata.isSorted()) { | ||
| return V1Constants.Indexes.SORTED_SV_FORWARD_INDEX_FILE_EXTENSION; | ||
| return List.of(V1Constants.Indexes.RAW_SV_FORWARD_INDEX_FILE_EXTENSION, | ||
| V1Constants.Indexes.SORTED_SV_FORWARD_INDEX_FILE_EXTENSION); | ||
| } else { | ||
| return V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION; | ||
| return List.of(V1Constants.Indexes.RAW_SV_FORWARD_INDEX_FILE_EXTENSION, | ||
| V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION); | ||
| } | ||
| } else if (!columnMetadata.hasDictionary()) { | ||
| return V1Constants.Indexes.RAW_MV_FORWARD_INDEX_FILE_EXTENSION; | ||
| return List.of(V1Constants.Indexes.RAW_MV_FORWARD_INDEX_FILE_EXTENSION); | ||
| } else { | ||
| return V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION; | ||
| return List.of(V1Constants.Indexes.RAW_MV_FORWARD_INDEX_FILE_EXTENSION, | ||
| V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION); | ||
| } |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for returning file extensions now includes RAW_SV_FORWARD_INDEX_FILE_EXTENSION in all dictionary-enabled single-value cases (lines 254-258), even when sorted or unsorted. This seems inconsistent with the multi-value logic (lines 260-264). Verify that RAW extensions should indeed be included for dictionary-encoded columns, as this appears to contradict the raw-encoding concept.
| new ColumnIndexCreators(columnName, fieldSpec, null, List.of(), | ||
| new NullValueVectorCreator(outDir, columnName))); | ||
| } else { | ||
| LOGGER.info("Column: {} is not nullable", columnName); |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The added null value vector handling logs both nullable and non-nullable columns at INFO level. This could create excessive logging for tables with many columns. Consider logging only nullable columns, or reducing non-nullable logging to DEBUG level.
| LOGGER.info("Column: {} is not nullable", columnName); | |
| LOGGER.debug("Column: {} is not nullable", columnName); |
| } else { | ||
| disableDictionaryAndCreateRawForwardIndex(column, segmentWriter); | ||
| FieldIndexConfigs fieldIndexConfig = _fieldIndexConfigs.get(column); | ||
| ForwardIndexConfig forwardIndexConfig = fieldIndexConfig.getConfig(StandardIndexes.forward()); | ||
| if (forwardIndexConfig.isRawEncoding()) { | ||
| disableDictionary(column, segmentWriter, "Disable dictionary when only raw forward index exists"); | ||
| } else { | ||
| disableDictionaryAndCreateRawForwardIndex(column, segmentWriter); | ||
| } | ||
| } |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested if-else chain starting at line 169 is complex and could benefit from early returns or extraction into separate methods. The logic handling DISABLE_DICTIONARY has three distinct paths (no forward index, raw forward index, and dictionary forward index) that would be clearer with guard clauses.
| } else { | |
| disableDictionaryAndCreateRawForwardIndex(column, segmentWriter); | |
| FieldIndexConfigs fieldIndexConfig = _fieldIndexConfigs.get(column); | |
| ForwardIndexConfig forwardIndexConfig = fieldIndexConfig.getConfig(StandardIndexes.forward()); | |
| if (forwardIndexConfig.isRawEncoding()) { | |
| disableDictionary(column, segmentWriter, "Disable dictionary when only raw forward index exists"); | |
| } else { | |
| disableDictionaryAndCreateRawForwardIndex(column, segmentWriter); | |
| } | |
| } | |
| break; | |
| } | |
| FieldIndexConfigs fieldIndexConfig = _fieldIndexConfigs.get(column); | |
| ForwardIndexConfig forwardIndexConfig = fieldIndexConfig.getConfig(StandardIndexes.forward()); | |
| if (forwardIndexConfig.isRawEncoding()) { | |
| disableDictionary(column, segmentWriter, "Disable dictionary when only raw forward index exists"); | |
| break; | |
| } | |
| disableDictionaryAndCreateRawForwardIndex(column, segmentWriter); |
| assertEquals(response.get("numEntriesScannedInFilter").asLong(), 0L, | ||
| "Inverted index should avoid scanning filters even for raw forward index"); |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion verifies that numEntriesScannedInFilter is 0, expecting the inverted index to eliminate filter scanning. However, the test doesn't verify that the inverted index was actually used. Consider adding an assertion to check query plan or execution statistics that confirm inverted index usage.
| @Test | ||
| public void testRawForwardColumnIndexAddAndRemove() | ||
| throws Exception { | ||
| String filterValue = escapeSingleQuotes(getMostFrequentColumn5Value()); | ||
| String filterQuery = String.format("SELECT COUNT(*) FROM %s WHERE column5 = '%s'", RAW_TABLE_NAME, filterValue); | ||
|
|
||
| BrokerResponseNative baselineResponse = getBrokerResponse(filterQuery); | ||
| assertTrue(baselineResponse.getExceptions() == null || baselineResponse.getExceptions().size() == 0); | ||
| ResultTable baselineResultTable = baselineResponse.getResultTable(); | ||
| assertNotNull(baselineResultTable); | ||
| List<Object[]> baselineRows = baselineResultTable.getRows(); | ||
| long baselineEntriesScanned = baselineResponse.getNumEntriesScannedInFilter(); | ||
| assertTrue(baselineEntriesScanned > 0); | ||
| } |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method testRawForwardColumnIndexAddAndRemove() establishes a baseline but doesn't perform any index add/remove operations or validate behavior changes. The test appears incomplete and should either be removed or expanded to test the full add/remove cycle it claims to test.
Infer dictionary creation whenever indexes that require a dictionary (inverted, FST, range, etc.) are configured, and plumb raw-encoding awareness through forward index creator/reader factories.
Extend forward index handler logic to handle raw-forward columns with index add/remove flows, compression updates, and dictionary toggles more robustly.
Add integration and unit tests: new raw forward + inverted index integration test, expanded reload coverage, and updated forward index config/type utilities and defaults.