-
Notifications
You must be signed in to change notification settings - Fork 517
SYSTEMDS-3539 Implement delta encoding (Parts 1, 2, and 3) #2361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for the good first PR @HanaHalitim which also includes plenty of tests . There are a few things that need to be addressed. I will leave some comments in the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid adding these empty constructors across multiple files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You extend ColGroupDDC, which implements many methods for the context of DDC. However, these implementations are often incorrect in the context of delta DDC. Please override these methods and at least throw a NotImplementException.
| // } | ||
| public static AColGroup create(IColIndex colIndexes, IDictionary dict, AMapToData data, int[] cachedCounts) { | ||
| if(data.getUnique() == 1) | ||
| return ColGroupConst.create(colIndexes, dict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect for the general case. If your dictionary contains a constant value that is nonzero, it does not represent a constant colgroup in delta encoding. It only holds if the only entry of the dict is 0.
| for(int j = 0; j < nCol; j++) { | ||
| prevRow[j] = prevRowData[prevOff + _colIndexes.get(j)]; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The handling of the case rl > 0 is incorrect. The intent of this method is to copy the data from the compressed colgroup into the DenseBlock with some offset. The DenseBlock does not hold the previous row data as is assumed by your code. Instead, you would have to aggregate this compressed colgroup from row = 0 to row = rl to compute the contents of the previous row.
You should also add some tests to cover that case.
| oldIdToNewId[dac.id] = i; | ||
| idx += colIndexes.size(); | ||
| } | ||
| IDictionary dict = new DeltaDictionary(dictValues, colIndexes.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dictionary creation does not account for extra = true, which then contains one additional value. You can either support this case or throw an error if extra == true to ensure correctness
|
|
||
| return finalResult; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compression should only be applied if it is beneficial rather than enforcing it. For example, if we have CUMSUM and DDC colgroups, then we can just reinterpret them as DeltaDDC.
|
|
||
| package org.apache.sysds.test.component.compress.estim.encoding; | ||
|
|
||
| import static org.junit.Assert.assertEquals; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, these tests should be a bit more meaningful. A lot of these asserts are just null checks and not exact verifications. Also, combine is never tested
Implemented delta encoding compression for SystemDS. Added ColGroupDeltaDDC compression type that stores row differences instead of absolute values, improving compression for data with predictable patterns.
Created delta readers that compute row differences on-the-fly during compression, avoiding delta matrix materialization. Wired CUMSUM and ROWCUMSUM operations to automatically use delta encoding for their results.
Extended compression estimation with preferDeltaEncoding flag to evaluate delta encoding as a compression option. Fixed dictionary remapping bug where extractValues() reordered entries, breaking row-to-dictionary mappings.
All tests pass including ColGroupDeltaDDCTest, CLALibUnaryDeltaTest, ReadersDeltaTest, and related compression tests.