
Conversation

@HanaHalitim

Implemented delta encoding compression for SystemDS. Added a ColGroupDeltaDDC compression type that stores row differences instead of absolute values, improving compression for data with predictable patterns.

Created delta readers that compute row differences on the fly during compression, avoiding materialization of the delta matrix. Wired the CUMSUM and ROWCUMSUM operations to automatically use delta encoding for their results.

Extended compression estimation with a preferDeltaEncoding flag to evaluate delta encoding as a compression option. Fixed a dictionary remapping bug where extractValues() reordered entries, breaking row-to-dictionary mappings.

All tests pass, including ColGroupDeltaDDCTest, CLALibUnaryDeltaTest, ReadersDeltaTest, and related compression tests.

@janniklinde (Contributor)

Thanks for the good first PR @HanaHalitim, which also includes plenty of tests. There are a few things that need to be addressed; I will leave some comments in the code.

@janniklinde (Contributor) · Nov 25, 2025

I would avoid adding these empty constructors across multiple files.

@janniklinde (Contributor)

You extend ColGroupDDC, which implements many methods in the context of DDC. However, these implementations are often incorrect in the context of delta DDC. Please override these methods and at least throw a NotImplementedException.
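A minimal sketch of the override-and-throw pattern being requested, using hypothetical simplified stand-ins for ColGroupDDC and ColGroupDeltaDDC (the real classes have far larger interfaces, and SystemDS may use a different exception type):

```java
// Hypothetical stand-in for ColGroupDDC: value at (r, c) is a direct dictionary lookup.
class DdcGroup {
    public double getIdx(int r, int c) {
        return 0.0; // placeholder for the real dictionary lookup
    }
}

// Hypothetical stand-in for ColGroupDeltaDDC. Under delta encoding, the value
// at (r, c) is the running sum of deltas from row 0 to r, so the inherited
// direct lookup is silently wrong. Until a correct implementation exists,
// the override fails loudly instead of returning incorrect data.
class DeltaDdcGroup extends DdcGroup {
    @Override
    public double getIdx(int r, int c) {
        throw new UnsupportedOperationException("getIdx not implemented for delta DDC");
    }
}
```

The point is that inheriting silently incorrect behavior is worse than an explicit failure: callers get an immediate error rather than corrupted results.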

// }
public static AColGroup create(IColIndex colIndexes, IDictionary dict, AMapToData data, int[] cachedCounts) {
if(data.getUnique() == 1)
return ColGroupConst.create(colIndexes, dict);
@janniklinde (Contributor)

This is incorrect in the general case. If your dictionary contains a constant value that is nonzero, it does not represent a constant colgroup under delta encoding; that only holds if the dictionary's single entry is 0.
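A sketch of why, with a hypothetical simplified decoder (real SystemDS dictionaries and mappings are more involved): a delta group whose dictionary holds a single tuple t decodes to row r = (r + 1) * t, so it only collapses to a constant colgroup when every value of t is zero.

```java
class DeltaConstCheck {
    // Corrected condition: a single-tuple delta dictionary represents a
    // constant colgroup only if the tuple is all zeros.
    public static boolean isConstantUnderDelta(double[] tuple) {
        for (double v : tuple)
            if (v != 0.0)
                return false;
        return true;
    }

    // Decoding a single-tuple delta group, for illustration: each row adds
    // the tuple to the previous row (the row before row 0 is implicitly zero).
    public static double[][] decode(double[] tuple, int nRows) {
        double[][] out = new double[nRows][tuple.length];
        for (int r = 0; r < nRows; r++)
            for (int c = 0; c < tuple.length; c++)
                out[r][c] = (r == 0 ? 0.0 : out[r - 1][c]) + tuple[c];
        return out;
    }
}
```

For tuple {2}, the decoded rows are 2, 4, 6, … — clearly not a constant column.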

for(int j = 0; j < nCol; j++) {
prevRow[j] = prevRowData[prevOff + _colIndexes.get(j)];
}
}
@janniklinde (Contributor) · Nov 25, 2025

The handling of the case rl > 0 is incorrect. The intent of this method is to copy data from the compressed colgroup into the DenseBlock at some offset. The DenseBlock does not hold the previous row's data, as your code assumes. Instead, you would have to aggregate this compressed colgroup from row 0 to row rl to compute the contents of the previous row.

You should also add some tests to cover that case.
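A minimal sketch of the aggregation being described, with a hypothetical deltas array standing in for the dictionary lookups performed through the mapping data: before writing rows [rl, ru) of a delta-encoded group, the previous row rl - 1 is reconstructed by summing all deltas from row 0 up to rl - 1.

```java
class DeltaPrevRow {
    // Reconstruct row (rl - 1) of the decoded matrix by accumulating all
    // deltas from row 0 through row rl - 1. The row before row 0 is
    // implicitly zero, so rl == 0 yields a zero vector.
    public static double[] previousRow(double[][] deltas, int rl, int nCol) {
        double[] prev = new double[nCol];
        for (int r = 0; r < rl; r++)
            for (int c = 0; c < nCol; c++)
                prev[c] += deltas[r][c];
        return prev;
    }
}
```

With this, decompression of rows [rl, ru) can start from the correct running state rather than reading stale values out of the target block.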

oldIdToNewId[dac.id] = i;
idx += colIndexes.size();
}
IDictionary dict = new DeltaDictionary(dictValues, colIndexes.size());
@janniklinde (Contributor)

Dictionary creation does not account for extra == true, in which case the dictionary contains one additional value. You can either support this case or throw an error when extra == true to ensure correctness.


return finalResult;
}

@janniklinde (Contributor)

Compression should only be applied if it is beneficial, rather than enforced. For example, if we have CUMSUM and DDC colgroups, we can simply reinterpret them as DeltaDDC.
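A sketch of why this reinterpretation works (a simplified illustration, not the SystemDS API): cumsum(x)[r] is the sum of x[0..r], which is exactly what delta decoding computes when the values of x are read as deltas. A DDC group holding x therefore already encodes cumsum(x) as a DeltaDDC group, with dictionary and mapping reused unchanged.

```java
class CumsumAsDelta {
    // Running sum: out[r] = x[0] + x[1] + ... + x[r]. This is both the
    // definition of column-wise CUMSUM and the delta-decoding rule, which is
    // why a DDC group over x doubles as a DeltaDDC group over cumsum(x).
    public static double[] cumsum(double[] x) {
        double[] out = new double[x.length];
        double acc = 0.0;
        for (int i = 0; i < x.length; i++)
            out[i] = acc += x[i];
        return out;
    }
}
```

The practical upshot of the review comment: no recompression or estimation pass is needed for this case, only a change of column-group type.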


package org.apache.sysds.test.component.compress.estim.encoding;

import static org.junit.Assert.assertEquals;
@janniklinde (Contributor)

Unnecessary import

@janniklinde (Contributor)

In general, these tests should be a bit more meaningful: a lot of these asserts are just null checks rather than exact verifications. Also, combine is never tested.
