-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-33003] Support isolation forest algorithm in Flink ML #253
base: master
Are you sure you want to change the base?
Conversation
Hello @lindong28 @zhipeng93 teachers, this problem seems that the beam package in pyflink has changed. How should I solve this problem? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Left some comments below.
Can you help make sure that all tests can pass successfully?
...-lib/src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForest.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForestParams.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForestParams.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForestParams.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForestModel.java
Outdated
Show resolved
Hide resolved
...-lib/src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForest.java
Outdated
Show resolved
Hide resolved
public static class ITree implements Serializable { | ||
public int attributeIndex; | ||
public double splitAttributeValue; | ||
public ITree leftTree, rightTree; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we declare one variable at a line for consistency with the existing code style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
...main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForestModelData.java
Outdated
Show resolved
Hide resolved
...-lib/src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForest.java
Outdated
Show resolved
Hide resolved
...-lib/src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForest.java
Outdated
Show resolved
Hide resolved
…ut the IForest and ITree in IsolationForest and add test case.)
@lindong28 Please review again teacher,thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution and the PR update! I left some comments below.
Sorry, I might not have time to actively review this PR later. I will ask if some of my colleagues can help review this PR.
The PR currently fails compilation due to the following error in the log [1]. I don't think it is related to either pyflink or beam. But I probably won't be able to dig into this anytime soon.
Error: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project flink-ml-lib-1.17: Compilation failure: Compilation failure:
Error: /home/runner/work/flink-ml/flink-ml/flink-ml-lib/src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForest.java:[302,21] cannot infer type arguments for org.apache.flink.iteration.datacache.nonkeyed.ListStateWithCache<>
Error: reason: cannot infer type-variable(s) T
Error: (actual and formal argument lists differ in length)
[1] https://github.com/apache/flink-ml/actions/runs/6092176700/job/16543229469?pr=253
"The number of features used to train each tree and it is treated as a fraction in the range (0, 1.0].", | ||
1.0); | ||
|
||
default int getNumTrees() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this method is already defined in IsolationForestModelParams
, should we remove it here?
Same for setNumTrees()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, teacher, I have deleted him in second pr which will overwrite the first pr.
|
||
public void generateIsolationForest(DenseVector[] samplesData, int[] featureIndices) { | ||
int n = samplesData.length; | ||
this.subSamplesSize = Math.min(256, n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits: would it be more consistent with existing code to use subSamplesSize
directly here?
Same for other usages of this.*
outside constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will fix it.
} | ||
} | ||
|
||
assert tmpITree != null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we only use assert
in unit tests.
Given that we will have NullPointerException in the line below if tmpITree == null
, it seems simpler to just remove this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will delete it.
public final double splitAttributeValue; | ||
public ITree leftTree; | ||
public ITree rightTree; | ||
public int currentHeight; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we only mutate currentHeight
right after it is constructed, it should be possible and better to make it final
.
Same for the other three non-final variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good for me, I will make it final.
Teacher, this problem seems to be ListStateWithCache class. I am not sure if anyone has submitted the relevant PR. I will rebuild the branch based on the latest master, test it and resubmit it.Thank you for your suggestion and review, best to you. |
This is due to this PR (ebdf362) induced,I will fix the problem. |
What is the purpose of the change
This pull request adds the implementation of Isolation forest algorithm.
Brief change log
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation