[FLINK-33003] Support isolation forest algorithm in Flink ML #253

Open, wants to merge 2 commits into base: master

@zhaozijun109 zhaozijun109 commented Aug 31, 2023

What is the purpose of the change

This pull request adds an implementation of the isolation forest algorithm.

Brief change log

  • Adds the isolation forest algorithm in Java (flink-ml-lib)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@zhaozijun109
Author

zhaozijun109 commented Sep 1, 2023

Hello @lindong28 @zhipeng93, teachers, the failure seems to be caused by a change in the beam package in pyflink. How should I solve this problem?

Member

@lindong28 lindong28 left a comment

Thanks for the PR! Left some comments below.

Can you help make sure that all tests can pass successfully?

public static class ITree implements Serializable {
public int attributeIndex;
public double splitAttributeValue;
public ITree leftTree, rightTree;

Member

Can we declare one variable per line for consistency with the existing code style?

Author

Ok.

…ut the IForest and ITree in IsolationForest and add test case.)
@zhaozijun109
Author

@lindong28 Please review again, teacher. Thank you.

Member

@lindong28 lindong28 left a comment

Thanks for your contribution and the PR update! I left some comments below.

Sorry, I might not have time to actively review this PR later. I will ask if some of my colleagues can help review this PR.

The PR currently fails compilation due to the following error in the log [1]. I don't think it is related to either pyflink or beam. But I probably won't be able to dig into this anytime soon.

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project flink-ml-lib-1.17: Compilation failure: Compilation failure: 
Error:  /home/runner/work/flink-ml/flink-ml/flink-ml-lib/src/main/java/org/apache/flink/ml/anomalydetection/isolationforest/IsolationForest.java:[302,21] cannot infer type arguments for org.apache.flink.iteration.datacache.nonkeyed.ListStateWithCache<>
Error:    reason: cannot infer type-variable(s) T
Error:      (actual and formal argument lists differ in length)

[1] https://github.com/apache/flink-ml/actions/runs/6092176700/job/16543229469?pr=253
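
For readers unfamiliar with this javac message: it typically appears when a generic class is instantiated with the diamond operator and the number of arguments does not match any constructor, for example after a constructor signature changed upstream. The following is a minimal sketch of that situation only. `SketchCache` and its constructor parameters are hypothetical stand-ins, not the real `ListStateWithCache` API.

```java
import java.util.ArrayList;
import java.util.List;

class DiamondInferenceSketch {
    static SketchCache<Double> build() {
        // A call site written against an older one-argument constructor.
        // With the diamond operator, javac rejects it with
        // "cannot infer type arguments for SketchCache<>" and the reason
        // "actual and formal argument lists differ in length":
        // SketchCache<Double> cache = new SketchCache<>("samples");

        // Call site updated to match the current two-argument constructor.
        SketchCache<Double> cache = new SketchCache<>("samples", 256);
        cache.add(1.0);
        return cache;
    }

    public static void main(String[] args) {
        System.out.println(build().size());
    }
}

// Hypothetical generic cache; NOT the real ListStateWithCache class.
class SketchCache<T> {
    private final List<T> items = new ArrayList<>();
    private final String name;
    private final int capacity;

    // Assume an upstream change added the `capacity` parameter here.
    SketchCache(String name, int capacity) {
        this.name = name;
        this.capacity = capacity;
    }

    void add(T item) {
        if (items.size() >= capacity) {
            throw new IllegalStateException(name + " is full");
        }
        items.add(item);
    }

    int size() {
        return items.size();
    }
}
```

Under this reading, the fix is to update the call site to the current constructor signature rather than to change anything related to pyflink or beam.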

"The number of features used to train each tree and it is treated as a fraction in the range (0, 1.0].",
1.0);

default int getNumTrees() {

Member

Given that this method is already defined in IsolationForestModelParams, should we remove it here?

Same for setNumTrees().

Author

Yes, teacher, I have deleted it in the second PR, which will overwrite the first PR.
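
The point about the duplicated getter can be illustrated with plain interface inheritance. This is only a sketch under assumptions: the interfaces below are heavily simplified, hypothetical stand-ins (a map-backed parameter store, a default of 100 trees) and do not reproduce the Flink ML params API. It assumes the estimator-side interface extends the model-side one, as the review comment implies.

```java
import java.util.HashMap;
import java.util.Map;

class ParamsInheritanceSketch {
    // Hypothetical model-side params interface that owns numTrees.
    interface ModelParams<T> {
        Map<String, Object> paramMap();

        default int getNumTrees() {
            // The default value 100 is an assumption made for this sketch.
            return (Integer) paramMap().getOrDefault("numTrees", 100);
        }

        @SuppressWarnings("unchecked")
        default T setNumTrees(int value) {
            paramMap().put("numTrees", value);
            return (T) this;
        }
    }

    // The estimator-side interface extends the model-side one, so it
    // inherits getNumTrees()/setNumTrees() and need not redeclare them.
    interface EstimatorParams<T> extends ModelParams<T> {}

    static final class Estimator implements EstimatorParams<Estimator> {
        private final Map<String, Object> params = new HashMap<>();

        @Override
        public Map<String, Object> paramMap() {
            return params;
        }
    }

    public static void main(String[] args) {
        Estimator estimator = new Estimator().setNumTrees(50);
        System.out.println(estimator.getNumTrees());
    }
}
```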


public void generateIsolationForest(DenseVector[] samplesData, int[] featureIndices) {
int n = samplesData.length;
this.subSamplesSize = Math.min(256, n);

Member

nit: would it be more consistent with existing code to use subSamplesSize directly here?

Same for other usages of this.* outside constructor.

Author

OK, I will fix it.

}
}

assert tmpITree != null;

Member

In general we only use assert in unit tests.

Given that we will have NullPointerException in the line below if tmpITree == null, it seems simpler to just remove this line.

Author

Ok, I will delete it.
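
As background for this thread: Java `assert` statements are only evaluated when the JVM is started with assertions enabled (`-ea`), so they give no guarantee in a normal production run, while dereferencing a null reference always throws. A minimal sketch of the suggested simplification follows. The `Node` type and `heightOf` method are hypothetical, not the PR's `ITree` code.

```java
class AssertRemovalSketch {
    // Hypothetical minimal node type used only for this illustration.
    static final class Node {
        final int height;

        Node(int height) {
            this.height = height;
        }
    }

    // No `assert node != null;` line is needed: if node is null, the field
    // access below already fails with a NullPointerException.
    static int heightOf(Node node) {
        return node.height;
    }

    public static void main(String[] args) {
        System.out.println(heightOf(new Node(3)));
        try {
            heightOf(null);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException raised without any assert");
        }
    }
}
```

If an earlier and more descriptive failure were wanted, `java.util.Objects.requireNonNull` would be one option, though the reviewer's suggestion here is simply to drop the line.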

public final double splitAttributeValue;
public ITree leftTree;
public ITree rightTree;
public int currentHeight;

Member

Given that we only mutate currentHeight right after it is constructed, it should be possible and better to make it final.

Same for the other three non-final variables.

Author

Sounds good to me, I will make it final.
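
One way to make every field final is to pass all values, including the height and the child nodes, through the constructor and to build children before their parent. The sketch below assumes a simplified, hypothetical `Node` class. The field names follow the snippet under review, but this is not the PR's actual `ITree` implementation.

```java
class FinalFieldsSketch {
    // Hypothetical immutable tree node, one field declaration per line.
    static final class Node {
        final int attributeIndex;
        final double splitAttributeValue;
        final Node leftTree;
        final Node rightTree;
        final int currentHeight;

        Node(
                int attributeIndex,
                double splitAttributeValue,
                Node leftTree,
                Node rightTree,
                int currentHeight) {
            this.attributeIndex = attributeIndex;
            this.splitAttributeValue = splitAttributeValue;
            this.leftTree = leftTree;
            this.rightTree = rightTree;
            this.currentHeight = currentHeight;
        }
    }

    // Leaves carry no split; -1 and 0.0 are placeholder values for this sketch.
    static Node leaf(int height) {
        return new Node(-1, 0.0, null, null, height);
    }

    // Children are created first, so the parent never needs to be mutated
    // after construction.
    static Node example() {
        Node left = leaf(1);
        Node right = leaf(1);
        return new Node(0, 0.5, left, right, 0);
    }

    public static void main(String[] args) {
        Node root = example();
        System.out.println(root.currentHeight + " " + root.leftTree.currentHeight);
    }
}
```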

@zhaozijun109
Author

zhaozijun109 commented Sep 7, 2023

Teacher, the compilation error seems to come from the ListStateWithCache class. I am not sure whether anyone has submitted a related PR. I will rebuild the branch on the latest master, test it, and resubmit it. Thank you for your suggestions and review. Best wishes.

@zhaozijun109
Author

The compilation error was introduced by this PR (ebdf362). I will fix the problem.
