Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-32293] Support sparse vector with long index #242

Closed
wants to merge 2 commits into from

Conversation

zhipeng93
Copy link
Contributor

@zhipeng93 zhipeng93 commented Jun 9, 2023

What is the purpose of the change

Add support for sparse vector with long as index and double as value.

Brief change log

  • Renamed SparseVector as SparseIntDoubleVector and DenseVector as DenseIntDoubleVector.
  • Added generics in Vector interface.
  • Added IntDoubleVector and LongDoubleVector for vectors with int keys and double values, long keys and double values, respectively.
  • Added test for SparseLongDouble vector.
  • Added the coresponding python changes.
  • Updated the corresponding flink ml docs.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@zhipeng93 zhipeng93 force-pushed the FLINK-32293 branch 4 times, most recently from 280156f to 02154aa Compare June 9, 2023 09:38
@zhipeng93
Copy link
Contributor Author

@lindong28 @HuangXingBo Can you help to review this PR?

@lindong28 lindong28 self-requested a review June 10, 2023 01:01
Copy link
Member

@lindong28 lindong28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Left some comments below.

I created a commit to demonstrate an alternative way to support SparseVector with long index that hopefully can address most of comments mentioned below.

https://github.com/lindong28/flink-ml/commits/FLINK-32293-backup

@TypeInfo(VectorTypeInfoFactory.class)
@PublicEvolving
public interface Vector extends Serializable {
public interface Vector<K extends Number, V extends Number> extends Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we only need to store values as double would it be simpler to remove V extends Number?

We can extend the interface to support using float as the value type. But given that Spark ML does not provide this optimization, it is not clear whether/when we will ever need it.

Copy link
Contributor Author

@zhipeng93 zhipeng93 Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping V extends Number would make this interface more scalable. Though Spark does not support float32 as value, but storing data in other data types (e.g., float32) does save the space.

I searched the data types used in some machine learning libraries and math lib.

  • Breeze supports int-key and double/float/int/long values [1]. Breeze is used by Spark ML, but Spark ML restricts the value type as double.
  • Sklearn uses numpy array as input, the data types of numpy array could be float16, single, double etc. [2]
  • TensorFlow/PyTorch supports different data types of tensors [3]

[1] https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/DenseVector.scala
[2] https://numpy.org/doc/stable/user/basics.types.html
[3] https://www.tensorflow.org/guide/tensor#more_on_dtypes


/** Gets the value of the ith element. */
double get(int i);
V get(K i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to just use double get(long i)?


/** Sets the value of the ith element. */
void set(int i, double value);
void set(K i, V value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to use void set(long i, double value)?


/** Gets the size of the vector. */
int size();
K size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to use long size()?

/** Converts the instance to a double array. */
double[] toArray();
/** Converts the instance to a primitive array. */
Object toArray();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of this API always return double[]. Would it be simpler to still use double[] toArray()?

public class SparseLongDoubleVector implements LongDoubleVector {

public final long n;
public long[] indices;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to make these fields private and expose them via methods (e.g. getIndices() or getIndicesAsLongArray()) so that we can use the same caller code for different implementations of SparseVector?


/** Converts the instance to a sparse vector. */
SparseVector toSparse();
Vector<K, V> toSparse();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of defining this method in an interface is to have the caller use the return value via public APIs of Vector<K, V>.

However, it appears that caller of this method uses return values via public fields/APIs that do not belong to Vector<K, V>. For example, MinHashLSHModelData#hashFunction() accesses the field SparseIntDoubleVector#indices via vec.toSparse().indices. The code will break if we change the concrete class from SparseIntDoubleVector to SparseLongDoubleVector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have overrode the return type in IntDoubleVector and LongDoubleVector, since for different instances toSparse should return different types of vectors.

We can talk about this offline.

@zhipeng93 zhipeng93 marked this pull request as draft June 13, 2023 02:36
…seVector as SparseVector, DenseVector, SparseIntDoubleVector and DenseIntDoubleVector; make class members of vectors private and fix some bugs introduced by the last commit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants