-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-32293] Support sparse vector with long index #242
Conversation
280156f
to
02154aa
Compare
@lindong28 @HuangXingBo Can you help to review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Left some comments below.
I created a commit to demonstrate an alternative way to support SparseVector with long index that hopefully can address most of comments mentioned below.
https://github.com/lindong28/flink-ml/commits/FLINK-32293-backup
@TypeInfo(VectorTypeInfoFactory.class) | ||
@PublicEvolving | ||
public interface Vector extends Serializable { | ||
public interface Vector<K extends Number, V extends Number> extends Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we only need to store values as double would it be simpler to remove V extends Number
?
We can extend the interface to support using float as the value type. But given that Spark ML does not provide this optimization, it is not clear whether/when we will ever need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keeping V extends Number
would make this interface more scalable. Though Spark does not support float32 as value, but storing data in other data types (e.g., float32) does save the space.
I searched the data types used in some machine learning libraries and math lib.
- Breeze supports int-key and double/float/int/long values [1]. Breeze is used by Spark ML, but Spark ML restricts the value type as double.
- Sklearn uses numpy array as input, the data types of numpy array could be float16, single, double etc. [2]
- TensorFlow/PyTorch supports different data types of tensors [3]
[1] https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/DenseVector.scala
[2] https://numpy.org/doc/stable/user/basics.types.html
[3] https://www.tensorflow.org/guide/tensor#more_on_dtypes
|
||
/** Gets the value of the ith element. */ | ||
double get(int i); | ||
V get(K i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be simpler to just use double get(long i)
?
|
||
/** Sets the value of the ith element. */ | ||
void set(int i, double value); | ||
void set(K i, V value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be simpler to use void set(long i, double value)
?
|
||
/** Gets the size of the vector. */ | ||
int size(); | ||
K size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be simpler to use long size()
?
/** Converts the instance to a double array. */ | ||
double[] toArray(); | ||
/** Converts the instance to a primitive array. */ | ||
Object toArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of this API always return double[]
. Would it be simpler to still use double[] toArray()
?
public class SparseLongDoubleVector implements LongDoubleVector { | ||
|
||
public final long n; | ||
public long[] indices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to make these fields private and expose them via methods (e.g. getIndices()
or getIndicesAsLongArray()
) so that we can use the same caller code for different implementations of SparseVector?
|
||
/** Converts the instance to a sparse vector. */ | ||
SparseVector toSparse(); | ||
Vector<K, V> toSparse(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of defining this method in an interface is to have the caller use the return value via public APIs of Vector<K, V>
.
However, it appears that caller of this method uses return values via public fields/APIs that do not belong to Vector<K, V>
. For example, MinHashLSHModelData#hashFunction()
accesses the field SparseIntDoubleVector#indices
via vec.toSparse().indices
. The code will break if we change the concrete class from SparseIntDoubleVector
to SparseLongDoubleVector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have overrode the return type in IntDoubleVector
and LongDoubleVector
, since for different instances toSparse
should return different types of vectors.
We can talk about this offline.
…seVector as SparseVector, DenseVector, SparseIntDoubleVector and DenseIntDoubleVector; make class members of vectors private and fix some bugs introduced by the last commit
What is the purpose of the change
Add support for sparse vector with long as index and double as value.
Brief change log
SparseVector
asSparseIntDoubleVector
andDenseVector
asDenseIntDoubleVector
.Vector
interface.IntDoubleVector
andLongDoubleVector
for vectors with int keys and double values, long keys and double values, respectively.Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes)Documentation