Replies: 5 comments 5 replies
-
I think the Scala info objects are not the best choice, precisely because of Scala. For the PySpark part, calling Java code is also better than calling Scala (py4j is not well suited for Scala). I suggest having the meta-info classes in Java.
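A minimal illustration of why plain Java is friendlier here: py4j reaches JVM objects through ordinary Java reflection, while Scala features like companion objects compile to names (e.g. `VertexInfo$.MODULE$`) that are awkward to address from Python. The class and package names below are hypothetical:

```java
// A plain Java bean like this is directly reachable from PySpark via
// spark._jvm.org.example.graphar.VertexInfo (path is illustrative).
public class VertexInfo {
    private final String label;

    public VertexInfo(String label) {
        this.label = label;
    }

    public String getLabel() {
        return label;
    }
}
```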
-
Agree with Sem. The meta part is the core of the GraphAr format implementation; unifying it in a Java submodule would help keep the format implementation aligned across the libraries.
-
Which naming convention should we use: keep the existing style, or switch to the normal Java naming convention? Existing style:
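A hedged sketch of the existing style, assuming the getters mirror the snake_case YAML field names the way the current Scala info classes do:

```java
// Getter names keep the snake_case of the YAML keys (illustrative).
public class VertexInfo {
    private String label;
    private long chunk_size;

    public String getLabel() {
        return label;
    }

    public long getChunk_size() {
        return chunk_size;
    }
}
```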
Normal Java naming convention:
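The same class following standard JavaBeans camelCase conventions (again illustrative):

```java
// Standard Java naming: camelCase fields and getters.
public class VertexInfo {
    private String label;
    private long chunkSize;

    public String getLabel() {
        return label;
    }

    public long getChunkSize() {
        return chunkSize;
    }
}
```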
-
We have:
With this layout, the PySpark implementation will work on both regular PySpark and Spark Connect jobs. There is also an option to use pure Java (with the provided Info classes) without Spark (do we really need it?). @Thespica @acezen FYI, as a continuation of the discussion in #401.
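A hedged sketch of what the pure-Java option could look like; `VertexInfo.loadVertexInfo` and `getChunkSize` are hypothetical names, not the confirmed GraphAr API:

```java
// Using the meta-info classes without any Spark dependency (hypothetical API).
public class PureJavaExample {
    public static void main(String[] args) {
        // Load the vertex meta-info from its YAML file.
        VertexInfo vertexInfo = VertexInfo.loadVertexInfo("person.vertex.yml");
        // Plain Java access, no SparkSession required.
        System.out.println("chunk size = " + vertexInfo.getChunkSize());
    }
}
```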
-
A point needs to be discussed here about immutability. We hope the meta-info is immutable, for parallelism and safety, but the standard Java collections are mutable. So we need immutable classes to replace the usual collections like List, Map, etc. Here are two options I found:
I prefer the first for now, but I'm not sure about that, so I need your perspective. Besides, we should make sure the getters return immutable views as well.
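As one possible shape for this (a sketch, assuming JDK 9+ `List.copyOf`/`Map.copyOf` as the immutable-collection choice; Guava's `ImmutableList`/`ImmutableMap` would look almost identical, and the field names are illustrative):

```java
import java.util.List;
import java.util.Map;

// An immutable info class: final fields, defensive immutable copies on
// construction, and getters that expose only unmodifiable collections.
public final class EdgeInfo {
    private final String label;
    private final List<String> adjLists;
    private final Map<String, String> properties;

    public EdgeInfo(String label, List<String> adjLists, Map<String, String> properties) {
        this.label = label;
        this.adjLists = List.copyOf(adjLists);       // immutable copy
        this.properties = Map.copyOf(properties);    // immutable copy
    }

    public String getLabel() {
        return label;
    }

    // Callers get immutable collections; any mutation attempt throws
    // UnsupportedOperationException.
    public List<String> getAdjLists() {
        return adjLists;
    }

    public Map<String, String> getProperties() {
        return properties;
    }
}
```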
-
Java and Scala both run on the JVM and can call each other's code directly, so we can let the Java SDK and the Spark SDK share the same meta-info part and avoid duplicating work when it is updated.
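A quick illustration of the interop point; `GraphInfo` and its usage are hypothetical:

```java
// Shared meta-info class living in the Java submodule.
public class GraphInfo {
    private final String name;

    public GraphInfo(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}
// From the Scala side of the Spark SDK, the same class is used directly,
// with no bridging layer:
//   val info = new GraphInfo("ldbc")
//   info.getName
```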
For the meta-info part, there are two options:
I'm not sure how to choose.
Related issue: #276