Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Protobuf OneOf type declarations to avoid an intermediate OneOf Type class #2918

Open
rocketraman opened this issue Feb 3, 2025 · 17 comments

Comments

@rocketraman
Copy link

rocketraman commented Feb 3, 2025

What is your use-case and why do you need this feature?

I am investigating using Protobuf to serialize polymorphic types. Using an example similar to the one in the guide, I have to declare my an example hierarchy that itself contains nested classes like this:

@Serializable
data class Data(
  @ProtoNumber(1) val name: String,
  @ProtoOneOf val phone: IPhoneType?,
)

@Serializable
sealed interface IPhoneType

@Serializable
class HomePhoneType(
  @ProtoNumber(2) val homePhone: HomePhone,
): IPhoneType

@Serializable
class WorkPhoneType(
  @ProtoNumber(3) val workPhone: WorkPhone,
): IPhoneType

@Serializable
data class HomePhone(val number: String)

@Serializable
data class WorkPhone(val number: String, val extension: String)

This results in inconsistent serialization between JSON and Protobuf, the former of which serializes like this:

{
    "name": "Something",
    "phone": {
        "type": "HomePhoneType",
        "homePhone": {
            "foo": "abc"
        }
    }
}

and the latter of which uses a cleaner schema with no trace of the intermediate type:

syntax = "proto2";


message Data {
  required string name = 1;
  oneof phone {
    HomePhone homePhone = 2;
    WorkPhone workPhone = 3;
  }
}

message HomePhone {
  required string foo = 1;
}

message WorkPhone {
  required string bar = 1;
}

Furthermore, the natural way of describing such a data structure in Kotlin would be like this:

@Serializable
data class Data(
  val name: String,
  val phone: IPhone,
)

@Serializable
sealed class IPhone

@Serializable
data class HomePhone(val number: String): IPhone()

@Serializable
data class WorkPhone(val number: String, val extension: String): IPhone()

which would result in the expected JSON, but of course does not work with Protobuf OneOf.

Describe the solution you'd like

I'd like to define the data structure in an idiomatic way, and have the ProtoBuf OneOf work as expected.

It seems like the only value the intermediate OneOf type adds is the definition of the @ProtoNumber for the OneOf.

Therefore, maybe we could define the data structure in Kotlin using annotations like the following:

@Serializable
data class Data(
  @ProtoNumber(1) val name: String,
  @ProtoOneOf val phone: IPhone,
)

@Serializable
sealed class IPhone

@Serializable
@ProtoOneOfNumber(2)
data class HomePhone(val number: String): IPhone()

@Serializable
@ProtoOneOfNumber(3)
data class WorkPhone(val number: String, val extension: String): IPhone()

The result of this would be an idiomatic simpler Kotlin data structure, and consistent form of serialized output for JSON and Protobuf.

@xiaozhikang0916
Copy link
Contributor

In protobuf, data structures are reused at the message level, not oneof. If oneof is supported by this declaration as you propose, the following message definition:

message Data {
  required string name = 1;
  oneof phone {
    HomePhone homePhone = 2;
    WorkPhone workPhone = 3;
  }
}

message Data2 {
  required string name = 1;
  oneof phone {
    HomePhone homePhone = 2;
  }
}

message HomePhone {
  required string foo = 1;
}

message WorkPhone {
  required string bar = 1;
}

will need 2 kt definition to describe HomePhone in Data and Data2. In my product project, it bring a explosive duplication of these class. And it would be impossible to compare data between them.

The result of this would be an idiomatic simpler Kotlin data structure, and consistent form of serialized output for JSON and Protobuf.

I would like to know if there any real use case to require the consistent form of output for these 2 formats. And if any, it would be better to be supported by a well-defined ProtoJSON format, instead of using the origin JSON format.

@rocketraman
Copy link
Author

rocketraman commented Feb 5, 2025

will need 2 kt definition to describe HomePhone in Data and Data2.

Why would it? In that case IPhone can go back to being an interface... it doesn't change my point. Something like this:

@Serializable
data class Data(
  val name: String,
  @ProtoOneOf val phone: IPhone,
)

@Serializable
data class Data2(
  val name: String,
  @ProtoOneOf val phone: IPhone2,
)

@Serializable
sealed interface IPhone

@Serializable
sealed interface IPhone2

@Serializable
@ProtoOneOfNumber(2)
data class HomePhone(val number: String): IPhone, IPhone2

@Serializable
@ProtoOneOfNumber(3)
data class WorkPhone(val number: String, val extension: String): IPhone

With this structure, the only thing that can't be done is that HomePhone cannot have a different proto number for Data and Data2, which is definitely not ideal. If that would be a show-stopper then agreed that this won't work, unless we can find another approach to define the proto numbers for the oneof.

I would like to know if there any real use case to require the consistent form of output for these 2 formats. And if any, it would be better to be supported by a well-defined ProtoJSON format, instead of using the origin JSON format.

Its less about the output form (though making it consistent is a good thing), and more about having an idiomatic representation of model classes in Kotlin, rather than one constrained and determined by the serialization format.

@pdvrieze
Copy link
Contributor

pdvrieze commented Feb 5, 2025

Its less about the output form (though making it consistent is a good thing), and more about having an idiomatic representation of model classes in Kotlin, rather than one constrained and determined by the serialization format.

The kotlinx.serialization library (and formats) is fundamentally a Kotlin model first serialization approach (the serialization follows the Kotlin model). Given that it is a multi-format serialization approach this is somewhat unavoidable due to the infinite target issue. However, some formats allow a bit more customization of the output than others. Xml is fairly flexible (highly customizable), Json has a reasonable amount of knobs and protobuf has little. Even XML, while able to support most formats, has limits on the data structures. Json doesn't have/need the flexibility as by its very nature it has a good mapping to the code. Protobuf is just not very featureful. Note also that flexibility comes with complexity (and likely computational overheads).

Even the most flexible of formats struggles to support arbitrary Kotlin - serialization mappings. The best way to do this is to create your own format. This format could be a delegating format / transforming serializer.

@rocketraman
Copy link
Author

The kotlinx.serialization library (and formats) is fundamentally a Kotlin model first serialization approach (the serialization follows the Kotlin model). Given that it is a multi-format serialization approach this is somewhat unavoidable due to the infinite target issue. However, some formats allow a bit more customization of the output than others.

I understand all that, but Protobuf actually outputs the representation one would expect. Its the Kotlin representation of the Protobuf OneOf that has the extra "type class" ceremony which is at issue here.

@xiaozhikang0916
Copy link
Contributor

cannot have a different proto number for Data and Data2

That is the point.

And you will need to modify the existing class HomePhone for any new defined message that use it. It may be not achievable in some multi-module projects, and it is a violation of the Open–closed principle.

@rocketraman
Copy link
Author

rocketraman commented Feb 6, 2025

cannot have a different proto number for Data and Data2

That is the point.

Ok, a potential solution is to use type annotations:

// example annotation, this would be in the library
@Target(AnnotationTarget.TYPE)
annotation class ProtoOneOfNumber(val value: Int)

@Serializable
data class Data(
  @ProtoNumber(1) val name: String,
  @ProtoOneOf val phone: IPhone,
)

@Serializable
data class Data2(
  @ProtoNumber(1) val name: String,
  @ProtoOneOf val phone: IPhone2,
)

@Serializable
sealed interface IPhone

@Serializable
sealed interface IPhone2

@Serializable
data class HomePhone(val number: String): @ProtoOneOfNumber(2) IPhone, @ProtoOneOfNumber(5) IPhone2

@Serializable
data class WorkPhone(val number: String, val extension: String): @ProtoOneOfNumber(3) IPhone

Now serialization can infer that HomePhone has field number 5 for Data2 (because Data2 uses type Iphone2 with type annotation 5) and field 2 for Data.

If we have additional field numbers we need another interface. But this is analogous to, but cheaper than, creating another HomePhoneType class which is required now. The current solution requires a proliferation of the intermediate type classes:

@Serializable
class HomePhoneType2(
  @ProtoNumber(2) val homePhone: HomePhone,
): IPhoneType

@Serializable
class HomePhoneType5(
  @ProtoNumber(5) val homePhone: HomePhone,
): IPhoneType

...

And you will need to modify the existing class HomePhone for any new defined message that use it. It may be not achievable in some multi-module projects, and it is a violation of the Open–closed principle.

Fair point, but this is a Kotlin modeling issue, not a serialization issue. This same logic applies to models that would be serialized to JSON as well. The point here is to give the modeler the choice on how to create their models. If the modeler wants an intermediate type class, go ahead. But many use cases do not require it. Its certainly unusual in my experience to have a single class participate in multiple sealed hierarchies, so this would almost never be a concern in any of my projects, and I suspect most projects. That being said, I acknowledge here that the need to define the ProtoOneOfNumber increases the chance of requiring multiple sealed class hierarchies solely to define that value. But again, the modeler can fall back on the intermediate type class approach if desired.

@pdvrieze
Copy link
Contributor

pdvrieze commented Feb 6, 2025

and the latter of which uses a cleaner schema with no trace of the intermediate type:

syntax = "proto2";


message Data {
  required string name = 1;
  oneof phone {
    HomePhone homePhone = 2;
    WorkPhone workPhone = 3;
  }
}

message HomePhone {
  required string foo = 1;
}

message WorkPhone {
  required string bar = 1;
}

Actually this is not how the ProtoBuf format serializes. Data is serialized as:
<struct (size...)
1: string name
2 or 3: struct (HomePhoneType or WorkPhoneType) {
1: struct { string number }
} or {
1: struct { string number; string extension }
}

Or in protobuf:

syntax = "proto2";
 
message Data {
   required string name = 1;
   oneof phone {
     HomePhoneType homePhone = 2; //note names disappear in the wire format
     WorkPhoneType workPhone = 3;
   }
 }
 
message HomePhoneType {
  required homePhone: HomePhone;
}

message WorkPhoneType {
  required workPhone: WorkPhone;
}

 message HomePhone {
   required string number = 1;
 }
 
 message WorkPhone {
   required string number = 1;
   optional string extension = 2;
 }

What @ProtoOneOf allows is to use the proto number to determine the serialized child type, rather than storing that type information as a string (with yet another box for the type/value pair).

@rocketraman
Copy link
Author

Actually this is not how the ProtoBuf format serializes. Data is serialized as:

Are you sure? When I use this input:

@Serializable
data class Data(
  @ProtoNumber(1) val name: String,
  @ProtoOneOf val phone: IPhoneType?,
)

@Serializable
sealed interface IPhoneType

@Serializable
class HomePhoneType(
  @ProtoNumber(2) val homePhone: HomePhone,
): IPhoneType

@Serializable
class WorkPhoneType(
  @ProtoNumber(3) val workPhone: WorkPhone,
): IPhoneType

@Serializable
data class HomePhone(val number: String)

@Serializable
data class WorkPhone(val number: String, val extension: String)

then the output of:

println(ProtoBufSchemaGenerator.generateSchemaText(Data.serializer().descriptor))

is:

syntax = "proto2";

message Data {
  required string name = 1;
  oneof phone {
    HomePhone homePhone = 2;
    WorkPhone workPhone = 3;
  }
}

message HomePhone {
  required string number = 1;
}

message WorkPhone {
  required string number = 1;
  required string extension = 2;
}

and encoding a protobuf like this:

println(ProtoBuf.encodeToByteArray(Data("foo", HomePhoneType(HomePhone("5551212")))).toHexString())
// result: 0a03666f6f12090a0735353531323132

and decoding that with protoc:

# protoc --decode_raw < Data.bin
1: "foo"
2 {
  1: "5551212"
}

# protoc --decode=Data --proto_path=(pwd) Data.proto < Data.bin
name: "foo"
homePhone {
  number: "5551212"
}

@pdvrieze
Copy link
Contributor

pdvrieze commented Feb 6, 2025

@rocketraman You are right. It would be good to avoid the single attribute wrapper (I believe it was introduced because @ProtoNumber only works on attributes. The wrapper approach also allows the number used to be independent from the type (which would require it to be the same globally).

The other factor is that in protobuf schemas the oneOf concept only requires mutual exclusion, but not a unified type.

@xiaozhikang0916
Copy link
Contributor

First of all, I agree that the wapper type here in oneof is quite redundent and would be good to omit it. But this implementation is the best one for now.
The point is that, type declaration in Kotlin cannot match the needs of ProtoBuf oneof field without the help of Struct in C or Union Type in TypeScripe, so an intermediate wrapper type is introduced as the isolation between the type using oneof message and the inner type.

Indeed, directly implementing sealed interface in the inner oneof Kotlin data class (lets call it the simple impl from here) make it simpler in some case, but it has some deadly limitations

  1. violation of OCP, as described before, and
  2. implementing sealed interface across module is impossible.

So the simple impl would make it impossible to use oneof if data definitions are devided into multi modules, which is our produce project looks like now.

but this is a Kotlin modeling issue, not a serialization issue.

That is true, but

The point here is to give the modeler the choice on how to create their models.

not agreed. IMO, modeling in ProtoBuf format is to create definitions in .proto files, not Kotlin data classes. Modeler is free to choose if using oneof keyword in messages. What should be done in kt class is to translate the proto message into a kt model. You can declare the kt class in the current oneof impl, or just declare all the oneof fields in plain way, losing the compile-time gurantee of fields checking.

To sum up, the current impl is not the perfect solution, but what I can do most for now to make it functionable for most cases, both in simple project and multi-module project.

@glureau
Copy link

glureau commented Feb 7, 2025

I'm in the process of writing my own serializer tool, partially due to the lack of flexibility with the oneOf in ktx serialization. In my project, I'm using an annotation to be able to specify proto numbers of each implementations (open polymorphism on multiple modules). If ever it can help, it looks like that

@ProtoPolymorphism(
    MyInterface::class, // Define on a "common" module
    [
      Pair(MyFirstSubClass::class, 1), // Implements MyInterface in a module A
      Pair(MySecondSubClass::class, 2) //  Implements MyInterface in a module B
    ]
)
private object K2PBPolymorphismConfigHolder // Define on my app module

This is allowing a better class re-usability, and a definition of those proto numbers for many different cases. (More similar to the protobuf schema description.)

Note that I'm producing a message MyInterface which is only composed of the oneOf

message MyInterface {
    oneof myinterface {
        MyFirstSubClass myFirstSubClass = 1;
        MySecondSubClass mySecondSubClass = 2;
    }
}

@pdvrieze
Copy link
Contributor

pdvrieze commented Feb 7, 2025

@glureau You could just create your own variant of the Proto format that has your own annotations/semantics. But indeed, it is possible to specify mappings in an annotation, but said annotation will have to live with the limitations that annotations have (Pair is not valid)

@glureau
Copy link

glureau commented Feb 7, 2025

Oh i'm sorry, I forgot to share the annotation declaration

@Repeatable
public annotation class ProtoPolymorphism(
    val parent: KClass<*>,
    val oneOf: Array<Pair>,
) {
    public annotation class Pair(val kClass: KClass<*>, val number: Int)
}

(Didn't find a better way to write it for now, but I just wanted to share the idea of 1 annotation in the "final" module to declare the polymorphism/oneof setup.)

@xiaozhikang0916
Copy link
Contributor

@glureau can you share your kt data class declaration of these messages?

@glureau
Copy link

glureau commented Feb 7, 2025

There's nothing special in those classes, you can have a look here, they are just marked with the equivalent of @Serializable.
Subclasses are not aware they are used in one or multiple oneOf. It's when you want to build your configuration (similarly to Json { pretty=true }, likely at app level) that you can register the oneOf declarations. (And that registration could be done automatically indeed.)

@pdvrieze
Copy link
Contributor

pdvrieze commented Feb 7, 2025

I'm not sure that serialization supports repeatable annotations (it didn't, but with k2 that may no longer be an issue), that would work to replace Pair (as you can have an array of them)

@sandwwraith
Copy link
Member

Thanks everyone for the discussion. It seems there are no action points to make here, as alternative designs require annotation definitions that are not currently supported by kotlinx.serialization. I'll leave this open as a reference though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants