@mbergo mbergo commented Apr 18, 2025

This commit adds helper functions for EndpointSlice consumers to make it easier to transition from Endpoints to EndpointSlices. The new package provides:

  1. EndpointSliceConsumer - Core component that tracks EndpointSlices and provides a unified view of endpoints for a service
  2. EndpointSliceInformer - Informer-like interface for EndpointSlices
  3. EndpointSliceLister - Lister-like interface for EndpointSlices

These helpers handle the complexity of merging multiple slices for the same service and deduplicating endpoints that might appear in multiple slices.

Benefits:

  • Easier migration from Endpoints to EndpointSlices with familiar interfaces
  • Simplified handling of multiple slices without manual merging and deduplication
  • Improved performance by leveraging the scalability of the EndpointSlice API
  • Consistent view of endpoints even as they move between slices

Fixes #124777

Signed-off-by: Mad Bergo <[email protected]>
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Apr 18, 2025
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Apr 18 13:34:58 UTC 2025.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 18, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot
Contributor

Hi @mbergo. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 18, 2025
@k8s-ci-robot k8s-ci-robot requested review from aroradaman and tnqn April 18, 2025 19:34
@mbergo
Author

mbergo commented Apr 19, 2025

/cc @kubernetes/sig-network-pr-reviews

@k8s-ci-robot
Contributor

@mbergo: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-network-pr-reviews.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @kubernetes/sig-network-pr-reviews

@k8s-ci-robot
Contributor

@mbergo: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-pr-reviews

In response to this:

/cc @kubernetes/sig-network-pr-reviews

@mbergo
Author

mbergo commented Apr 19, 2025

maybe like this. cc: @kubernetes/sig-network-pr-reviews

@k8s-ci-robot
Contributor

@mbergo: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-pr-reviews

In response to this:

maybe like this. cc: @kubernetes/sig-network-pr-reviews

// deduplicating endpoints from all EndpointSlices for the service.
func (l *endpointSliceNamespaceLister) GetEndpoints(serviceName string) ([]discovery.Endpoint, error) {
	// Get all EndpointSlices for the service
	_, err := l.Get(serviceName)
Contributor

This is AI-generated, right?

/hold

Author

The first version was written manually, but I used my own AI from my IBM days for a review, which made three lines of fixes. Can you tell me what you saw that made you think that? Actually, it did nothing that I wouldn't have done nowadays, or when I was at Google working on this or Borg. But I'm interested now, since it was trained only on my codebase and my style, with my own constraints from 23 years of doing this.

Nevertheless, nice catch @danwinship.

Author

Nevertheless, I have a 4h video for you of the time I spent on this; I just did not get any suggestions for modifications. I did not think of my AI review as more than a lint or LSP written by me.

Contributor

OK, was worried that since you are a new contributor (ie, not a member of the kubernetes github org), and this has AI fingerprints, that maybe that meant you just let the AI write the whole thing and you didn't actually understand any of the code and wouldn't be able to usefully respond to code review. Glad that's not the case.

(The particular reason I pointed out this line is that it's a no-op; it gets the EndpointSlices but discards the result, and then a few lines below calls l.consumer.GetEndpoints() which then re-fetches the EndpointSlices again.)

Contributor

oh... ok, looking closer, this is pretty weird; it uses the underlying lister to get the slices, feeds them into the consumer via OnEndpointSliceAdd, and then asks the consumer to list the slices...

So this wouldn't actually work right, since nothing deletes slices from the consumer when they get deleted, so every call to GetEndpoints would end up returning both current and old endpoints.

We shouldn't actually be using an underlying discoverylisters.EndpointSliceNamespaceLister here; we should implement API directly via an underlying consumer/informer.
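
To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the PR's code: `epSlice`, `addOnlyCache`, and `getEndpoints` are hypothetical stand-ins written in plain Go so the example runs on its own, and it assumes the cache is only ever fed through an add hook, as the comment describes.

```go
package main

import "fmt"

// epSlice is a hypothetical stand-in for discovery.EndpointSlice,
// reduced to the two fields this sketch needs.
type epSlice struct {
	Name      string
	Addresses []string
}

// addOnlyCache mimics a consumer that is only fed through an "add"
// hook and never hears about deletions.
type addOnlyCache struct {
	slices map[string]*epSlice
}

func (c *addOnlyCache) add(s *epSlice) { c.slices[s.Name] = s }

// addresses lists every address the cache still remembers.
func (c *addOnlyCache) addresses() []string {
	var out []string
	for _, s := range c.slices {
		out = append(out, s.Addresses...)
	}
	return out
}

// getEndpoints re-feeds the currently existing slices into the cache
// on every call and then reads the cache back.
func getEndpoints(c *addOnlyCache, current []*epSlice) []string {
	for _, s := range current {
		c.add(s)
	}
	return c.addresses()
}

func main() {
	c := &addOnlyCache{slices: map[string]*epSlice{}}

	// First call: one slice exists.
	getEndpoints(c, []*epSlice{{Name: "svc-abc", Addresses: []string{"10.0.0.1"}}})

	// "svc-abc" has since been deleted and replaced by "svc-def",
	// but nothing removed it from the cache.
	got := getEndpoints(c, []*epSlice{{Name: "svc-def", Addresses: []string{"10.0.0.2"}}})
	fmt.Println(len(got))
}
```

In this sketch the second call still returns the address from the deleted slice alongside the current one, which is the stale-endpoint behavior described above.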

Author

Weird, this is one of the few parts where I remember the slice being deleted. My bad, I will take a closer look.

Author

@mbergo mbergo Apr 22, 2025

The method calls l.Get(serviceName) to retrieve all EndpointSlices associated with the given service
The results are discarded (note the _), but the error is captured
This is likely because the Get() call is used to populate the consumer's cache, as seen in the Get() method implementation where it calls l.consumer.OnEndpointSliceAdd(slice)

but

However, there's a potential issue in the selected code:
The results from l.Get() are discarded with _
If l.Get() returns an error, it will be returned before the cache is populated
This means we might miss endpoints if there's a temporary error

therefore:

func (l *endpointSliceNamespaceLister) GetEndpoints(serviceName string) ([]discovery.Endpoint, error) {
	// Get all EndpointSlices for the service
	slices, err := l.Get(serviceName)
	if err != nil {
		return nil, err
	}

	// Double check the cache was populated
	for _, slice := range slices {
		l.consumer.OnEndpointSliceAdd(slice)
	}

	// Get the merged endpoints from the consumer
	serviceNN := types.NamespacedName{
		Namespace: l.lister.Namespace(),
		Name:      serviceName,
	}
	return l.consumer.GetEndpoints(serviceNN), nil
}

The main purpose of this code is to provide a unified and efficient way to handle Kubernetes service endpoints at scale. Here's why it exists:

Primary Purpose:

Solves the scalability limitations of the older Endpoints API by breaking large endpoint sets into smaller, manageable slices
Provides a clean abstraction layer for consuming EndpointSlices without dealing with the complexity of multiple slices per service

cc: @danwinship

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2025
@aojea
Member

aojea commented Apr 21, 2025

Adding libraries to staging repos has shown that it is not the best place for them, since people expect these APIs to be frozen, and the experience is that there is no control over compatibility breaking across versions, which puts people through a lot of pain to revendor the code... I do not think staging repos are the best place for this kind of thing unless you are willing to provide stable APIs for these helpers.

@mbergo
Author

mbergo commented Apr 22, 2025

Adding libraries to staging repos has shown that it is not the best place for them, since people expect these APIs to be frozen, and the experience is that there is no control over compatibility breaking across versions, which puts people through a lot of pain to revendor the code... I do not think staging repos are the best place for this kind of thing unless you are willing to provide stable APIs for these helpers.

I think there is a deeper problem behind what you said: could all this not be due to the release rate you guys think is maintainable? Let's be a bit audacious and compare this codebase with the Linux kernel. As an early contributor there too, I think Linux's merge windows for patches are simply agile, and that comes after well-matured guardrails.

Without a long debate, here is what I am saying: I came to help and to see why k8s adoption has been crashing since I started with it. I am open to suggestions to improve the PR, but so far I have only gotten criticism?

Any suggestions for a better approach, or at least a reason why you think people expect that from your project? cc: @aojea

@mbergo
Author

mbergo commented Apr 22, 2025

Overall, you guys are showing me one thing: it might be time to invest in a good K8s alternative. It is becoming hell to work with, with such subtle bugs arising from all the approaches you guys are taking, which make it almost impossible to tell whether a problem is an error in the chaotic configuration architecture or a small bug in one of the many components that share responsibilities.

Release after release, no palpable, user-friendly improvements.

@danwinship
Contributor

re staging vs not staging, Antonio and I talked about this. I think for the initial version of this API, we should put it internal to k8s.io/kubernetes, and wait a few releases to make sure we're happy with the API before moving it to a staging repo where it would become harder to change the API if it turned out there were problems


// EndpointSliceConsumer provides a unified view of endpoints for services
// across multiple EndpointSlice objects.
type EndpointSliceConsumer struct {
Contributor

I don't think there's any advantage in having this type be exposed; it should just be an implementation detail of the informer and lister.

(I'm not sure we really need the informer and lister as separate types either, and if you squashed them together, then there's no need for the Consumer as its own type at all.)

result := make([]*discovery.EndpointSlice, 0, len(slices))
for _, slice := range slices {
	result = append(result, slice.DeepCopy())
}
Contributor

Standard informer/lister semantics are that they don't copy the data, and just document that you aren't allowed to modify the return values. That makes them a little bit tricky, but it's much better for memory usage...

Author

Neat approach, I like it!

// Sort slices by name for consistent results
sort.Slice(result, func(i, j int) bool {
	return result[i].Name < result[j].Name
})
Contributor

likewise, don't do this; if the caller wants them sorted, they can do that, but they may not care at all

Author

👍


// GetEndpoints returns all endpoints for a service, merging and deduplicating
// endpoints from all EndpointSlices for the service.
func (c *EndpointSliceConsumer) GetEndpoints(serviceNN types.NamespacedName) []discovery.Endpoint {
Contributor

I don't think this will end up working as an API, because it discards the Ports... (In most cases, all of the EndpointSlices have identical ports, but if you're upgrading pods and moving them from an old port to a new port, then there will be one slice with the pods using the old port number and one slice with the pods using the new port number, and they'd have different Ports and you need to know which Endpoints go with which Ports.)

(Also, again, this function necessarily has to do a bunch of memory allocations, whereas many EndpointSlice consumers would be able to just iterate over the return value from GetEndpointSlices and do what they need to do without needing equivalent allocations.)
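
The port-mismatch case above can be sketched in a few lines. This is only an illustration under stated assumptions: `port`, `endpoint`, `epSlice`, `backend`, and `backends` are hypothetical stand-ins, not the discovery/v1 types (for example, the real `EndpointPort.Port` is a pointer), and the sketch simply walks the slices directly so that each endpoint stays paired with the ports of the slice it came from.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for discovery.EndpointPort,
// discovery.Endpoint and discovery.EndpointSlice.
type port struct {
	Name string
	Port int32
}

type endpoint struct {
	Addresses []string
}

type epSlice struct {
	Name      string
	Ports     []port
	Endpoints []endpoint
}

// backend pairs one address with one port taken from the same slice.
type backend struct {
	Addr string
	Port int32
}

// backends iterates over the slices directly instead of flattening
// them into a bare endpoint list, so the port information survives.
func backends(slices []*epSlice) []backend {
	var out []backend
	for _, s := range slices {
		for _, ep := range s.Endpoints {
			for _, addr := range ep.Addresses {
				for _, p := range s.Ports {
					out = append(out, backend{Addr: addr, Port: p.Port})
				}
			}
		}
	}
	return out
}

func main() {
	// Mid-rollout: old pods listen on 8080, new pods on 9090.
	slices := []*epSlice{
		{
			Name:      "svc-old",
			Ports:     []port{{Name: "http", Port: 8080}},
			Endpoints: []endpoint{{Addresses: []string{"10.0.0.1"}}},
		},
		{
			Name:      "svc-new",
			Ports:     []port{{Name: "http", Port: 9090}},
			Endpoints: []endpoint{{Addresses: []string{"10.0.0.2"}}},
		},
	}
	for _, b := range backends(slices) {
		fmt.Printf("%s:%d\n", b.Addr, b.Port)
	}
}
```

A merged list of endpoints alone would lose the information that 10.0.0.1 belongs with 8080 and 10.0.0.2 with 9090, which is the concern raised in the comment.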

Author

Let me do a major refactor with that in mind.

		err = fmt.Errorf("expected EndpointSlice name and namespace to be set: %v", endpointSlice)
	}
	return types.NamespacedName{Namespace: endpointSlice.Namespace, Name: serviceName}, endpointSlice.Name, err
}
Contributor

Not having a discovery.LabelServiceName isn't an "error", it just means the slice doesn't correspond to a Service (so we don't need to track it, but we don't need to log anything).

endpointSlice.Namespace == "" || endpointSlice.Name == "" is not possible for an object that came from the apiserver, so you don't need to worry about that.
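
One way to express "not an error, just not ours" is a lookup that returns an ok flag instead of an error. The sketch below is a guess at the shape, not the PR's code: `serviceKey` and `serviceFor` are hypothetical names, `serviceKey` stands in for types.NamespacedName, and the constant mirrors the value of discovery.LabelServiceName. Treating an empty label value the same as a missing label is an assumption of this sketch.

```go
package main

import "fmt"

// labelServiceName mirrors the value of discovery.LabelServiceName.
const labelServiceName = "kubernetes.io/service-name"

// serviceKey is a hypothetical stand-in for types.NamespacedName.
type serviceKey struct{ Namespace, Name string }

// serviceFor reports which Service a slice belongs to, based on its
// namespace and labels. ok is false when the slice carries no service
// label; callers skip such slices without logging or failing.
func serviceFor(namespace string, labels map[string]string) (key serviceKey, ok bool) {
	name, found := labels[labelServiceName]
	if !found || name == "" {
		return serviceKey{}, false
	}
	return serviceKey{Namespace: namespace, Name: name}, true
}

func main() {
	key, ok := serviceFor("default", map[string]string{labelServiceName: "web"})
	fmt.Println(key, ok)

	// A slice that is not tied to any Service is simply ignored.
	_, ok = serviceFor("default", map[string]string{"app": "web"})
	fmt.Println(ok)
}
```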

)

// This example demonstrates how to use the EndpointSliceConsumer directly.
func Example_directUsage() {
Contributor

We don't generally mix example code in with implementation code. I think we don't generally provide examples at all...

Author

old habits die hard...sorry


// nodeName is the name of the node this consumer is running on.
// Used to determine if an endpoint is local.
nodeName string
Contributor

Caring about local endpoints only is specific to kube-proxy and shouldn't be part of the generic API.

Author

Any special reason?

Comment on lines +177 to +181
// If we already have this endpoint, only replace it if the existing one
// is not local but the new one is
existingEp, exists := endpointMap[key]
isLocal := endpoint.NodeName != nil && *endpoint.NodeName == c.nodeName
existingIsLocal := exists && existingEp.NodeName != nil && *existingEp.NodeName == c.nodeName
Contributor

(we don't want any of this)

// NewEndpointSliceInformer creates a new EndpointSliceInformer.
func NewEndpointSliceInformer(
	informerFactory informers.SharedInformerFactory,
	nodeName string,
Contributor

As above, we don't want nodeName in the API.

Though, there may be a use case for having Namespace-specific informers/listers rather than watching all namespaces.

@mbergo
Author

mbergo commented Apr 22, 2025

I can try to get the entire code commented @danwinship. I am not new, I just don't have a Google email anymore. So I am trying to do everything I can to help with this, since it took us years to get from the ground up to replace what is still there.

I am currently working on three codebases, counting this one ... some Rust practices might mingle with Go, and you see some weird stuff, sorry about that, but I hope that you got that I used as a linter, not as an AI, since you happened to point out the one place I had a mistake and used, I asked. But there are tools to read code and tell if it is AI-generated. I don't know of any Neovim copilot that can work in such a codebase. If you know of one, please let me know, as it would save me some time.

Overall, between saying that wasn't supposed to be out here and saying this doesn't work, should I mind fix and explain what is doing what and why, or should I just close this PR? In 23 years, I've had my share of gatekeepers, and I need to choose my battles. Please let me know what your intention is, sir, and I will give it one more try... or not. cc @danwinship

@danwinship
Contributor

I can try to get the entire code commented

I think the existing level of commenting is fine.

you see some weird stuff, sorry about that, but I hope that you got that I used as a linter, not as an AI

Yeah, I had been dealing with someone else's bad AI-generated PR somewhere else earlier this week, and then some weird inexplicable spam "bugfix" PR in k/k just before yours. And so then when two medium-sized PRs show up from a previously-unknown contributor, which mostly involve the sorts of "recombining existing bits of code in new ways" sort of stuff that AI is good at, I was suspicious, and worried that if I spent time reviewing the PR it would be wasted because the submitter didn't actually understand the code and wouldn't be able to fix it.

(And in a perfect world, I would have been more charitable, but in a perfect world we wouldn't have people sending us spam PRs to bump up their github stats so I wouldn't have had any reason to be suspicious in the first place.)

Overall, between saying that wasn't supposed to be out here

If you mean Antonio's comment, that's just about where this code goes initially (though his comment wasn't terribly clear about that). We do need the code somewhere.

and saying this doesn't work

GetEndpoints returning a []discoveryv1.Endpoint doesn't work, but GetEndpointSlices does.

Basically, we want an informer and lister that operate on []*discoveryv1.EndpointSlice rather than *discoveryv1.EndpointSlice.

Your EndpointChangeHandler and endpointSliceNamespaceLister.Get are mostly right from an external API perspective, but internally they need to (1) always return pointers to the underlying informer's EndpointSlices rather than copying them, and (2) always(*) keep the Service-name-to-slice-array cache up-to-date so they never have to do an O(n) scan of all of the EndpointSlices.

(*) maybe not before the initial cache sync
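
As a rough illustration of points (1) and (2), here is a sketch of a Service-name-to-slice-array index kept current by add, update, and delete callbacks. Everything in it is hypothetical: `epSlice`, `serviceKey`, `index`, and the `onAdd`/`onUpdate`/`onDelete` names are stand-ins chosen for this example, the callbacks are assumed to be wired to the underlying informer's event handlers, and locking is left out, so it is only safe from a single goroutine as written.

```go
package main

import "fmt"

// serviceKey stands in for types.NamespacedName.
type serviceKey struct{ Namespace, Name string }

// epSlice stands in for *discovery.EndpointSlice; Service is the
// value of the slice's service-name label.
type epSlice struct{ Namespace, Name, Service string }

// index maps each Service to the slices that currently back it.
type index struct {
	byService map[serviceKey][]*epSlice
}

func newIndex() *index {
	return &index{byService: map[serviceKey][]*epSlice{}}
}

func keyOf(s *epSlice) serviceKey {
	return serviceKey{Namespace: s.Namespace, Name: s.Service}
}

// without returns a fresh array holding cur minus the named slice, so
// arrays already handed to callers are never modified in place.
func without(cur []*epSlice, name string) []*epSlice {
	next := make([]*epSlice, 0, len(cur)+1)
	for _, s := range cur {
		if s.Name != name {
			next = append(next, s)
		}
	}
	return next
}

// onAdd records a slice, replacing any earlier entry with the same name.
func (ix *index) onAdd(s *epSlice) {
	k := keyOf(s)
	ix.byService[k] = append(without(ix.byService[k], s.Name), s)
}

// onDelete forgets a slice, which is the step an add-only cache lacks.
func (ix *index) onDelete(s *epSlice) {
	k := keyOf(s)
	next := without(ix.byService[k], s.Name)
	if len(next) == 0 {
		delete(ix.byService, k)
		return
	}
	ix.byService[k] = next
}

// onUpdate also covers a slice whose service label changed.
func (ix *index) onUpdate(old, cur *epSlice) {
	ix.onDelete(old)
	ix.onAdd(cur)
}

// slicesFor is a single map lookup, not a scan over all slices. It
// returns the shared pointers; callers must treat them as read-only.
func (ix *index) slicesFor(k serviceKey) []*epSlice {
	return ix.byService[k]
}

func main() {
	ix := newIndex()
	a := &epSlice{Namespace: "default", Name: "web-abc", Service: "web"}
	b := &epSlice{Namespace: "default", Name: "web-def", Service: "web"}
	ix.onAdd(a)
	ix.onAdd(b)
	ix.onDelete(a)

	got := ix.slicesFor(serviceKey{Namespace: "default", Name: "web"})
	fmt.Println(len(got), got[0] == b)
}
```

The lookup returns the same pointers that were stored, with no copying, and a deleted slice disappears from the result. How the index should behave before the initial cache sync is left open here, as it is in the comment.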

should I mind fix and explain what is doing what and why

Yes, please, we do want this code, it just needs to be right.

@danwinship
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 24, 2025
@k8s-ci-robot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • 354c12e Add EndpointSlice consumer helper functions

@k8s-ci-robot k8s-ci-robot added the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Jul 15, 2025
@k8s-ci-robot
Contributor

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mbergo
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aroradaman
Member

@mbergo I think you accidentally pushed a merge commit to this branch.

@mbergo
Author

mbergo commented Jul 15, 2025

Sorry @aroradaman, I will properly rebase and address the latest comment. Does it still apply?

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add EndpointSlice consumer helper functions
5 participants