-
Notifications
You must be signed in to change notification settings - Fork 477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature request] Topology Spread Constraints & Per-Deployment Affinity #856
Comments
I think what we are looking for here is more description of the the actual behaviour vs. the desired behaviour. What is happening now in the cluster, why is that bad, and what would make that better? Have you looked into Profiles for affinity yet? I didn't see you mention it in your issue. Regards, Alex |
Hey @alexellis Right now I have no guarantee of regional, zonal, or nodal high availability on services. We currently use EKS with spot instances preferred, and part of how EKS works is that a node might be gracefully drained and cordoned at any time when the node's underlying spot instance is reclaimed. As such, making sure multiple pods for each service exist and that they're not on the same node (and preferably not in the same zone) is a strong desire. We currently do this through Topology Spread Constraints. Typically you need to set something like: labelSelector:
app: gateway Per app, and with the current way affinity is exposed, I don't think I have enough expressivity to easily select OF components and specify their desired affinities, especially if they differ in some way now or down the road.
Interesting; I never really looked into profiles, as the CRD exposes up most of what I want, and since it's easy enough to write a helm chart with values per-environment, this really hasn't come up. Now that I look at profiles, I guess I'm wondering why I'd use them instead of a CRD. One use case I can think of is reasonable defaults for a set of functions, but our helmfile setup already has defaults on top of a custom chart, which already has defaults. Don't really think a 3rd layer would be particularly advantageous here at this time, and would instead probably prefer what profiles exposes exposed in the CRD. On that note, I'm starting to wonder if it would be easier for the CRD to simply expose a deployment spec or something, because manually exposing up every option that kubernetes exposes to a deployment seems like a never-ending chore. |
I don't follow the comparison of "CRD" vs Profiles. You can use the OpenFaaS REST API or Function CRD to create your function, there's an annotation to tie that function to a Profile. Profiles enable affinity/anti-affinity rules. Clarify for me: Is this request for functions or the core services? Do you also run the core services on spot instances? |
To me, both just look like different ways to specify certain things in k8s config. I've only given a cursory look to profiles, but they look to me like a way to apply multiple sets of certain k8s config to multiple functions, similar to having multiple queue workers for different "types" of functions. Personally, I don't really see the need for profiles as I understand them now, as if the currently exposed options:
were just part of the deployment created when the CRD is applied, it doesn't seems like the profiles would add anything. Those are all just normal k8s config options, just like That, and if, for example, there was some kind of overlap between what the CRD for the function would expose, and what the profiles would expose. For example, if the CRD let you specify a
This is for the core services, because currently we don't expect functions to be always present (we expect to use scale-to-zero as much as reasonably possible for a given workload), and as such are not particularly concerned with the spread of the pods at this time. It will probably come, and would definitely be better to have than not, but right now the focus is core services.
Yes; with the exception of OpenFaaS, all of our current workloads are highly available and individual pods are "stateless", i.e. assuming that a pod handling a workload is not interrupted by a The question/issue that I have is that in regards to queues in OpenFaaS, I've noticed in local testing that terminating certain pods seems to cause queues to be cleared while they should still have a workload. That's definitely not ideal, and I would like to know if that is expected given the current implementation/arch, and if so, whether or not persistent queues could be implemented. I've only recently begun to look into that, but I expect to create a separate issue for that once I do a little more digging. Assuming that a queue is in-memory, and that termination of a specific core service, graceful or not, currently clears a queue, then perhaps a PVC or the like could be used to flush the contents of a queue during graceful termination, or some kind of "hand-off", where pod A (draining) waits for pod B (pending/running) to come online, and once available, pod A sends its queue to pod B. |
ftr I could probably pretty easily add these to the chart, as long I know whether or not we'd like to use global defaults or something. |
just a small thing while i read and think about the the thread. The comment about the queue persistence probably deserves its own issue because it will require us to modify this
file and sql , per this doc https://docs.nats.io/nats-streaming-server/configuring/persistence but both of these options come with their issues that need to be carefully considered and we would need to provide a lot more documentation about error states
|
Maybe a cluster would solve for both NATS high availability and this shared state? Assuming you could mount a file via a PVC and let RAFT take care of the rest, would allow for multiple NATS pods at once, and be persistent, killing two birds with one stone. A SQL database running in k8s would probably have the same issue, meaning I'd either have to set up an external k8s database, or it would need some kind of k8s-level persistence, so probably just more config for the same end value. Although, you could theoretically create a db in the OF chart, allowing you to turn it off and change the endpoint, port, and auth. That would basically give you in-memory storage by default (the db is lost upon termination), and persistence if you want it (the db is remote). PVC and cluster sounds simpler, though. |
That said, using it in AWS in a (potentially) multi-regional, multi-zonal, multi-nodal spread, EFS or FSx are probably the best choices, as both appear to support multi-zone/region use cases. People installing a NATS cluster and expecting multi-zonal persistence will likely need something similar, which is potentially an extra wrench to throw in, but we would probably be able to handle the persistence just fine with EFS; people say "it's kinda slow" at lower workloads, but I don't think that's necessarily a big deal for something like a queue. I imagine most people who need both HA and multi-AZ persistence have a good storageclass option at their disposal from their preferred k8s implementation/cloud provider. |
I have broken out the NATS issues discussed into two separate issues, persistence and high availability; that way we can keep discussion focused to Topology Spread and Affinity. |
@alexellis if my proposal of just exposing up different For the sake of keeping the existing pattern: # ./values.yaml
affinity: {...} # [current] global "default"
topologySpreadConstraints: # [proposed] global "default"
- {...}
- {...}
gateway:
affinity: {...} # [proposed] overrides global
topologySpreadConstraints: # [proposed] overrides global
- ...
labelSelector:
app: gateway
- ...
labelSelector:
app: gateway
faasIdler:
affinity: {...} # [proposed] overrides global
topologySpreadConstraints: # [proposed] overrides global
- ...
labelSelector:
app: faas-idler
- ...
labelSelector:
app: faas-idler |
The additional context is useful, but please don't take us not getting to requests within 24 hours meaning that we are giving a silent approval on the design. |
Understood. Not trying to pressure you, I'd just rather be a little noisy in a (hopefully) useful way than quiet in an unuseful way; not trying to signal that I think the response is slow. |
@kevin-lindsay-1 your proposal here #856 (comment) is this to apply just to the openfaas components (gateway, provider, nats, etc) or is this a global value to apply to all functions? it looks like it is just the components, right? |
@LucasRoesler only the core OF components in That, and most of our functions at this time are primarily async. |
in that case, it seems like we could treat this the same as affinity, tolorations, and nodeSelector. These are empty and omitted by default, but the template can provide the required values to insert them where needed. This is essentially what you proposed here #856 (comment) |
@LucasRoesler yep. only thing is that I don't think a single global rule will work for either TSC or Affinity, as you need to select the pods, which typically needs a As it stands, I'm not even sure that I could really use the existing global affinity rule; assuming that you need a selector of some sort that differs per deployment, I don't even think you could make a single affinity rule that would work. If something exists that would make me not need to have different affinity rules per service, I would consider it, but as far as I'm aware something like that does not exist, so I'm not even sure what I could use a global affinity rule for. The same gotcha applies to TSCs, to my knowledge. |
yeah, but perhaps we could automatically add the deployment name as a label to global constraint or something like that to make writing a generic constraint a bit easier and less repetitive. So you can write a constraint, then the final constraint will be what was written + one additional label. So, if the global TSC is topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar then the helm chart generates topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
app: gateway for the gateway deployment, and likewise for the others. Having a Deployment specific override would still be useful though. |
I would be fine with the global rules automatically adding their respective label selectors; for the most part there's no real reason to have the overrides if those work appropriately, as most, if not all of the core services would have the same affinity/spread needs. Whether or not that's a breaking API change for affinity is a different question. This is also possibly out of scope for this issue, but It may also be useful to have default preferred affinity between certain core services, so that for example the core services might be preferred to be scheduled on the same node where possible, rather than spread throughout a cluster unnecessarily. Might throw a wrench in, though, because for example I have an external NATS cluster, which may need its own preferences covered. Should theoretically be able to be scheduled with the other pods, though, just like anti-affinity/topology spread. I could probably set the NATS cluster's affinity to match OF pods, though, so may be a moot point. |
Just some book keeping here, for future reference etc. This is related but different from #815 which is asking for the same TSC configuration but at the pod level |
Kevin mentioned that this became more important since his team moved to using spot instances. I know that other openfaas users make use of spot instances, or GKE's preemptive VMs. The spread or affinity rule would schedule the three replicas of the gateway onto three separate nodes before adding duplicate replicas to the same hosts again, so that if a VM is reclaimed by the cloud provider, there are other replicas available to serve requests whilst a new host is provisioned.
So, is there a way we can achieve this result without making the chart much more verbose and complex to manage? i.e. with a single rule that templates the deployment labels etc into the right place? |
I do not currently have a need for overrides if the label selectors were automatically added per deployment. I would also consider a default affinity rule for services to affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: app
operator: In
values:
- openfaas This rule basically says "I prefer to be on a node with openfaas components". and I think the following is basically the same: affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- openfaas
topologyKey: kubernetes.io/hostname I believe the current non-templated labels |
I can see labels, so could you help me understand which labels are not being added? Perhaps you could show an example of what you got and what you expected?
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
meta.helm.sh/release-name: openfaas
meta.helm.sh/release-namespace: openfaas
creationTimestamp: "2021-11-01T17:50:20Z"
generation: 1
labels:
app: openfaas
app.kubernetes.io/managed-by: Helm
chart: openfaas-8.1.3
component: gateway
heritage: Helm
release: openfaas
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: gateway
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
annotations:
prometheus.io/port: "8082"
prometheus.io/scrape: "true"
creationTimestamp: null
labels:
app: gateway I can see |
@alexellis I'm referring to what @LucasRoesler was thinking on #856 (comment). Effectively, the chart could have a global and not have the need for an override if each component's affinity/tsc rules templated in their rules. For example:
Right now, I would probably not have a need for per-component overrides, as long as the fact that each component would be referring to itself doesn't conflict with other possible affinity/tsc settings a user might want to reasonably make, which I can't really think of off the top of my head. Seems like most use cases for spread with be trying to spread components out, while affinity rules will be trying to keep core components together. |
Given the amount of conversation it's taking to understand the problem, I think your original suggestion may fit better of having one of these per every deployment in the chart. It will default to off, but can be enabled if required. values.yaml will have your overrides for images and configuration gateway:
image: ghcr.io/openfaas/gateway:latest values-topology.yaml, would look like this: gateway:
affinity: {...} # [proposed] overrides global
topologySpreadConstraints: # [proposed] overrides global
- ...
labelSelector:
app: gateway
- ...
labelSelector:
app: gateway
faasIdler:
affinity: {...} # [proposed] overrides global
topologySpreadConstraints: # [proposed] overrides global
- ...
labelSelector:
app: faas-idler
- ...
labelSelector:
app: faas-idler To make this happen, do we just need to output whatever is put into the chart into the deployment and quote it, or are any additional more complex templating rules needed for dynamic fields? |
Should be as simple as passing it straight through. And yeah, keeping them separate is definitely simpler. There's an existing global rule for affinity, not sure if that would need to stick around. |
It's fairly normal for k8s deployments that live in a multi-zone cluster to have topology constraints or affinity rules, however it's fairly normal to have an affinity rule per deployment, as the label selectors for pods are typically different per-deployment.
Expected Behaviour
Be able to set anti-affinity and topology spread constraints per deployment.
Current Behaviour
Cannot set affinity per deployment, and topology spread constraints are not in the chart.
Are you a GitHub Sponsor (Yes/No?)
Check at: https://github.com/sponsors/openfaas
List All Possible Solutions and Workarounds
Affinity is currently a global option for what appears to be all deployments. Per-deployment overrides is a reasonable option.
Topology Spread Constraints are not currently in the chart, and I don't think a global configuration option would really even work, as the
labelSelector
would need to be different per-deployment.Which Solution Do You Recommend?
Make overrides for affinity and make topology spread configurable, not implementing a global variant.
Steps to Reproduce (for bugs)
Context
Trying to have HA for
faas-netes/openfaas
.Your Environment
FaaS-CLI version ( Full output from:
faas-cli version
):0.13.13
Docker version
docker version
(e.g. Docker 17.0.05 ):20.10.8
What version and distriubtion of Kubernetes are you using?
kubectl version
server v1.21.3
client v1.22.2
Operating System and version (e.g. Linux, Windows, MacOS):
MacOS
Link to your project or a code example to reproduce issue:
What network driver are you using and what CIDR? i.e. Weave net / Flannel
The text was updated successfully, but these errors were encountered: