-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Daemon redesign - using controller-runtime #788
base: master
Are you sure you want to change the base?
Conversation
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
9f5207e
to
7d4e946
Compare
e17b584
to
ddc70f2
Compare
929886d
to
9377404
Compare
79cc69b
to
dc6eb11
Compare
dc6eb11
to
50ad79a
Compare
50ad79a
to
f369b65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few minor comments. This PR hugely improves the config-daemon shape.
c := current.Status.DeepCopy().Interfaces | ||
d := desiredNodeState.Status.DeepCopy().Interfaces | ||
for idx := range d { | ||
// check if it's a new device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// check if it's a new device | |
if idx >= len(c) { | |
return true, nil | |
} | |
// check if it's a new device |
I don't know if it can ever happen (maybe if a new device is hot-plugged between reconcile loops). But it's safer to check for slice length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but in theory (not sure if it's possible) you can have a new device like hot-unplug and after that hot-plug so the len will be the same but the device is not no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in both case, the if idx >= len(c) {
can save a nil dereferencing. right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review
@@ -108,14 +108,7 @@ type DrainStateAnnotationPredicate struct { | |||
} | |||
|
|||
func (DrainStateAnnotationPredicate) Create(e event.CreateEvent) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about removing the Create fn ?
the default of predicate.Funcs is true.
is object being nil a real concern here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the concern is not nil. without this when a new node is added to the cluster the reconcile will not get called without this.
and if the reconcile is not created no one adds the labels to make the drain possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this when a new node is added to the cluster the reconcile will not get called without this
are you sure ? i thought by default it will call reconcile since predicate.Funcs contains implementation that will return true. which means to enqueue event for reconcile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I give it a try and I needed to add it. without it I was not able to get the create calls :(
I can give it another try
// creates the in-cluster config | ||
config, err = rest.InClusterConfig() | ||
func getOperatorConfig(kClient runtimeclient.Client) (*sriovnetworkv1.SriovOperatorConfig, error) { | ||
// Init feature gates once to prevent race conditions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unrelated comment
|
||
refreshCh := make(chan daemon.Message) | ||
defer close(refreshCh) | ||
func UseKubeletKubeConfig() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need to start with upper case ?
|
||
// Setup reconcile loop with manager | ||
if err = dm.SetupWithManager(mgr); err != nil { | ||
setupLog.Error(err, "unable to create setup daemon manager for SriovNetworkNodeState") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unable to setup daemon with manager for SriovNetworkNodeState
pkg/daemon/daemon.go
Outdated
} | ||
|
||
type Daemon struct { | ||
type DaemonReconcile struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DaemonReconciler ? or better yet, NodeReconciler ?
pkg/daemon/daemon.go
Outdated
ns := &sriovnetworkv1.SriovNetworkNodeState{} | ||
// init openstack info | ||
if vars.PlatformType == consts.VirtualOpenStack { | ||
ns, err = dn.HostHelpers.GetCheckPointNodeState() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we limit the usage of nodestate from checkpoint to just this block ? i.e define a new var.
it seems odd that other platform types we dont load from checkpoint during init (which initializes InitialState
global var
|
||
// save init state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // save init state if it does not exist on host
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the code I can see that we always write the file. I am missing something?
log.Log.Error(err, "nodeStateSyncHandler(): Failed to fetch node state", "name", vars.NodeName) | ||
return err | ||
reqLogger.Error(err, "failed to check systemd status unexpected error") | ||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont want to fail here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return an error it will get stuck in a loop as if the result for the systemd operator return an error we don't retry to not put the node into a bootloop
pkg/daemon/daemon.go
Outdated
<-dn.syncCh | ||
return nil | ||
} | ||
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: lets define requeueTime as a const
pkg/daemon/daemon.go
Outdated
return err | ||
} | ||
postNetworkServiceEnabled, err := dn.HostHelpers.IsServiceEnabled(systemd.SriovPostNetworkServicePath) | ||
// if there are no node drifted changes, and we are on the latest applied policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // if there are no host state drift changes, ...
pkg/daemon/daemon.go
Outdated
} | ||
reqDrain = reqDrain || systemdConfModified || !exist | ||
// require reboot if drain needed for systemd mode | ||
reqReboot = reqReboot || systemdConfModified || reqDrain || !exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can drop : || systemdConfModified
and || !exist
they are checked in reqDrain
pkg/daemon/daemon.go
Outdated
log.Log.Error(err, "nodeStateSyncHandler(): failed to write supported nic ids file for systemd mode") | ||
return err | ||
} | ||
func (dn *DaemonReconcile) checkSystemdStatus(ctx context.Context, desiredNodeState *sriovnetworkv1.SriovNetworkNodeState) (*systemd.SriovResult, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add docstring explaining the returned args ?
pkg/daemon/daemon.go
Outdated
|
||
// update the lastAppliedGeneration | ||
dn.lastAppliedGeneration = desiredNodeState.Generation | ||
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same as previous comment. lets use a const
pkg/daemon/daemon.go
Outdated
// return sriovResult, err | ||
//} | ||
|
||
// TODO: check if we need this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe we need.
in case systemd services were enabled, and there is error in sriovResult, we want to reflect that in status and stop reconcile. else we end up in a reboot loop no ? (which we dont want here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ykulazhenkov for some more feedback on this bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was moved out of this function so we should be ok I remove this
once we get this in, i think we should move the node controllers under |
@@ -7,7 +7,7 @@ contents: | | |||
|
|||
[Service] | |||
Type=oneshot | |||
ExecStart=/var/lib/sriov/sriov-network-config-daemon -v 2 --zap-log-level 2 service --phase pre | |||
ExecStart=/var/lib/sriov/sriov-network-config-daemon service --phase pre |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means reboot on update ? since systemd file changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it will make a reboot one time
hack/run-e2e-conformance.sh
Outdated
@@ -7,4 +7,4 @@ GOPATH="${GOPATH:-$HOME/go}" | |||
JUNIT_OUTPUT="${JUNIT_OUTPUT:-/tmp/artifacts}" | |||
export PATH=$PATH:$GOPATH/bin | |||
|
|||
${root}/bin/ginkgo -output-dir=$JUNIT_OUTPUT --junit-report "unit_report.xml" -v "$SUITE" -- -report=$JUNIT_OUTPUT | |||
${root}/bin/ginkgo --timeout 3h -output-dir=$JUNIT_OUTPUT --junit-report "unit_report.xml" -v "$SUITE" -- -report=$JUNIT_OUTPUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change related ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope I open a different PR for that :)
4ecc95a
to
80564fa
Compare
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
It's only a one time run script so it should not be a problem to run with debug logs. Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
DNM before squash Signed-off-by: Sebastian Sch <[email protected]>
80564fa
to
00587fb
Compare
Hi @zeeke @adrianchiris @ykulazhenkov can you please make another round of review here? |
No description provided.