Guest agent sometimes doesn't start (or is unreachable from host agent) #2064
FWIW, I've been able to (sometimes) reproduce this with the default (Ubuntu-based) image, just using a script to start and stop the instance in a loop:
#!/usr/bin/env bash
set -o errexit
machine="$(limactl list --format '{{.Name}}')"
while true; do
limactl list
status="$(limactl list "${machine}" --format '{{ .Status }}')"
case "$status" in
Stopped) limactl --debug start "${machine}";;
Running) limactl stop "${machine}";;
Broken) exit;;
esac
sleep 1
done

Eventually the failure reproduces. This was at 04e3886. The VM was created with the default template. The guest agent log was checked with:

sudo journalctl -b -u lima-guestagent
Both @mook-as and I have run a bisect with the following script:
#!/usr/bin/env bash
mkdir -p bisect
make binaries
machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0
while [[ "$count" -lt 100 ]]; do
limactl list
printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
status="$(limactl list "${machine}" --format '{{ .Status }}')"
case "$status" in
Stopped)
if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
count=$((count+1))
else
if grep --quiet 'guest agent does not seem to be running' "bisect/${commit}.log"; then
exit 1
fi
errors=$((errors+1))
if [[ "$errors" -gt 10 ]]; then
exit 125
fi
fi
;;
Running) limactl stop "${machine}" ;;
esac
done

Bisecting between v0.18.0 and v0.19.0, for both of us (on M1 machines) the bisect identified the "hostagent: avoid requiring …" commit. I tried adding …

@AkihiroSuda Any ideas?

PS: Yes, the bisect script is explicitly looking for the error message that was introduced in the PR that was eventually found. But the old code never indicated that there was a problem communicating with the guestagent. Note that the guestagent is running; it is the code trying to communicate with it that is failing.
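For reference, a script like this is normally driven with git bisect run; a minimal sketch, assuming the script above is saved as bisect.sh (the path is an assumption, the tags come from the range mentioned above):

git bisect start v0.19.0 v0.18.0   # mark v0.19.0 as bad and v0.18.0 as good
git bisect run ./bisect.sh          # exit 0 = good, 1-124 = bad, 125 = skip this commit
git bisect reset                    # return to the original checkout when finished

The exit 125 in the script maps onto git bisect's "skip" convention for commits that cannot be judged, while exit 1 marks the commit as bad.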
@jandubois Are you able to view the guestagent logs? Are any connection-closed-style errors thrown? This might also be due to the use of serialport as the channel; maybe some error cases are not handled properly.
Yes, I can see the logs. Now it happened to me for the first time on a fresh start:
$ l start
? Creating an instance "default" Proceed with the current configuration
INFO[0000] Starting the instance "default" with VM driver "qemu"
INFO[0000] QEMU binary "/opt/homebrew/bin/qemu-system-aarch64" seems properly signed with the "com.apple.security.hypervisor" entitlement
[...]
INFO[0099] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
INFO[0099] [hostagent] The final requirement 1 of 1 is satisfied
ERRO[0099] [guest agent does not seem to be running; port forwards will not work]
WARN[0099] DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see "/Users/jan/.lima/default/ha.stderr.log")
FATA[0099] degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:60022}

The guestagent log indicates that the agent was stopped and started again, which feels odd:

$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:21:14 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:21:14 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="event tick: 3s"
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

I don't think this is always the case, but I will verify this. There is no error in …
Have reproduced with the bisect script (just running against …):

$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:39:44 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
Jan 02 11:39:48 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:39:48 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Consumed 5.085s CPU time.
Jan 02 11:39:48 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

This time the first run of the guest agent gets to emit at least the 2 startup log lines, but otherwise looks the same.
But running things on Alpine (again with …):

$ l shell alpine sudo rc-status | grep guestagent
lima-guestagent [ started 00:02:14 (0) ]
$ l shell alpine sudo cat /var/log/lima-guestagent.log
time="2024-01-02T19:49:02Z" level=info msg="event tick: 3s"
time="2024-01-02T19:49:02Z" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

I guess it is possible that the first run of the guest agent didn't output anything, but I think the …

@balajiv113 It is pretty straightforward to repro the issue: you must have just a single instance, and then run the script from #2064 (comment), which will stop and restart the instance until it breaks. Or edit the script to hard-code the instance name (a sketch follows below) if you have other instances that you need to keep. Failure typically happens in the first 5-10 iterations, but sometimes takes longer.
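A minimal sketch of such a hard-coded variant, assuming the instance to cycle is named "default" (the name is an assumption; the commands mirror the loop above):

#!/usr/bin/env bash
# Same start/stop loop as the script above, but with the instance name
# hard-coded ("default" is an assumption) so other instances are left alone.
set -o errexit
machine="default"
while true; do
  status="$(limactl list "${machine}" --format '{{ .Status }}')"
  case "$status" in
    Stopped) limactl --debug start "${machine}" ;;
    Running) limactl stop "${machine}" ;;
    Broken)  exit 1 ;;
  esac
  sleep 1
done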
Possible, but in my testing simply reverting 482678b (and installing …)
@balajiv113 I was wrong; the problem does indeed seem to be due to the use of serialport as a channel, and not related to the updated readiness test. The bisect script was not testing correctly for the error condition. I've since re-run the bisect between 0.18.0 and 0.19.0 with QEMU and the following updated script:
#!/usr/bin/env bash
mkdir -p bisect
make binaries
machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0
while [[ "$count" -lt 100 ]]; do
limactl list
printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
status="$(limactl list "${machine}" --format '{{ .Status }}')"
case "$status" in
Stopped)
if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
count=$((count+1))
sleep 7
if [ "$machine" = "alpine" ]; then
limactl shell --workdir / "${machine}" sudo timeout 10 nc -l -p 4444 -s 0.0.0.0 2>/dev/null
else
limactl shell --workdir / "${machine}" sudo timeout 10 nc -l 4444
fi
sleep 7
if ! grep --quiet '0.0.0.0:4444' "$HOME/.lima/${machine}/ha.stderr.log"; then
exit 1
fi
else
exit 125
fi
;;
Running) limactl stop "${machine}" ;;
esac
done

Instead of relying on indirect evidence of the guest agent running, the script opens port 4444 inside the instance and then checks that the hostagent log (ha.stderr.log) records a forward for 0.0.0.0:4444.

This bisect identifies f947de0 as the "first bad commit" causing the script to fail. So far I've still not seen any tests fail with VZ, just with QEMU, so only the serial port code seems to be affected.
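The same check can also be done by hand; a rough sketch, assuming an Ubuntu-based instance named "default" (the instance name is an assumption; the port and log path follow the script above):

# In the guest: listen on port 4444 for a few seconds so the guest agent can report the new port.
limactl shell --workdir / default sudo timeout 10 nc -l 4444
# On the host: the hostagent should have logged the corresponding forward in ha.stderr.log.
grep '0.0.0.0:4444' "$HOME/.lima/default/ha.stderr.log"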
It seems like the fix has been merged. Is there a timeline for a new release so it can be pulled by …?
Released v0.20. Will be available on Homebrew in a couple of hours.
I use Lima 0.21.0 but still have the problem.
Same here, also with Lima 0.21.0.
Hi @jandubois, I'm on macOS Sonoma 14.4.1, arm64.
$ uname -a
Darwin MACHINENAME 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64
@dubcl Please confirm whether you are using QEMU or VZ.
Sorry, QEMU:
$ qemu-system-x86_64 -version
QEMU emulator version 8.2.1
Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).

Just for completeness: are you running the …? And does the guestagent disconnect right away, only after the machine goes to sleep, or just after it has been running for a while? Does this happen with an empty instance, or only after you deploy some workload on it?

Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps to reproduce the issue more reliably would help someone investigate.
I am using Linux (on macOS).
I already use 0.21; I think this happens when updating from 0.20 to 0.21.
Yes, I guess. This is my config:
# This template requires Lima v0.7.0 or later.
arch: "x86_64"
cpus: 4
memory: "6GiB"
vmType: "qemu"
firmware:
legacyBIOS: true
images:
# Try to use release-yyyyMMdd image if available. Note that release-yyyyMMdd will be removed after several months.
- location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-amd64.img"
arch: "x86_64"
digest: "sha256:5ad255d32a30a2cda9f0df19f0a6ce8d6f3c81b63845086a4cb5c43cf97fcb92"
- location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-arm64.img"
arch: "aarch64"
digest: "sha256:af62ca6ba307388f7e0a8ad1c46103e6aea0130a09122e818df8d711637bf998"
# Fallback to the latest release image.
# Hint: run `limactl prune` to invalidate the cache
- location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-amd64.img"
arch: "x86_64"
- location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-arm64.img"
arch: "aarch64"
mounts:
- location: "~"
- location: "/tmp/lima"
writable: true
I'm not sure; I never put it to sleep. It is always on because I mostly use it as a desktop.
I noticed it when I tried to do a port forward for an nginx image that I was testing.
No problem, glad to help. Additionally, to test, I removed the default folder and started a "new default" with the same default.yaml, and everything works OK. If I get the error again, I will come back and comment.
Just got this error on v0.1.0.
Well, at least not initially.
[Update: it seems to start, but hostagent cannot connect to it]
We've observed this multiple times with Rancher Desktop after switching to Lima 0.19.0 (rancher-sandbox/rancher-desktop#6151).
AFAIK, this was always with QEMU and not VZ, and always when restarting an existing instance, and not when creating a new one.
Rancher Desktop is using Alpine; I have not seen it with other distros, but that doesn't mean much.
I don't think guest-agent logs are written to file on Alpine; I will need to look into fixing that.
I've run into it twice today on different machines, but there don't seem to be any steps to reliably reproduce the problem.