Bug: inferx_one container panics PanicHookInfo and restart-loops #14
@SorenDreano Thank you! I did some debugging and found that the issue is caused by the inferx service failing to get the local node IP address through the Rust call local_ip_address::list_afinet_netifas(). Could you please confirm whether the compute node is a virtual machine or bare metal? In the meantime, I will add a workaround, i.e. enable the inferx service to read the local IP address from an environment variable so that the user can configure it manually. This is also a feature needed to support k8s-based deployment. I will let you know when it is ready.

If it is a virtual machine, there will be another problem. InferX runs best on bare metal because it relies on Linux KVM to start the InferX virtual-machine-based secure container runtime. If only a virtual machine is available, we have to enable "nested virtualization" as described in https://cloud.google.com/compute/docs/instances/nested-virtualization/overview, but this will decrease performance.
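For anyone hitting the same panic, two quick checks on the node help narrow this down (generic Linux diagnostics, not InferX tooling):

```bash
# List the global (non-loopback) IPv4 addresses on the node; this is roughly
# the information local_ip_address::list_afinet_netifas() tries to enumerate.
ip -4 -o addr show scope global

# Check whether KVM is usable. On a virtual machine /dev/kvm is normally
# missing unless nested virtualization has been enabled.
ls -l /dev/kvm
grep -c -E 'vmx|svm' /proc/cpuinfo
```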
@SorenDreano I have fixed this issue by adding a "LOCAL_IP" environment variable. The fix is in the LOCAL_IP branch; the code change is at Line 60 in 0acf1f3 and Line 57 in 0acf1f3.
Could you please give the LOCAL_IP branch a try? Please run "make run" under the root folder.
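For reference, a minimal way to try it, assuming LOCAL_IP is picked up from the shell environment when make run starts the stack (the exact wiring may differ from the actual fix):

```bash
# Check out the fix branch.
git fetch origin
git checkout LOCAL_IP

# Use the node's primary address, or set the value by hand (e.g. LOCAL_IP=10.0.0.5).
export LOCAL_IP=$(hostname -I | awk '{print $1}')
make run
```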
Hello, with your recent changes it works much better: I no longer have any boot-looping containers and I can access the dashboard. I confirm that I am indeed using a virtualized environment and not a bare-metal one. I will try to set up a few models. Kind regards
First of all, thank you very much for this contribution, which could massively improve the deployment of LLMs; it's a really exciting project.
Sadly, I did not manage to run it. On a clean GCP instance (specifically Vertex/Workbench) that I created to try inferx, with 1 A100 (recognized by nvidia-smi), I encounter a Rust panic from node_config.rs, and I am not familiar enough with Rust to find the root cause myself.
I have followed https://github.com/inferx-net/inferx/wiki/InferX-platform-0.1.0-deployment
I have tried with both Cuda 12.4 and 12.5.1
On a side note, you might want to compile the binaries with the flag from rust-lang/rust#41555.
Installation
```bash
sudo su
cd /opt
wget https://github.com/inferx-net/inferx/releases/download/0.1.0/inferx.tar.gz
tar -zxvf inferx.tar.gz
nano /etc/docker/daemon.json
```
to write the Docker daemon configuration.
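A typical daemon.json when the NVIDIA container runtime should be the default looks roughly like this (illustrative only; the deployment wiki's instructions take precedence):

```bash
# Hypothetical example of the edit made with nano above.
cat > /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
```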
```bash
systemctl restart docker
cd
git clone https://github.com/inferx-net/inferx
```
(I cannot use the command provided in the documentation, as I encounter the following error:
)
Small Fixable Error
After make run, the .env file is empty.
On my system, using the command in the Makefile
The output of the hostname -I command is
```
(base) root@soren-inferx:~/inferx# hostname -I
10.XX.XX.XX 172.17.0.1 172.18.0.1 172.19.0.1
```
and
both return an empty string. I would recommend using
which should work across all POSIX shells and with multiple addresses.
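For instance, a command along these lines fits that description (one possible option, not necessarily the exact one that was suggested):

```bash
# Take only the first address reported by hostname -I; awk is POSIX,
# so this does not rely on bash-specific string handling.
hostname -I | awk '{print $1}'
```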
However, this is not the root cause of the problem, as fixing this empty string or hardcoding 10.XX.XX.XX in the docker-compose.yml does not change the error I obtain in the logs.
Actual Error
The output of
shows that the container inferx_one boot-loops; the other containers are healthy.
Logs
The output of
is
and the output of
is
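Commands along the following lines show the restart loop and capture the panic from the failing container (a generic sketch, not necessarily the exact commands used above):

```bash
# Container status: inferx_one keeps restarting while the others stay healthy.
docker ps --format 'table {{.Names}}\t{{.Status}}'

# Tail the failing container's log to see the panic output.
docker logs --tail 200 inferx_one
```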