
Bug: inferx_one container panics PanicHookInfo and restart-loops #14


Open · SorenDreano opened this issue May 7, 2025 · 3 comments

@SorenDreano

First of all, thank you very much for this contribution, which could massively improve the deployment of LLMs; it's a really exciting project.

Unfortunately, I did not manage to run it. On a clean GCP instance (specifically Vertex/Workbench) that I created to try InferX, with one A100 (recognized by nvidia-smi), I hit a Rust panic from node_config.rs, and I am not familiar enough with Rust to find the root cause myself.
I followed https://github.com/inferx-net/inferx/wiki/InferX-platform-0.1.0-deployment

root@f52e67f1b736:/# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"

I have tried with both CUDA 12.4 and 12.5.1.
On a side note, you might want to compile the binaries with the --remap-path-prefix flag (rust-lang/rust#41555), so that local build paths like the /home/brad/... ones visible in the backtrace below are not embedded in release binaries.

Installation

sudo su
cd /opt
wget https://github.com/inferx-net/inferx/releases/download/0.1.0/inferx.tar.gz
tar -zxvf inferx.tar.gz
nano /etc/docker/daemon.json

to write

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        },
        "inferx": {
            "path": "/opt/inferx/bin/inferx"
        }
    }
}
systemctl restart docker
cd
git clone https://github.com/inferx-net/inferx

(I could not use the clone command provided in the documentation, as I encounter the following error:

(base) root@soren-inferx:~# git clone [email protected]:inferx-net/inferx.git
Cloning into 'inferx'...
The authenticity of host 'github.com (20.205.243.166)' can't be established.
ECDSA key fingerprint is SHA256:p2QAMXNIC1TJYWeIOttrVc98/R1BUFWu3/LiyKgUfQM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'github.com,20.205.243.166' (ECDSA) to the list of known hosts.
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

)

The documented clone command uses the SSH URL, which requires an SSH key registered with GitHub; the HTTPS URL above works without one.

cd inferx/
make run
cd /opt/inferx/data
wget https://github.com/inferx-net/inferx/releases/download/0.1.0/postgres_keycloak.tar.gz
tar -zxvf postgres_keycloak.tar.gz
rm postgres_keycloak.tar.gz
cd
cd inferx/
make stop
make run

Small Fixable Error

After make run, the .env file is empty.
On my system, running the command from the Makefile directly in the shell gives:

(base) root@soren-inferx:~/inferx# LOCAL_IP=${hostname -I | awk '{print $$1}' | xargs}
bash: ${hostname -I | awk '{print $$1}' | xargs}: bad substitution

The output of the hostname -I command is

(base) root@soren-inferx:~/inferx# hostname -I
10.XX.XX.XX 172.17.0.1 172.18.0.1 172.19.0.1

and

hostname -I | awk '{print $$1}'
hostname -I | awk '{print $$1}' | xargs

both return an empty string. Run directly in bash, the single-quoted $$1 reaches awk unchanged, and awk treats it as an indirect field reference (effectively $10 here), which is empty; inside the Makefile, the doubled $ is Make's escaping and collapses to $1. I would recommend using

hostname -I | cut -d ' ' -f 1

which should work across all POSIX shells and with multiple addresses.
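For illustration, the .env generation would then look something like this (just a sketch, run from the repository root; inside the Makefile the dollar signs would need to be doubled again for Make's escaping):

echo "LOCAL_IP=$(hostname -I | cut -d ' ' -f 1)" > .env
cat .env    # e.g. LOCAL_IP=10.XX.XX.XX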

However, this is not the root cause of the problem: fixing this empty string or hardcoding 10.XX.XX.XX in the docker-compose.yml does not change the error I get in the logs.

Actual Error

(base) root@soren-inferx:~/inferx# curl localhost:81
<!doctype html>
<html lang=en>
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>

The output of

docker ps

shows that the inferx_one container is restart-looping:

a59d021268b2   inferx/inferx_one:v0.1.0                                                                       "/opt/nvidia/nvidia_…"   About a minute ago   Restarting (134) 26 seconds ago                                                                     inferx_one
8f3ade9a304c   inferx/inferx_dashboard:v0.1.0                                                                 "/bin/sh -c 'service…"   About a minute ago   Up About a minute (unhealthy)

The other containers are healthy.

Logs

The output of

docker logs inferx_one

is

==========
== CUDA ==
==========

CUDA Version 12.5.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Panic occurred: PanicHookInfo { payload: Any { .. }, location: Location { file: "qshare/src/node_config.rs", line: 172, col: 21 }, can_unwind: true, force_no_backtrace: false }
Backtrace:
   0: onenode::main::{{closure}}::{{closure}}
             at /home/brad/rust/ffly/qservice/onenode/one_main.rs:55:25
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/boxed.rs:2245:9
      std::panicking::rust_panic_with_hook
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:805:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:664:13
   3: std::sys::backtrace::__rust_end_short_backtrace
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/sys/backtrace.rs:170:18
   4: rust_begin_unwind
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:662:5
   5: core::panicking::panic_fmt
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/panicking.rs:74:14
   6: qshare::node_config::QletConfig::New
             at /home/brad/rust/ffly/qservice/qshare/src/node_config.rs:172:21
   7: <qshare::qlet::QLET_CONFIG as core::ops::deref::Deref>::deref::__static_ref_initialize
             at /home/brad/rust/ffly/qservice/qshare/src/qlet/mod.rs:27:46
      core::ops::function::FnOnce::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/ops/function.rs:250:5
   8: spin::once::Once<T>::call_once
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/spin-0.5.2/src/once.rs:110:50
   9: lazy_static::lazy::Lazy<T>::get
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lazy_static-1.4.0/src/core_lazy.rs:21:9
      <qshare::qlet::QLET_CONFIG as core::ops::deref::Deref>::deref::__stability
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lazy_static-1.4.0/src/lib.rs:142:21
      <qshare::qlet::QLET_CONFIG as core::ops::deref::Deref>::deref
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lazy_static-1.4.0/src/lib.rs:144:17
  10: onenode::main::{{closure}}
             at /home/brad/rust/ffly/qservice/onenode/one_main.rs:93:27
  11: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  12: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/park.rs:284:60
  13: tokio::task::coop::with_budget
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/task/coop/mod.rs:167:5
      tokio::task::coop::budget
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/task/coop/mod.rs:133:5
      tokio::runtime::park::CachedParkThread::block_on
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/park.rs:284:31
  14: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/context/blocking.rs:66:9
  15: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/scheduler/multi_thread/mod.rs:87:13
  16: tokio::runtime::context::runtime::enter_runtime
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/context/runtime.rs:65:16
  17: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/scheduler/multi_thread/mod.rs:86:9
  18: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/runtime.rs:370:45
  19: tokio::runtime::runtime::Runtime::block_on
             at /home/brad/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.44.2/src/runtime/runtime.rs:340:13
  20: onenode::main
             at /home/brad/rust/ffly/qservice/onenode/one_main.rs:112:5
  21: core::ops::function::FnOnce::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/ops/function.rs:250:5
  22: std::sys::backtrace::__rust_begin_short_backtrace
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/sys/backtrace.rs:154:18
  23: std::rt::lang_start::{{closure}}
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/rt.rs:164:18
  24: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/ops/function.rs:284:13
      std::panicking::try::do_call
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:554:40
      std::panicking::try
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:518:19
      std::panic::catch_unwind
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panic.rs:345:14
      std::rt::lang_start_internal::{{closure}}
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/rt.rs:143:48
      std::panicking::try::do_call
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:554:40
      std::panicking::try
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:518:19
      std::panic::catch_unwind
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panic.rs:345:14
      std::rt::lang_start_internal
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/rt.rs:143:20
  25: std::rt::lang_start
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/rt.rs:163:17
  26: main
  27: __libc_start_main
  28: _start

panicked at onenode/one_main.rs:49:54:

thread panicked while processing panic. aborting.

and the output of

docker logs inferx_dashboard

is

Starting nginx: nginx.
[2025-05-07 08:58:20 +0000] [24] [INFO] Starting gunicorn 23.0.0
[2025-05-07 08:58:20 +0000] [24] [INFO] Listening at: http://0.0.0.0:1250 (24)
[2025-05-07 08:58:20 +0000] [24] [INFO] Using worker: gevent
[2025-05-07 08:58:20 +0000] [33] [INFO] Booting worker with pid: 33
[2025-05-07 08:58:20 +0000] [34] [INFO] Booting worker with pid: 34
[2025-05-07 08:58:20 +0000] [35] [INFO] Booting worker with pid: 35
[2025-05-07 08:58:20 +0000] [36] [INFO] Booting worker with pid: 36
[2025-05-07 09:00:07,969] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
  File "/usr/local/lib/python3.10/site-packages/gevent/_socketcommon.py", line 590, in connect
    self._internal_connect(address)
  File "/usr/local/lib/python3.10/site-packages/gevent/_socketcommon.py", line 634, in _internal_connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 716, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4f9a0ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 802, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 594, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=4000): Max retries exceeded with url: /functions/// (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4f9a0ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/app.py", line 121, in wrapper
    return func(*args, **kwargs)
  File "/app.py", line 624, in ListFunc
    funcs = listfuncs("", "")
  File "/app.py", line 277, in listfuncs
    resp = requests.get(url, headers=headers)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=4000): Max retries exceeded with url: /functions/// (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4f9a0ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
@inferx-net
Owner

@SorenDreano Thank you!

I did some debugging of the issue and found that it happens because the inferx service can't get the local node IP address with the Rust call local_ip_address::list_afinet_netifas().

Could you please help confirm whether the compute node is a virtual machine or bare metal?

In the meantime, I will add a workaround for this issue, i.e. enable the inferx service to get the local IP address from an environment variable so that the user can configure it manually. This is also a feature needed to support k8s-based deployment. I will let you know when it is ready.

And if it is a virtual machine, there will be another problem. InferX is better run on bare metal because it relies on Linux KVM to start its VM-based secure container runtime. If only a virtual machine is available, we have to enable "nested virtualization" as described at https://cloud.google.com/compute/docs/instances/nested-virtualization/overview, but this will decrease performance.
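As a quick check on your side (standard Linux commands, nothing InferX-specific), you can verify whether KVM is usable inside the instance:

ls -l /dev/kvm                       # only present when KVM can be used in this environment
grep -c -E 'vmx|svm' /proc/cpuinfo   # non-zero when the (v)CPU exposes virtualization extensions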

@inferx-net
Owner

@SorenDreano I have fixed this issue by adding a "LOCAL_IP" environment variable. The fix is in the LOCAL_IP branch.

The code fix is:

@echo "LOCAL_IP=$$(hostname -I | awk '{print $$1}' | xargs)" > .env
and
- POD_IP=${LOCAL_IP}

Could you please give it a try on the LOCAL_IP branch?

Please run "make run" from the repository root.

@SorenDreano
Author

SorenDreano commented May 12, 2025

Hello,

With your recent changes, it works much better: I no longer have any restart-looping containers and I can access the dashboard. I confirm that I am using a virtualized environment, not bare metal.

I will try to set up a few models.

Kind regards
