Commit f7a3afb

fix: update tsdb installation, pod exec role, gpu model info (#77)
1 parent a1e1fb5 commit f7a3afb

File tree

8 files changed: +209 −95 lines changed

README.md

+14-31
@@ -49,29 +49,11 @@ WIP
 - [Getting Started on VM](https://tensor-fusion.ai/guide/deployment-vm)
 - [Deploy Self-hosted Community Edition](https://tensor-fusion.ai/guide/self-host)
 
-### Try it out
+<!-- (TODO: Asciinema) -->
 
-- Explore the demo account: [Demo Console - Working in progress](https://app.tensor-fusion.ai?hint=demo)
+<!-- ### Playground
 
-- Run following command to try TensorFusion in 3 minutes
-
-```bash
-# Step 1: Install TensorFusion in Kubernetes
-helm install --repo https://nexusgpu.github.io/tensor-fusion/ --create-namespace
-
-# Step 2. Onboard GPU nodes into TensorFusion cluster
-kubectl apply -f https://raw.githubusercontent.com/NexusGPU/tensor-fusion/main/manifests/gpu-node.yaml
-
-# Step 3. Check if cluster and pool is ready
-kubectl get gpupools -o wide && kubectl get gpunodes -o wide
-
-# Step3. Create an inference app using virtual, remote GPU resources in TensorFusion cluster
-kubectl apply -f https://raw.githubusercontent.com/NexusGPU/tensor-fusion/main/manifests/inference-app.yaml
-
-# Then you can forward the port to test inference, or exec shell
-```
-
-(TODO: Asciinema)
+- Explore the demo account: [Demo Console - Working in progress](https://app.tensor-fusion.ai?hint=demo) -->
 
 ### 💬 Discussion
 
@@ -87,28 +69,29 @@ kubectl apply -f https://raw.githubusercontent.com/NexusGPU/tensor-fusion/main/m
 ### Core GPU Virtualization Features
 
 - [x] Fractional GPU and flexible oversubscription
-- [x] GPU-over-IP, remote GPU sharing with less than 4% performance loss
-- [x] GPU VRAM expansion or swap to host RAM
+- [x] Remote GPU sharing with SOTA GPU-over-IP technology, less than 4% performance loss
+- [x] GPU VRAM expansion and hot/warm/cold tiering
 - [ ] None NVIDIA GPU/NPU vendor support
 
 ### Pooling & Scheduling & Management
 
 - [x] GPU/NPU pool management in Kubernetes
-- [x] GPU-first resource scheduler based on virtual TFlops/VRAM capacity
-- [x] GPU-first auto provisioning and bin-packing
+- [x] GPU-first scheduling and allocation, with single TFlops/MB precision
+- [x] GPU node auto provisioning/termination
+- [x] GPU compaction/bin-packing
 - [x] Seamless onboarding experience for Pytorch, TensorFlow, llama.cpp, vLLM, Tensor-RT, SGlang and all popular AI training/serving frameworks
-- [x] Basic management console and dashboards
-- [ ] Basic autoscaling policies, auto set requests/limits/replicas
-- [ ] GPU Group scheduling for LLMs
+- [x] Centralized Dashboard & Control Plane
+- [ ] GPU-first autoscaling policies, auto set requests/limits/replicas
+- [ ] Request multiple vGPUs with group scheduling for large models
 - [ ] Support different QoS levels
 
 ### Enterprise Features
 
-- [x] GPU live-migration, fastest in the world
-- [ ] Preloading and P2P distribution of container images, AI models, GPU snapshots etc.
+- [x] GPU live-migration, snapshot/distribute/restore GPU context cross cluster, fastest in the world
+- [ ] AI model registry and preloading, build your own private MaaS(Model-as-a-Service)
 - [ ] Advanced auto-scaling policies, scale to zero, rebalance of hot GPUs
 - [ ] Advanced observability features, detailed metrics & tracing/profiling of CUDA calls
-- [ ] Multi-tenancy billing based on actual usage
+- [ ] Monetization your GPU cluster by multi-tenancy usage measurement & billing report
 - [ ] Enterprise level high availability and resilience, support topology aware scheduling, GPU node auto failover etc.
 - [ ] Enterprise level security, complete on-premise deployment support, encryption in-transit & at-rest
 - [ ] Enterprise level compliance, SSO/SAML support, advanced audit, ReBAC control, SOC2 and other compliance reports available

charts/tensor-fusion/Chart.yaml

+1-1
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 1.2.8
+version: 1.2.9
 
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
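Note: with the chart version bumped to 1.2.9, an existing install can be rolled forward with a standard Helm upgrade. A minimal sketch follows; the repo URL comes from the README quickstart above, while the release name and namespace are placeholders, not values from this commit.

```bash
# Hypothetical upgrade to pick up chart 1.2.9; the release name "tensor-fusion"
# and namespace "tensor-fusion-sys" are assumptions, adjust to your install.
helm upgrade --install tensor-fusion tensor-fusion \
  --repo https://nexusgpu.github.io/tensor-fusion/ \
  --version 1.2.9 \
  --namespace tensor-fusion-sys --create-namespace
```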

charts/tensor-fusion/templates/gpu-public-gpu-info.yaml

+14-14
@@ -29,32 +29,32 @@ data:
 
 # Ampere Architecture Series
 - model: A100_SXM4
-  fullModelName: "A100 SXM4"
+  fullModelName: "NVIDIA A100 SXM4"
   vendor: NVIDIA
   costPerHour: 1.89
   fp16TFlops: 312
 
 - model: A100_PCIe
-  fullModelName: "A100 PCIe"
+  fullModelName: "NVIDIA A100 PCIe"
   vendor: NVIDIA
   costPerHour: 1.64
   fp16TFlops: 312
 
 - model: A10
-  fullModelName: "A10"
+  fullModelName: "NVIDIA A10"
   vendor: NVIDIA
   costPerHour: 0.9
   fp16TFlops: 125
 
 # A10G has less CUDA core than A10, but with RT cores for rendering case
 - model: A10G
-  fullModelName: "A10G"
+  fullModelName: "NVIDIA A10G"
   vendor: NVIDIA
   costPerHour: 0.75 # from lambda labs
-  fp16TFlops: 125
+  fp16TFlops: 63
 
 - model: A40
-  fullModelName: "A40"
+  fullModelName: "NVIDIA A40"
   vendor: NVIDIA
   costPerHour: 0.44
   fp16TFlops: 125
@@ -67,22 +67,22 @@ data:
 
 # Ada Lovelace Architecture Series
 - model: L4
-  fullModelName: "L4"
+  fullModelName: "NVIDIA L4"
   vendor: NVIDIA
   costPerHour: 0.43
   fp16TFlops: 121
 
 - model: L40
-  fullModelName: "L40"
+  fullModelName: "NVIDIA L40"
   vendor: NVIDIA
   costPerHour: 0.86 # should be a bit cheaper than L40s
-  fp16TFlops: 362
+  fp16TFlops: 181
 
 - model: L40s
-  fullModelName: "L40s"
+  fullModelName: "NVIDIA L40s"
   vendor: NVIDIA
   costPerHour: 0.86
-  fp16TFlops: 362
+  fp16TFlops: 181
 
 - model: RTX4090
   fullModelName: "RTX4090"
@@ -92,20 +92,20 @@
 
 # Hopper Architecture Series
 - model: H100_SXM4
-  fullModelName: "H100 SXM4"
+  fullModelName: "NVIDIA H100 SXM4"
   vendor: NVIDIA
   costPerHour: 2.99
   fp16TFlops: 989
 
 - model: H100_PCIe
-  fullModelName: "H100 PCIe"
+  fullModelName: "NVIDIA H100 PCIe"
   vendor: NVIDIA
   costPerHour: 2.39
   fp16TFlops: 835
 
 # Blackwell Architecture Series
 - model: B200_SXM4
-  fullModelName: "B200 SXM4"
+  fullModelName: "NVIDIA B200 SXM4"
   vendor: NVIDIA
   costPerHour: 10.99 # unknown price,on-request
   fp16TFlops: 2250
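Note: every entry in this list follows the same five-key schema (model, fullModelName, vendor, costPerHour, fp16TFlops), so the corrected numbers can be spot-checked by rendering the chart locally and filtering on a model name. A minimal sketch, run from the repo root; `helm template --show-only` is standard Helm, and the grep window is just an assumption about entry length.

```bash
# Render only the GPU info template and print the L40s entry to confirm
# fp16TFlops now reads 181 (the -A 4 window covers the four fields after "model:").
helm template charts/tensor-fusion \
  --show-only templates/gpu-public-gpu-info.yaml \
  | grep -A 4 'model: L40s'
```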
@@ -1,60 +1,172 @@
 {{- if .Values.greptime.installStandalone }}
-# NOTICE: make sure greptimedb operator had been installed in your test cluster
+# NOTICE: make sure greptimedb operator had been installed in your cluster if not enable 'installStandalone'
 # cloud mode is recommended to reduce the maintenance effort
 # ```bash
 # helm repo add greptime https://greptimeteam.github.io/helm-charts/
 # helm repo update
 # helm install greptimedb-operator greptime/greptimedb-operator -n greptimedb --create-namespace
 # ```
-apiVersion: greptime.io/v1alpha1
-kind: GreptimeDBStandalone
+---
+apiVersion: v1
+kind: Namespace
 metadata:
   name: greptimedb
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: greptimedb-standalone
+  namespace: greptimedb
+data:
+  config.toml: |
+    [logging]
+    dir = "/data/greptimedb/logs"
+    level = "info"
+    log_format = "text"
+
+    [storage]
+    data_home = "/data/greptimedb"
+
+    [wal]
+    dir = "/data/greptimedb/wal"
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: greptimedb-standalone
+  namespace: greptimedb
+  labels:
+    app.greptime.io/component: greptimedb-standalone
+spec:
+  selector:
+    app.greptime.io/component: greptimedb-standalone
+  ports:
+    - name: grpc
+      port: 4001
+      targetPort: 4001
+    - name: http
+      port: 4000
+      targetPort: 4000
+    - name: mysql
+      port: 4002
+      targetPort: 4002
+    - name: postgres
+      port: 4003
+      targetPort: 4003
+---
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: greptimedb-standalone
   namespace: greptimedb
+  labels:
+    app.greptime.io/component: greptimedb-standalone
 spec:
-  base:
-    main:
-      image: docker.io/greptime/greptimedb:latest
-      livenessProbe:
-        failureThreshold: 10
-        httpGet:
-          path: /health
-          port: 4000
-        periodSeconds: 5
-      readinessProbe:
-        failureThreshold: 10
-        httpGet:
-          path: /health
-          port: 4000
-        periodSeconds: 5
-      resources: {}
-      startupProbe:
-        failureThreshold: 60
-        httpGet:
-          path: /health
-          port: 4000
-        periodSeconds: 5
-  datanodeStorage:
-    dataHome: /data/greptimedb
-    fs:
-      mountPath: /data/greptimedb
-      name: datanode
-      storageRetainPolicy: Retain
-      storageSize: 20Gi
-  httpPort: 4000
-  logging:
-    format: text
-    level: info
-    logsDir: /data/greptimedb/logs
-    onlyLogToStdout: false
-    persistentWithData: false
-  mysqlPort: 4002
-  postgreSQLPort: 4003
-  rollingUpdate:
-    maxUnavailable: 1
-    partition: 0
-  rpcPort: 4001
-  service:
-    type: ClusterIP
-  version: latest
+  replicas: 1
+  selector:
+    matchLabels:
+      app.greptime.io/component: greptimedb-standalone
+  template:
+    metadata:
+      labels:
+        app.greptime.io/component: greptimedb-standalone
+    spec:
+      volumes:
+        - name: logs
+          emptyDir: {}
+        - name: config
+          configMap:
+            name: greptimedb-standalone
+            defaultMode: 420
+      containers:
+        - name: standalone
+          image: docker.io/greptime/greptimedb:latest
+          args:
+            - standalone
+            - start
+            - '--rpc-bind-addr'
+            - 0.0.0.0:4001
+            - '--mysql-addr'
+            - 0.0.0.0:4002
+            - '--http-addr'
+            - 0.0.0.0:4000
+            - '--postgres-addr'
+            - 0.0.0.0:4003
+            - '--config-file'
+            - /etc/greptimedb/config.toml
+          ports:
+            - name: grpc
+              containerPort: 4001
+              protocol: TCP
+            - name: http
+              containerPort: 4000
+              protocol: TCP
+            - name: mysql
+              containerPort: 4002
+              protocol: TCP
+            - name: postgres
+              containerPort: 4003
+              protocol: TCP
+          resources: {}
+          volumeMounts:
+            - name: datanode
+              mountPath: /data/greptimedb
+            - name: logs
+              mountPath: /data/greptimedb/logs
+            - name: config
+              mountPath: /etc/greptimedb
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 4000
+              scheme: HTTP
+            timeoutSeconds: 1
+            periodSeconds: 5
+            successThreshold: 1
+            failureThreshold: 10
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 4000
+              scheme: HTTP
+            timeoutSeconds: 1
+            periodSeconds: 5
+            successThreshold: 1
+            failureThreshold: 10
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 4000
+              scheme: HTTP
+            timeoutSeconds: 1
+            periodSeconds: 5
+            successThreshold: 1
+            failureThreshold: 60
+          imagePullPolicy: Always
+      restartPolicy: Always
+      terminationGracePeriodSeconds: 30
+      dnsPolicy: ClusterFirst
+  volumeClaimTemplates:
+    - kind: PersistentVolumeClaim
+      apiVersion: v1
+      metadata:
+        name: datanode
+        creationTimestamp: null
+      spec:
+        accessModes:
+          - ReadWriteOnce
+        resources:
+          requests:
+            storage: 20Gi
+        volumeMode: Filesystem
+  serviceName: ''
+  podManagementPolicy: OrderedReady
+  updateStrategy:
+    type: RollingUpdate
+    rollingUpdate:
+      partition: 0
+  revisionHistoryLimit: 10
+  persistentVolumeClaimRetentionPolicy:
+    whenDeleted: Retain
+    whenScaled: Retain
 {{- end }}
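Note: the standalone instance defined above exposes the same /health endpoint its probes use, so readiness can be checked from outside the cluster with a port-forward. A minimal sketch, assuming kubectl access; the service name, namespace, and port come from the manifest above.

```bash
# Forward the standalone service's HTTP port and hit the /health endpoint
# used by the liveness/readiness/startup probes.
kubectl -n greptimedb port-forward svc/greptimedb-standalone 4000:4000 &
sleep 2   # give the port-forward a moment to establish
curl -sf http://localhost:4000/health && echo "greptimedb-standalone is healthy"
```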
