FLARE implementation #143

prometherion · 2025-08-14T09:44:56Z

No description provided.

Signed-off-by: Dario Tranchitella <[email protected]>

rmedina97 · 2025-08-28T08:52:57Z

I tried testing this but noticed that the Flavor doesn’t reflect some values from the FLARE annotations (GPU count and GPU memory stay at 0). Are the annotations the same as in this repository https://github.com/clastix/flare? For testing, should I follow the quickstart from that repo, or are there specific steps/examples for this PR to make validation easier?

prometherion · 2025-08-28T13:01:01Z

> Are the annotations the same as in this repository https://github.com/clastix/flare?

Yes, I implemented the required code following that documentation. Below are the node annotations and the resulting Flavor.

// kubectl get nodes fluidos-provider-2-worker -ojsonpath='{.metadata.annotations}'
{
  "cost.fluidos.eu/currency": "EUR",
  "cost.fluidos.eu/hourly-rate": "2.1",
  "gpu.fluidos.eu/architecture": "ampere",
  "gpu.fluidos.eu/clock-speed": "1.80G",
  "gpu.fluidos.eu/compute-capability": "8.6",
  "gpu.fluidos.eu/cores": "10752",
  "gpu.fluidos.eu/count": "8",
  "gpu.fluidos.eu/dedicated": "true",
  "gpu.fluidos.eu/fp32-tflops": "38.7",
  "gpu.fluidos.eu/interconnect": "nvlink",
  "gpu.fluidos.eu/interconnect-bandwidth-gbps": "600",
  "gpu.fluidos.eu/interruptible": "false",
  "gpu.fluidos.eu/memory-per-gpu": "48Gi",
  "gpu.fluidos.eu/model": "nvidia-a6000",
  "gpu.fluidos.eu/multi-gpu-efficiency": "0.85",
  "gpu.fluidos.eu/sharing-capable": "false",
  "gpu.fluidos.eu/sharing-strategy": "none",
  "gpu.fluidos.eu/tier": "standard",
  "gpu.fluidos.eu/topology": "ring",
  "gpu.fluidos.eu/vendor": "nvidia",
  "kubeadm.alpha.kubernetes.io/cri-socket": "unix:///run/containerd/containerd.sock",
  "location.fluidos.eu/region": "us-east-1",
  "location.fluidos.eu/zone": "zone-a",
  "network.fluidos.eu/bandwidth-gbps": "25",
  "network.fluidos.eu/latency-ms": "5",
  "network.fluidos.eu/tier": "standard",
  "node.alpha.kubernetes.io/ttl": "0",
  "provider.fluidos.eu/name": "cloud-provider-2",
  "provider.fluidos.eu/preemptible": "false",
  "volumes.kubernetes.io/controller-managed-attach-detach": "true",
  "workload.fluidos.eu/graphics-score": "0.95",
  "workload.fluidos.eu/hpc-score": "0.80",
  "workload.fluidos.eu/inference-score": "0.90",
  "workload.fluidos.eu/training-score": "0.85"
}

// kubectl get flavors.nodecore.fluidos.eu fluidos.eu-k8slice-89ad -ojsonpath='{.spec.flavorType.typeData.characteristics.gpu}'|jq
{
  "architecture": "ampere",
  "clock_speed": "1800M",
  "compute_capability": "8.6",
  "cores": "86016",
  "count": 8,
  "dedicated": true,
  "fp32_tflops": 38.7,
  "graphics_score": 0.95,
  "hourly_rate": 2.1,
  "hpc_score": 0.8,
  "inference_score": 0.9,
  "interconnect": "nvlink",
  "interconnect_bandwidth": "600",
  "memory": "384Gi",
  "model": "nvidia-a6000",
  "multi_gpu_efficiency": "0.85",
  "network_bandwidth": "25",
  "network_latency_ms": 5,
  "network_tier": "standard",
  "provider": "cloud-provider-2",
  "region": "zone-a",
  "sharing_strategy": "none",
  "tier": "standard",
  "topology": "ring",
  "training_score": 0.85,
  "vendor": "nvidia",
  "zone": "us-east-1"
}
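Comparing the two dumps above, the per-GPU annotations appear to be scaled by `gpu.fluidos.eu/count` in the Flavor: 10752 cores × 8 = 86016, 48Gi × 8 = 384Gi, and the clock speed `1.80G` is rendered as the equivalent `1800M`. The following is a minimal sketch in Python that checks this arithmetic. It is not the PR's code: `parse_quantity` and `expected_totals` are hypothetical helpers, the suffix tables cover only a small subset of Kubernetes quantity notation, and the scaling rule is an assumption inferred from the values shown here.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes, enough for the values in this thread.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text):
    """Parse a quantity string such as '48Gi' or '1.80G' into a Decimal."""
    # Binary suffixes are checked first; they end in 'i', so they never
    # collide with the single-letter decimal suffixes.
    for table in (BINARY_SUFFIXES, DECIMAL_SUFFIXES):
        for suffix, factor in table.items():
            if text.endswith(suffix):
                return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def expected_totals(annotations):
    """Derive node-wide GPU totals from per-GPU annotations (assumed rule)."""
    count = int(annotations["gpu.fluidos.eu/count"])
    cores = int(annotations["gpu.fluidos.eu/cores"]) * count
    memory = parse_quantity(annotations["gpu.fluidos.eu/memory-per-gpu"]) * count
    clock = parse_quantity(annotations["gpu.fluidos.eu/clock-speed"])
    return {
        "count": count,
        "cores": cores,
        "memory_bytes": int(memory),
        "clock_hz": int(clock),  # clock speed is not multiplied by the count
    }


if __name__ == "__main__":
    node_annotations = {
        "gpu.fluidos.eu/count": "8",
        "gpu.fluidos.eu/cores": "10752",
        "gpu.fluidos.eu/memory-per-gpu": "48Gi",
        "gpu.fluidos.eu/clock-speed": "1.80G",
    }
    totals = expected_totals(node_annotations)
    # Compare against the values reported in the Flavor dump.
    print(totals["cores"] == 86016)
    print(totals["memory_bytes"] == int(parse_quantity("384Gi")))
    print(totals["clock_hz"] == int(parse_quantity("1800M")))
```

A check like this only covers the numeric fields. In the Flavor dump above, `region` and `zone` also appear swapped relative to the `location.fluidos.eu/region` and `location.fluidos.eu/zone` annotations, which may be worth confirming during review.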

> For testing, should I follow the quickstart from that repo, or are there specific steps/examples for this PR to make validation easier?

Yes, we're going to release the code before the end of the week, which will make testing easier.

prometherion force-pushed the flare branch 2 times, most recently from f1636b5 to 4d0980b Compare August 14, 2025 09:52

frisso requested review from LorenzoMoro, rmedina97 and stefano81 August 16, 2025 09:17

prometherion force-pushed the flare branch from 4d0980b to d1af7e5 Compare August 22, 2025 09:23

prometherion added 10 commits August 24, 2025 16:30

fix: supporting disabled webhooks for controller

020aedc

Signed-off-by: Dario Tranchitella <[email protected]>

optim: ignoring not found errors when flavors are deleted

bde8505

Signed-off-by: Dario Tranchitella <[email protected]>

refactor: reusing same context for all runnables

e955d78

Signed-off-by: Dario Tranchitella <[email protected]>

feat: support create or update of peeringcandidate discovery

381acc9

Signed-off-by: Dario Tranchitella <[email protected]>

fix: allowing multiple peering from the same provider

eeaf8e4

Signed-off-by: Dario Tranchitella <[email protected]>

feat(api): expanding GPU traits with FLARE ones

e3f343d

Signed-off-by: Dario Tranchitella <[email protected]>

feat: idempotent reconciliation of flavors with FLARE traits

330790a

Signed-off-by: Dario Tranchitella <[email protected]>

feat: gpu traits filtering

78ce6ac

Signed-off-by: Dario Tranchitella <[email protected]>

optim: build and load to dynamic kind clusters

148dd23

Signed-off-by: Dario Tranchitella <[email protected]>

fix: permission +x for install_liqo.sh

b4db5ae

Signed-off-by: Dario Tranchitella <[email protected]>

prometherion force-pushed the flare branch from d1af7e5 to b4db5ae Compare August 24, 2025 14:30
