2 changes: 1 addition & 1 deletion qemu/Makefile
@@ -26,4 +26,4 @@ net:

run:
@echo "Running your node"
sudo ./vm.sh -g -n node-02 -c "farmer_id=$(id) version=v3 printk.devmsg=on runmode=dev nomodeset ssh-user=$(user)"
sudo ./vm.sh -g -n node-02 -c "farmer_id=$(id) version=v4 printk.devmsg=on runmode=dev nomodeset ssh-user=$(user)"
4 changes: 2 additions & 2 deletions qemu/README.md
@@ -2,7 +2,7 @@

| For a quick development docs check [here](../docs/development/README.md)

This folder contains a script that you can use to run 0-OS in a VM using qemu.
This folder contains a script that you can use to run 0-OS in a VM using qemu.

## Requirements

@@ -112,7 +112,7 @@ sudo dnsmasq --strict-order \
1. Now run your vm

```bash
sudo ./vm.sh -n node-01 -c "farmer_id=47 printk.devmsg=on runmode=dev ssh-user=<github username>"
sudo ./vm.sh -n node-01 -c "farmer_id=1 version=v4 printk.devmsg=on runmode=dev ssh-user=<github username>"
```

where `runmode` is one of `dev` , `test` or `prod`,
42 changes: 30 additions & 12 deletions specs/grid3/readme.md
@@ -1,14 +1,19 @@
# Introduction

This document is about tf-grid 3.0. It gives an overview of how the grid operates and how its components communicate with each other.

## Definitions

- 3node: a machine that runs zos operating system.
- RMB: reliable message bus
- grid-db: a decentralized blockchain database that allows nodes and other twins to share trusted data. Anyone can look up nodes, verify their identity, and find their corresponding twin IDs to communicate over RMB.

# Overview
## Overview

![Overlay](png/grid3-overlay.png)

The operation can be described as follows:

- Everything that needs to talk to other components should live on the yggdrasil network.
- Nodes and users have to create a “twin” object on GridDB, which is associated with a Yggdrasil IP. To communicate with any twin, its IP can then be looked up using the twin ID. This is basically how RMB works.
- When starting for the first time, the node needs to register itself on the grid database, which is a decentralized database built on top of substrate. The registration needs to include information about:
@@ -27,7 +32,9 @@ The operation can be described as follows:
- Node will also send consumption reports to the contract, the contract then can start billing the user.

![Sequence Diagram](png/sequence.png)
# 3node

## 3node

On first boot, the node needs to create a “twin” on the grid. A twin is associated with a public key, so it can create verifiable signed messages.
The node then registers itself as a “node” on the grid and links this twin to the node object.

@@ -36,16 +43,20 @@ Responses from the node **MUST** be signed with the twin secret key, a user then

Node uses the same verification mechanism against requests from twins.

# Concerns
## Concerns

- Stellar - Substrate token bridge. How should we do this? Transaction on the grid db will cost tokens (economic protection)
- Deployments cannot be migrated from one node to another. If a node goes down or is no longer reachable, it is up to the owner to delete the contract and recreate it somewhere else.

# Grid DB
## Definitions
## Grid DB

### Definitions

- `entity`: this represents a legal entity or a person, the entity is the public key of the user associated with name, country, and a unique identifier.
- `twin`: represents the management interface that can be accessed over the yggdrasil ipv6 address. A twin is associated also with a single public key.

On the grid, we build more sophisticated logic on top of the above concepts to represent the following (note: full type specifications can be found here):

- farm: a farm is associated
- entity-id: this defines the entity of the farm itself, so it holds information about the country, city, etc and public key.
- twin-id: the communication endpoint for this farm.
@@ -55,6 +66,7 @@ On the grid, we build on top of the above concepts a more sophisticated logics t
- farm-id: unique farm id this node is part of.

The grid database is a decentralized blockchain database that leverages substrate to provide an API that can be used to:

- Create Entities
- Create Twins
- Add / Remove entities from twins
@@ -66,38 +78,44 @@ Grid database is a decentralized blockchain database that leverage on substrate

A farm has one twin, through this twin the peer_id and entity can be requested. This is also the case for nodes as they have one farm.

# Public IPs
## Public IPs

Public IPs are locked by substrate from the farmer's IP pool. For this to be possible, the user needs to specify on contract creation how many IPs to lock up.

If the contract creation is successful, the contract managed to lock up the required number of IPs. The values of these IPs are then made available to this specific node.

Once the contract is terminated, the IPs are returned to the farmer's pool.
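
The lock-and-return behaviour described above can be sketched as a small in-memory model. This is only an illustration in Python, not the substrate implementation: the `FarmIPPool` class, its method names, and the example addresses are all hypothetical.

```python
class FarmIPPool:
    """Hypothetical model of a farmer's public IP pool."""

    def __init__(self, ips):
        self.free = list(ips)
        self.locked = {}  # contract id -> list of IPs locked for it

    def create_contract(self, contract_id, count):
        # Contract creation fails if the pool cannot cover the request.
        if count > len(self.free):
            raise ValueError("not enough free public IPs in the farm pool")
        self.locked[contract_id] = [self.free.pop(0) for _ in range(count)]
        return self.locked[contract_id]

    def terminate_contract(self, contract_id):
        # Terminating the contract returns its IPs to the pool.
        self.free.extend(self.locked.pop(contract_id, []))


pool = FarmIPPool(["192.0.2.10", "192.0.2.11"])
reserved = pool.create_contract(contract_id=1, count=2)
pool.terminate_contract(1)
```

A request for more IPs than the pool holds raises an error instead of creating a partial reservation, which mirrors the "contract creation is successful" condition in the text.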

# Pricing
## Pricing

Each farmer object is assigned a Pricing Policy object:
The pricing policy defines:

- Currency (TFT, USD, etc…) _we will probably drop that_
- Unit (Bytes, Kilobyte, Megabyte, Gigabytes)
- SU
- CU
- NU

## General notes:
## General notes

- Each price defines a price for a single UNIT/SECOND. So for example SU is the price of a single Storage Unit per second where a Storage Unit can be 1 Gigabyte
- NU is an exception because its price is PER unit (regardless of the number of seconds). For example, if NU is 10 and the Unit is Gigabyte, then 1 Gigabyte of network traffic costs 10 mil.
- Price is defined in `mil` of the currency. 1 UNIT = 10,000,000 mil

### Example:
### Example

Currency: TFT
SU: 1000 (mil tft)

(Price for 1 Gigabyte of ssd storage costs 1000 / 10 000 000 TFT per second)

So let's assume capacity is created at **Time = T**

- Node sends its first report at **T+s** with SU consumption C = (10 gigabytes)
- price = C/(1024*1024*1024) * s * SU
- price = 10 * s * SU
- price = `C/(1024*1024*1024) * s * SU`
- price = `10 * s * SU`
- Assume s is 5 min (300 seconds)
- Then price = 10 * 300 * 1000 = 3000000 mil = 3 TFT
- Then price = `10 * 300 * 1000` = 3000000 mil = 0.3 TFT (at 10 000 000 mil per TFT)

The same applies to every other unit EXCEPT NU. NU is a cumulative counter: the node keeps reporting the total number of bytes consumed since the machine started. To correctly calculate the consumption over a period S, use `now.NU - last.NU`.
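
The calculation can be sketched in Python to make the units explicit. This is an illustrative sketch, not the grid's billing code: the function names are hypothetical, and `MIL_PER_UNIT` is an assumption taken from the `1000 / 10 000 000` figure quoted in the example.

```python
GIB = 1024 * 1024 * 1024   # bytes per Gigabyte, as used in the example
MIL_PER_UNIT = 10_000_000  # assumption: mil per one unit of currency


def su_cost_mil(consumed_bytes, seconds, su_price_mil):
    """Cost in mil for storage held over one reporting period."""
    return consumed_bytes / GIB * seconds * su_price_mil


def nu_cost_mil(now_bytes, last_bytes, nu_price_mil):
    """Cost in mil for network traffic; NU reports are cumulative,
    so only the difference between two reports is billed."""
    return (now_bytes - last_bytes) / GIB * nu_price_mil


def mil_to_currency(mil):
    """Convert mil to whole currency units under the assumed ratio."""
    return mil / MIL_PER_UNIT


# 10 Gigabytes held for 300 seconds at SU = 1000 mil
storage_cost = su_cost_mil(10 * GIB, 300, 1000)
# 2 Gigabytes transferred since the last report at NU = 10 mil
network_cost = nu_cost_mil(3 * GIB, 1 * GIB, 10)
```

Note that `nu_cost_mil` has no time factor, which reflects the rule that NU is priced per unit regardless of the number of seconds.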
205 changes: 205 additions & 0 deletions specs/grid4/readme.md
@@ -0,0 +1,205 @@
# Grid v4 Documentation

## Introduction

This document provides an overview of tf-grid-v4, focusing on how nodes operate and communicate with each other outside of blockchain interactions. Grid v4 is the evolution of Grid3, with significant changes in how nodes register and communicate.

## Definitions

- **Node**: A machine that runs the ZOS (Zero-OS) operating system.
- **Registrar**: An HTTP server that handles node registration and version control.
- **Twin**: A digital representation of a node or user on the grid, associated with a public/private key pair.
- **Farm**: A collection of nodes operated by a single entity.

## Key Differences Between Grid3 and Grid4

| Feature | Grid3 | Grid4 |
|---------|-------|-------|
| Registration | Nodes register on blockchain (grid-db) | Nodes register on HTTP server (registrar) |
| Node-Registrar Communication | Uses RMB (Reliable Message Bus) | Uses signed HTTP requests |
| Node-to-Node Communication | Uses RMB (Reliable Message Bus) | Not implemented yet (planned to use Open RPC API) |
| Version Control | Through blockchain | Through registrar server |
| Contract Management | On blockchain | Not fully implemented yet |

## Overview

Grid v4 operates with the following key components:

1. **Node Identity**: Each node has a unique identity based on a public/private key pair.
2. **Registration**: Nodes register with the registrar server through signed HTTP requests.
3. **Version Control**: The registrar server manages version control for nodes.
4. **Workload Deployment**: Users can deploy workloads on nodes.

## Node Registration Process

When starting for the first time, the node needs to register itself on the node registrar, following these steps:

1. The `identityd` daemon generates or loads a key pair that represents the node's identity.
- This key pair is used for signing all communications with the registrar server.
- The public key serves as the node's unique identifier.

2. The `registrar_light` module collects node information:
- Which farm it belongs to
- Hardware capacity (CPU, memory, storage, GPU)
- Geographic location (obtained via geoip service)
- Network interfaces (name, MAC address, IPs)
- Hardware details (secure boot status, virtualization status, serial number)

3. The node sends this information to the registrar server through signed HTTP requests via the `registrar_gateway`:
- The request includes an authentication header with the signature.
- The registrar server validates the signature using the node's public key.

4. Account creation and management:
- If this is the first time the node connects, it creates a new account (twin) on the registrar.
- If the node already has an account, it ensures the account is active.
- The registrar assigns a twin ID to the node.

5. Node registration:
- The node registers itself with its farm ID, twin ID, resources, location, and interfaces.
- The registrar assigns a node ID to the node.
- If the node was previously registered, it updates its information.

6. Periodic updates:
- The node periodically checks its account status (every 30 minutes).
- The node updates its information on the registrar server (every 24 hours).
- The node updates uptime on the registrar server (every 40 hours).
- If the node's network address changes, it immediately re-registers.

Once the node's identity has been established, the different parties can communicate securely and verify each other's messages.
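
Steps 4 and 5 above (account creation, then node registration or update) can be sketched against an in-memory stand-in for the registrar. This is a simplified Python illustration, not the ZOS code: `FakeRegistrar`, its methods, and the sequential ID assignment are hypothetical, and request signing is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRegistrar:
    """In-memory stand-in for the registrar server (hypothetical API)."""
    accounts: dict = field(default_factory=dict)  # public key -> twin ID
    nodes: dict = field(default_factory=dict)     # twin ID -> (node ID, info)

    def ensure_account(self, public_key):
        # Step 4: create the account (twin) on first contact, else reuse it.
        if public_key not in self.accounts:
            self.accounts[public_key] = len(self.accounts) + 1
        return self.accounts[public_key]

    def register_node(self, twin_id, info):
        # Step 5: assign a node ID once; later calls only update the info.
        if twin_id in self.nodes:
            node_id, _ = self.nodes[twin_id]
        else:
            node_id = len(self.nodes) + 1
        self.nodes[twin_id] = (node_id, info)
        return node_id


def register(registrar, public_key, info):
    """Run the account step, then the node step, and return both IDs."""
    twin_id = registrar.ensure_account(public_key)
    node_id = registrar.register_node(twin_id, info)
    return twin_id, node_id


registrar = FakeRegistrar()
first = register(registrar, "node-public-key", {"farm_id": 1})
again = register(registrar, "node-public-key", {"farm_id": 1, "ips": ["192.0.2.5"]})
```

Calling `register` a second time with the same key returns the same twin and node IDs and only refreshes the stored information, which is the behaviour the periodic updates in step 6 rely on.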

## Node Architecture

The node runs several core modules that work together:

1. **identityd**: Manages node identity and cryptographic operations.
2. **registrar_light**: Handles node registration with the registrar server.
3. **noded**: Reports node resources and monitors system health.
4. **provisiond**: Manages workload deployments.
5. **storaged**: Handles disk and volume management.
6. **netlightd**: Manages network resources.
7. **vmd**: Manages virtual machines.
8. **contd**: Handles container deployments.
9. **flistd**: Manages file system mounts.
10. **powerd**: Manages power state.

## Version Control and Upgrades

Grid v4 manages ZOS versions and upgrades through the registrar:

1. **Version Management via Registrar**:
- The registrar server maintains the current approved version of ZOS.
- The registrar provides a `ZosVersion` object that includes:
- The current approved version string
- A `SafeToUpgrade` flag that controls rollout
- A list of test farms for A/B testing

2. **Update Detection**:
- The `upgrade` package in the node checks for updates periodically (every 60 minutes with jitter).
- The node compares its current version with the version from the registrar.
- If the versions differ and the `SafeToUpgrade` flag is true (or the node is on one of the predefined test farms), the update process begins.

3. **Update Process**:
- Updates are fetched from a hub server as flist packages.
- The node first downloads and installs all dependency packages.
- The ZOS package itself is updated last to ensure all dependencies are in place.
- The update process is atomic - either all packages are updated successfully, or none are.

4. **Safe Update Mechanism**:
- The node uses a multi-stage bootstrap process to ensure reliable updates.
- Updates are applied in a way that prevents interruption during critical operations.
- If an update fails, the node can roll back to the previous working state.
- The registrar can enable A/B testing by setting specific farms as test farms.
- Test farms receive updates first, allowing for validation before wider deployment.
- The `SafeToUpgrade` flag controls whether non-test farms should update.
- This allows for gradual rollout of updates across the grid.
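
The update decision described in steps 2 and 4 can be sketched as a single predicate. This is a Python illustration under stated assumptions, not the `upgrade` package itself: only the `SafeToUpgrade` flag and the test-farm list come from the text, while the field names, the `should_upgrade` function, and the version strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ZosVersion:
    """Shape of the registrar's version object (field names assumed)."""
    version: str
    safe_to_upgrade: bool
    test_farms: list = field(default_factory=list)


def should_upgrade(current_version, remote, farm_id):
    """Return True when the node should start the update process."""
    if current_version == remote.version:
        return False  # already on the approved version
    # Test farms upgrade first; everyone else waits for the flag.
    return remote.safe_to_upgrade or farm_id in remote.test_farms


remote = ZosVersion(version="v4.1.0", safe_to_upgrade=False, test_farms=[7])
on_test_farm = should_upgrade("v4.0.0", remote, farm_id=7)
on_other_farm = should_upgrade("v4.0.0", remote, farm_id=12)
```

With the flag off, only the node on farm 7 upgrades. Flipping `safe_to_upgrade` to true then rolls the same version out to the remaining farms.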

## Workload Provisioning

The provisioning system in Grid v4 handles the deployment of workloads. It is not fully implemented yet.

1. **Provisioning Engine**:
- The `provision` engine (`provisiond` module) manages the lifecycle of all workloads.
- It uses a queue-based system to process workload operations in the correct order.
- The engine persists all deployments and their states.

2. **Workload Types**:
- Workloads are defined as deployments with specific types:
- `ZMountType`: File system mounts
- `VolumeType`: Storage volumes
- `QuantumSafeFSType`: Secure file systems
- `NetworkLightType`: Network configurations
- `PublicIPv4Type`: Public IPv4 addresses
- `ZMachineLightType`: Virtual machines
- `ZLogsType`: Log forwarding

3. **Workload Operations**:
- The engine processes several types of operations:
- `Provision`: Deploy a new workload
- `Deprovision`: Remove an existing workload
- `Update`: Modify an existing workload
- `Pause`: Temporarily suspend a workload
- `Resume`: Reactivate a suspended workload

4. **Type Managers**:
- Each workload type has a dedicated manager that implements:
- `Provision`: Create the workload resources
- `Deprovision`: Clean up the workload resources
- Optional `Update`: Modify the workload without recreating it
- Optional `Pause`/`Resume`: Suspend/resume the workload

5. **Deployment Processing**:
- Workloads are processed in a specific order to ensure dependencies are met:
- Storage volumes are created first
- Networks are configured next
- VMs and containers are deployed last
- This ordering ensures that resources required by VMs/containers are available when needed.

6. **State Management**:
- The provisioning system maintains the state of all deployments:
- `StateOk`: Workload is running correctly
- `StateError`: Workload deployment failed
- `StatePaused`: Workload is temporarily suspended
- `StateDeleted`: Workload has been removed
- Each workload result includes the state, creation timestamp, and any error messages.
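
The dependency ordering from step 5 can be sketched as a sort by workload type. This is a minimal Python illustration, not the `provisiond` engine: only the storage, then network, then VM order comes from the text, while the rank given to the remaining types, the dictionary layout, and the `provision_order` helper are assumptions.

```python
# Lower rank = provisioned earlier. Only storage -> network -> VM is stated
# in the text; the ranks of the other types are assumptions.
PROVISION_RANK = {
    "VolumeType": 0,
    "ZMountType": 0,
    "QuantumSafeFSType": 0,
    "NetworkLightType": 1,
    "PublicIPv4Type": 1,
    "ZMachineLightType": 2,
    "ZLogsType": 3,
}


def provision_order(workloads):
    """Return workloads sorted so that dependencies come first.

    sorted() is stable, so workloads of equal rank keep their input order.
    """
    return sorted(workloads, key=lambda w: PROVISION_RANK[w["type"]])


deployment = [
    {"name": "vm", "type": "ZMachineLightType"},
    {"name": "net", "type": "NetworkLightType"},
    {"name": "disk", "type": "VolumeType"},
]
ordered_names = [w["name"] for w in provision_order(deployment)]
```

Deprovisioning would plausibly walk the same list in reverse, so that a VM is removed before the volume and network it depends on, although the text does not state this.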

## Communication Flow

1. **Client Interaction**:
- Users interact with the grid through client tools or APIs.
- The primary interface is zos-api-light.

2. **Node-Registrar Communication**:
- Nodes communicate with the registrar server through signed HTTP requests.
- The registrar validates the signature using the node's public key.
- This replaces the RMB (Reliable Message Bus) used in Grid3 for registration.

3. **Node-to-Node Communication**:
- Direct node-to-node communication is not yet implemented in Grid v4.
- Future versions will implement an Open RPC API for node-to-node communication.
- This will replace the RMB used in Grid3 for peer communication.

4. **Inter-Module Communication**:
- Within a node, modules communicate through a message bus (zbus) using Redis.
- This provides a reliable and efficient way for modules to interact.
- Each module exposes a set of methods that can be called by other modules.

5. **Status Reporting**:
- Nodes periodically report their status to the registrar server.
- This includes uptime, resource usage, and health information.
- The registrar uses this information to maintain an up-to-date view of the grid.

## Future Developments

While Grid v4 is operational, some components are still under development:

1. **Node-to-Node Communication**:
- Direct node-to-node communication is not yet implemented.
- A new Open RPC API will be developed to replace the RMB used in Grid3.
- This will enable peer-to-peer communication between nodes for distributed workloads.

2. **Deployments and Contract Management**:
- Contract management is not fully implemented yet.
- This will provide a way to manage agreements between users and nodes.
- It will include billing, resource allocation, and service level agreements.