
Commit 3d6138d

Merge pull request #696 from macrocosm-os/feat/more_detailed_vali_logs
Feat/more detailed vali logs
2 parents 1423f94 + 0f6182e commit 3d6138d

5 files changed: +605 −4 lines changed

common/constants.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -45,8 +45,8 @@
 # Date after which backwards compatibility for nested X content format will be removed
 X_ENHANCED_FORMAT_COMPATIBILITY_EXPIRATION_DATE = dt.datetime(2025, 9, 30, tzinfo=dt.timezone.utc)  # September 30, 2025 UTC

-# Date after which enhanced metadata completeness validation is required for organic outputs
-ENHANCED_METADATA_REQUIRED_DATE = dt.datetime(2025, 9, 23, tzinfo=dt.timezone.utc)  # September 23, 2025 UTC
+# Date after which filename format validation is enforced (data_{YYYYMMDD_HHMMSS}_{count}_{16hex}.parquet)
+FILENAME_FORMAT_REQUIRED_DATE = dt.datetime(2025, 12, 2, tzinfo=dt.timezone.utc)  # December 2, 2025 UTC (1 week from Nov 25, 2025)

 EVALUATION_ON_STARTUP = 15
```

docs/s3_validation.md

Lines changed: 247 additions & 0 deletions
# S3 Storage & Validation

## System Architecture

Miners upload scraped data to S3 via presigned URLs obtained from an auth server. Validators retrieve and validate this data through pagination-supported S3 access, performing comprehensive checks on format, content, and quality.

---

## Filename Format Specification

**Required Format:**
```
data_{YYYYMMDD_HHMMSS}_{record_count}_{16_char_hex}.parquet
```

**Components:**
- `YYYYMMDD`: Date (e.g., `20250804`)
- `HHMMSS`: Time (e.g., `150058`)
- `record_count`: Integer equal to the actual row count in the file
- `16_char_hex`: 16-character hexadecimal string

**Enforcement Date:** December 2, 2025
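
For illustration, a validator-side sketch of these checks in Python. The helper names `is_valid_filename_format` and `extract_count_from_filename` come from the validation rules below; the bodies here are a minimal reconstruction from the documented pattern, not the exact code in `vali_utils/s3_utils.py`:

```python
import re

# Anchored form of the pattern used in the validation flow below.
FILENAME_RE = re.compile(r"^data_\d{8}_\d{6}_(\d+)_[a-fA-F0-9]{16}\.parquet$")

def is_valid_filename_format(filename: str) -> bool:
    """Check a filename against data_{YYYYMMDD_HHMMSS}_{count}_{16hex}.parquet."""
    return FILENAME_RE.match(filename) is not None

def extract_count_from_filename(filename: str) -> int | None:
    """Return the record count claimed in the filename, or None if malformed."""
    m = FILENAME_RE.match(filename)
    return int(m.group(1)) if m else None

assert is_valid_filename_format("data_20250804_150058_1000_4a9f0c2d6e8b1357.parquet")
assert extract_count_from_filename("bad_name.parquet") is None
```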
---

## S3 Storage Structure

```
s3://bucket/data/hotkey={miner_hotkey}/job_id={job_id}/data_{timestamp}_{count}_{hash}.parquet
```

Files are organized by miner hotkey and job ID. Each job corresponds to a specific data collection task (e.g., scraping a Reddit politics subreddit or an X bittensor hashtag).
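
A small sketch of how an object key under this layout can be decomposed; the regex and helper are hypothetical (the real validator extracts these fields in `vali_utils/s3_utils.py`):

```python
import re

# Hypothetical parser for the object-key layout shown above.
KEY_RE = re.compile(r"data/hotkey=(?P<hotkey>[^/]+)/job_id=(?P<job_id>[^/]+)/(?P<filename>[^/]+)$")

def parse_object_key(key: str) -> dict | None:
    """Split an object key into hotkey, job_id, and filename fields."""
    m = KEY_RE.search(key)
    return m.groupdict() if m else None

key = "data/hotkey=5F3s.../job_id=reddit_politics/data_20250804_150058_1000_4a9f0c2d6e8b1357.parquet"
print(parse_object_key(key))  # {'hotkey': '5F3s...', 'job_id': 'reddit_politics', 'filename': 'data_...'}
```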
---

## Upload Flow

### Miner Side

1. **Authentication**
   - Request a presigned URL from the auth server (`/get-folder-access`)
   - Provide signature: `s3:data:access:{coldkey}:{hotkey}:{timestamp}`
   - Receive credentials with the folder path

2. **Data Preparation**
   - Create a parquet file from scraped data
   - Generate the filename: `data_{timestamp}_{len(df)}_{secrets.token_hex(8)}.parquet` (see the sketch after this list)
   - The record count must match the actual DataFrame length

3. **Upload**
   - Construct the S3 path: `job_id={job_id}/{filename}`
   - Upload via presigned URL with form fields
   - The auth server handles the actual S3 credentials

**Implementation:** `upload_utils/s3_uploader.py`, `upload_utils/s3_utils.py`
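
A minimal sketch of the data-preparation step, assuming pandas; the helper name `build_upload_filename` is illustrative, not the uploader's exact API (the real logic lives around lines 487-490 of `upload_utils/s3_uploader.py`):

```python
import datetime as dt
import secrets

import pandas as pd

def build_upload_filename(df: pd.DataFrame, now: dt.datetime | None = None) -> str:
    """Build data_{YYYYMMDD_HHMMSS}_{count}_{16hex}.parquet from a DataFrame."""
    now = now or dt.datetime.now(dt.timezone.utc)
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    # The count baked into the name must equal len(df), or validators
    # will flag a count mismatch after December 2, 2025.
    return f"data_{timestamp}_{len(df)}_{secrets.token_hex(8)}.parquet"

df = pd.DataFrame({"uri": ["https://example.com/1", "https://example.com/2"]})
print(build_upload_filename(df))  # e.g. data_20251125_093015_2_1f3a5c7e9b0d2468.parquet
```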
---

## Validation Flow

### Validator Side

1. **File Discovery** (Step 1)
   - List all files for the miner via `list_all_files_with_metadata(miner_hotkey)`
   - Supports pagination (1,000 files per page, the DigitalOcean Spaces limit)
   - Extract metadata: file path, size, last modified time

2. **Filename Format Check** (Step 1)
   - Validate each filename against the pattern `data_\d{8}_\d{6}_\d+_[a-fA-F0-9]{16}\.parquet$`
   - Collect invalid filenames
   - **After Dec 2, 2025:** Fail validation if any invalid filenames are found

3. **Dashboard Metrics Extraction** (Step 1)
   - Parse filenames to extract claimed record counts (see the sketch after this list)
   - Sum across all files: `total_claimed_records`
   - Log for monitoring and dashboard display

4. **Job Identification** (Step 1)
   - Extract job IDs from file paths
   - Match against expected jobs from Gravity
   - Calculate the job completion rate

5. **Content Sampling** (Step 4)
   - Download sample files via presigned URLs
   - Load parquet files into DataFrames

6. **Record Count Validation** (Step 4)
   - Extract the claimed count from the filename
   - Compare with the actual count: `len(df)`
   - **After Dec 2, 2025:** Track mismatches; fail validation if any are found

7. **Quality Checks** (Steps 4-6)
   - **Duplicate Detection:** Check for duplicate URIs across files
   - **Job Content Matching:** Verify data matches job requirements (hashtags, subreddits, etc.)
   - **Scraper Validation:** Re-scrape sample URIs to verify authenticity

**Implementation:** `vali_utils/s3_utils.py` (`S3Validator` class)
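
Step 3 can be sketched by reusing the filename helper from earlier; `total_claimed_records` here is an illustrative name, not the validator's exact function:

```python
def total_claimed_records(file_paths: list[str]) -> int:
    """Sum the record counts claimed in filenames, using metadata only."""
    total = 0
    for path in file_paths:
        filename = path.rsplit("/", 1)[-1]
        count = extract_count_from_filename(filename)  # sketched earlier
        if count is not None:
            total += count
    return total
```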
---

## Validation Rules

| Check | Requirement | Enforcement | Implementation |
|-------|-------------|-------------|----------------|
| Filename Format | `data_YYYYMMDD_HHMMSS_count_16hex.parquet` | Dec 2, 2025 | `is_valid_filename_format()` |
| Record Count | Claimed count = `len(df)` | Dec 2, 2025 | `extract_count_from_filename()` |
| Duplicate Rate | ≤10% | Active | URI deduplication across samples |
| Scraper Success | ≥80% | Active | Re-scrape sampled entities |
| Job Match Rate | ≥95% | Active | Content matching job criteria |

### Validation Logic

```python
# Step 1: format check
if invalid_filenames and now >= FILENAME_FORMAT_REQUIRED_DATE:
    return _create_failed_result("Invalid filename format")

# Step 4: count check (during duplicate detection)
if claimed_count != actual_count and now >= FILENAME_FORMAT_REQUIRED_DATE:
    track_mismatch()

# After Step 4: check tracked mismatches
if count_mismatches and now >= FILENAME_FORMAT_REQUIRED_DATE:
    return _create_failed_result("Record count validation failed")

# Continue to content validation (rates are percentages, 0-100)
if duplicate_percentage > 10:
    return _create_failed_result("Too many duplicates")

if scraper_success_rate < 80:
    return _create_failed_result("Low scraper success")

if job_match_rate < 95:
    return _create_failed_result("Poor job content match")
```
---

## Scoring & Incentives

### Validation as Gate

Validation acts as a binary gate. All checks must pass for a miner to receive any rewards:

```
IF validation_passes:
    miner_score = calculated_score
ELSE:
    miner_score = 0
```

### Score Calculation

For miners passing validation:

```
raw_score = data_type_scale_factor × time_scalar × scorable_bytes
credibility_boost = credibility^2.5
job_completion_multiplier = active_jobs / expected_jobs

final_score = raw_score × credibility_boost × job_completion_multiplier
```

### Reward Distribution

```
miner_reward = (miner_score / Σ(all_miner_scores)) × total_reward_pool
```
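
Putting the two formulas together as a worked example (all numbers are invented inputs, and the function is a sketch of the math above rather than the scorer's real API):

```python
def final_score(scale_factor: float, time_scalar: float, scorable_bytes: float,
                credibility: float, active_jobs: int, expected_jobs: int) -> float:
    """Compute raw_score x credibility_boost x job_completion_multiplier."""
    raw_score = scale_factor * time_scalar * scorable_bytes
    credibility_boost = credibility ** 2.5
    job_completion_multiplier = active_jobs / expected_jobs
    return raw_score * credibility_boost * job_completion_multiplier

# Two hypothetical miners that both passed validation.
scores = {
    "miner_a": final_score(0.55, 0.9, 1_000_000, 0.8, 9, 10),   # Reddit data
    "miner_b": final_score(0.35, 0.9, 2_000_000, 0.6, 10, 10),  # X data
}

# Each miner's reward is its share of the summed scores.
total_reward_pool = 1.0
rewards = {k: v / sum(scores.values()) * total_reward_pool for k, v in scores.items()}
print(rewards)
```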
### Key Parameters

- **Data Type Weights:** Reddit: 0.55, X: 0.35, YouTube: 0.10
- **Credibility Exponent:** 2.5
- **Min Evaluation Period:** 60 minutes
- **Data Age Limit:** 30 days (or job-specific range)

**See:** `docs/scoring.md` for the detailed scoring mechanism

---

## Impact Analysis

### Without Filename Validation

- Miners can inflate metrics with incorrect counts
- The dashboard shows inaccurate data volumes
- No way to verify claimed record counts without downloading all files
- Potential for gaming through filename manipulation

### With Filename Validation

- Record counts are verifiable without downloads (efficient)
- Dashboard metrics are accurate for planning and analysis
- Fraud detection via count-mismatch warnings
- Enforced standards improve data quality

### Example Scenario

**Miner claims a large dataset:**
```
Filename: data_20250804_150058_999999_4a9f0c2d6e8b1357.parquet
Claimed: 999,999 records
Actual: 100 records
```

**Before Dec 2, 2025:** Warning logged, no penalty
**After Dec 2, 2025:** Validation fails, miner_score = 0, reward = 0
---

## Implementation Reference

### Miner Implementation

- **Upload Logic:** `upload_utils/s3_uploader.py` (lines 472-515)
- **Auth & Credentials:** `upload_utils/s3_utils.py`
- **Filename Generation:** lines 487-490

### Validator Implementation

- **S3 Access:** `vali_utils/validator_s3_access.py`
- **Validation Logic:** `vali_utils/s3_utils.py` (`S3Validator` class)
- **Helper Functions:** lines 93-131 (`extract_count_from_filename`, `is_valid_filename_format`)
- **Format Validation:** lines 186-200
- **Count Validation:** lines 513-544, 256-263

### Configuration

- **Enforcement Date:** `common/constants.py` line 49
- **Thresholds:** `vali_utils/s3_utils.py` (duplicate: 10%, scraper: 80%, job_match: 95%)

---

## Technical Notes

### Pagination Support

Validators handle large datasets through pagination:
- S3 list operations are limited to 1,000 objects per page
- Continuation tokens are used for multi-page retrieval (see the sketch below)
- Total file counts can exceed 10,000 per miner

**Fix Applied:** PR #690 corrected a pagination bug where continuation tokens were mishandled
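
As a generic illustration of the continuation-token loop, sketched with boto3 (bucket and prefix values are placeholders; the validator's actual access layer is `vali_utils/validator_s3_access.py`):

```python
import boto3

def list_all_objects(bucket: str, prefix: str) -> list[dict]:
    """List every object under a prefix, following continuation tokens."""
    s3 = boto3.client("s3")
    objects, token = [], None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": 1000}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)
        objects.extend(page.get("Contents", []))
        if not page.get("IsTruncated"):  # no more pages
            return objects
        token = page["NextContinuationToken"]

# e.g. list_all_objects("my-spaces-bucket", "data/hotkey=5F3s.../")
```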
### Performance Optimizations

- Filename validation is performed on metadata only (no downloads)
- Record count validation runs only on sampled files (during the duplicate check)
- Dashboard metrics are extracted via regex on file paths (O(n) in the number of files)
vali_utils/miner_evaluator.py

Lines changed: 14 additions & 2 deletions
```diff
@@ -33,6 +33,7 @@
 from typing import List, Optional, Tuple
 from vali_utils.validator_s3_access import ValidatorS3Access
 from vali_utils.s3_utils import validate_s3_miner_data, get_s3_validation_summary, S3ValidationResult
+from vali_utils.s3_logging_utils import log_s3_validation_table, log_s3_validation_compact

 from rewards.miner_scorer import MinerScorer

@@ -298,10 +299,21 @@ async def _perform_s3_validation(
             use_enhanced_validation=True, config=self.config, s3_reader=self.s3_reader
         )

-        # Log results
+        # Log results with rich table
         summary = get_s3_validation_summary(s3_validation_result)
         bt.logging.info(f"{hotkey}: {summary}")
-
+
+        # Display rich table with detailed metrics
+        try:
+            log_s3_validation_table(
+                result=s3_validation_result,
+                uid=uid,
+                hotkey=hotkey,
+                pagination_stats=None  # Could add pagination stats if available
+            )
+        except Exception as e:
+            bt.logging.debug(f"Error displaying S3 validation table: {e}")
+
         if not s3_validation_result.is_valid and s3_validation_result.issues:
             bt.logging.debug(f"{hotkey}: S3 validation issues: {', '.join(s3_validation_result.issues[:3])}")
```