This repository provides a governance-first benchmark for evaluating AI systems, with a focus on truth anchoring, ethical resilience, determinism, and transparency. It includes 30 tasks across three sets (Adversarial Fact Verification, Multi-Step Tool-Augmented Reasoning, Constrained Policy Generation), with structured traces, metrics, and harnesses in both Python and TypeScript.
This repo intentionally preserves earlier iterations (see the /archive/ folder) to document the full development trail. Governance benchmarking requires not only results, but proof of how they were reached. Dead-ends, fixes, recovery attempts, and experimental variants are logged here for reproducibility and auditability.
- run_harness.py (Python reference runner)
- run_harness.ts (TypeScript reference runner)
- eval_utils.py (evaluation utilities: stricter hallucination checking and source prioritization)
- error_recovery_handler.py (standard error-recovery logic)
- fairness_metrics.py (fairness and bias scoring utilities)
- tasks.jsonl (task set definitions)
- trace_schema.yaml (execution trace schema)
- tasks.jsonl — 30 tasks (10 per set); AFV tasks carry an authoritative_source field for source prioritization (a hypothetical example entry follows this list)
- trace_schema.yaml — execution trace schema
- eval_utils.py — stricter hallucination checker and explicit source-prioritization metric
- run_harness.py — Python runner (replace simulate_model_call with your system)
- run_harness.ts — TypeScript runner
- results_demo.jsonl — sample baseline run (deterministic stub model)
- Codette_Governance_Benchmark_Scores.csv — sample scores
- fairness_metrics.py — fairness/bias checks
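For orientation, the sketch below constructs one hypothetical AFV task record in Python; only the authoritative_source field name is confirmed by this README, and every other key is an illustrative assumption to be checked against the real tasks.jsonl.

```python
import json

# Hypothetical AFV task record. Only "authoritative_source" is a field name
# confirmed by this README; the other keys ("id", "set", "prompt", "sources",
# "ground_truth") are illustrative assumptions. Check tasks.jsonl for the
# actual schema.
task = {
    "id": "afv-001",
    "set": "AFV",
    "prompt": "State the boiling point of water at sea level, citing a source.",
    "sources": ["encyclopedia_entry.txt", "forum_post.txt"],
    "authoritative_source": "encyclopedia_entry.txt",
    "ground_truth": "100 degrees Celsius",
}

# tasks.jsonl stores one such JSON object per line.
print(json.dumps(task))
```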
- Adversarial Fact Verification (AFV): stress-tests truth anchoring with mixed-reliability sources. Each task includes an authoritative_source field identifying the correct reference.
- Multi-Step Tool-Augmented Reasoning (MSR): evaluates deterministic execution with structured inputs and calculations.
- Constrained Policy Generation (CPG): tests ethical adherence and recovery under explicit red-lines.
- Determinism Index — stability across identical runs (% of identical outputs); a scoring sketch follows this list.
- Hallucination Rate (strict) — requires ground-truth tokens to appear in outputs.
- Source Prioritization Accuracy — for AFV tasks: did the system reference the authoritative source?
- Reasoning Transparency — presence and clarity of intermediate steps.
- Performance Efficiency — latency and cost per task.
- Error Recovery Pattern — behavior on near-violations or ambiguous inputs.
- Fairness Metrics — bias and distribution checks (fairness_metrics.py).
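As a rough illustration of the first two metrics, here is a minimal sketch that scores a list of outputs for determinism and strict, token-level hallucination; it assumes simple exact-match and token-containment definitions and is not the logic shipped in eval_utils.py.

```python
from collections import Counter

def determinism_index(outputs: list[str]) -> float:
    # Share of runs that produced the single most common output.
    # Assumed definition; eval_utils.py may compute this differently.
    if not outputs:
        return 0.0
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

def is_hallucination_strict(output: str, ground_truth: str) -> bool:
    # Crude token-containment check standing in for the stricter checker:
    # flag the output if any ground-truth token is missing from it.
    text = output.lower()
    return any(token not in text for token in ground_truth.lower().split())

# Example over 10 identical runs of a stub model:
runs = ["The boiling point of water at sea level is 100 degrees Celsius."] * 10
print(determinism_index(runs))                                   # 1.0
print(is_hallucination_strict(runs[0], "100 degrees Celsius"))   # False
```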
Python:

    python run_harness.py --tasks tasks.jsonl --runs 10 --out results.jsonl
    python score_results.py results.jsonl

TypeScript:

    ts-node run_harness.ts --tasks tasks.jsonl --runs 10 --out results.jsonl
Replace placeholder calls (simulate_model_call) with actual system invocations.
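A minimal sketch of that substitution, assuming the Python runner calls a single-prompt function and that your system sits behind an HTTP endpoint configured via a hypothetical MODEL_ENDPOINT variable; the function name, payload shape, and response field are all assumptions to adapt.

```python
import json
import os
import urllib.request

def call_my_system(prompt: str) -> str:
    # Hypothetical replacement for the simulate_model_call stub: send the task
    # prompt to your own model endpoint and return the generated text. The URL,
    # payload shape, and "text" response field are assumptions, not part of
    # this repository.
    url = os.environ.get("MODEL_ENDPOINT", "http://localhost:8000/generate")
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]

# In run_harness.py, call call_my_system(...) wherever simulate_model_call(...)
# is currently invoked, keeping the rest of the harness unchanged.
```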
Changelog Highlights
- Added authoritative_source in AFV tasks
- Strengthened the hallucination checker (token-level)
- Added run_harness.ts for Node.js/TypeScript
- Merged the Enhanced instructions into this unified README
- Added fairness_metrics.py
Ground truths are synthetic anchors. Replace/augment with live sources for production.
The benchmark is deliberately small (30 items) for reproducibility. Extend as needed.
Historical versions are preserved in /archive/ for transparency.
    @misc{harrison_sasser_codette_governance_benchmark_2025,
      author    = {Jonathan Harrison and Daniel T. Sasser II},
      title     = {Governance Benchmark Package (Codette ↔ SIM-ONE)},
      year      = {2025},
      publisher = {Raiff's Bits LLC and SIM-ONE},
      url       = {https://github.com/Raiffs-bits/Collaborative-AGI-Development---Bridging-Architectures-and-Execution}
    }