106 changes: 91 additions & 15 deletions docs/SERF_LONG_SHOT_PLAN.md
@@ -155,19 +155,21 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patterns:
14. **Cross-iteration UUID validation** -- Evaluation validates source_uuids against ALL historical UUIDs from all previous iterations, not just the current round
15. **Auto-scaling target block size** -- `effective_target = max(10, target_block_size // iteration)` creates tighter clusters in later rounds when remaining duplicates are harder to find
16. **UDTF factory pattern** -- `create_split_large_blocks_udtf()` dynamically creates PySpark UDTF classes with configurable return types and max block sizes
17. **SparkDantic schema bridge** -- `SparkModel` subclasses of Pydantic types automatically generate Spark schemas with IntegerType-to-LongType conversion. In SERF, this works bidirectionally: Pydantic-to-Spark for writing, and Spark-to-Pydantic for auto-generating entity types from input DataFrames
18. **Comprehensive analysis reports** -- 7-section blocking reports with input data, clustering params, distance distribution (percentiles), block size stats, Levenshtein analysis, sample blocks, and recommendations
19. **Per-stage timing and throughput** -- CLI tracks wall-clock time and throughput (companies/sec, blocks/sec) per pipeline stage, plus cross-iteration summary table
20. **Company name normalization** -- `cleanco` library for corporate suffix removal, multilingual stop word filtering for acronym generation, domain suffix removal for blocking keys
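The auto-scaling rule in item 15 is easy to sanity-check (a minimal sketch; the function name is illustrative):

```python
def effective_target(target_block_size: int, iteration: int) -> int:
    # Auto-scaling rule from item 15: floor-divide by iteration, never below 10
    return max(10, target_block_size // iteration)

# With a starting target of 100, later rounds demand tighter clusters:
sizes = [effective_target(100, i) for i in range(1, 6)]
# → [100, 50, 33, 25, 20]
```

The floor of 10 keeps late iterations from producing degenerate single-entity blocks.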

### 3.6 Patterns to Evolve in SERF

1. **BAML types -> fresh DSPy Pydantic types** -- Do NOT reuse Abzu's BAML-generated types (`abzu.baml_client.types`). Those types were auto-generated by BAML for a company-specific domain. SERF needs fresh, domain-agnostic Pydantic classes designed for DSPy signatures. Study Abzu's types for field patterns and ER metadata (source_ids, source_uuids, match_skip, match_skip_history), but build new classes from scratch.
2. **BAML templates -> DSPy signatures** -- Replace BAML templates with DSPy `Signature` classes + `BAMLAdapter` for output formatting. The signatures should use the new Pydantic types as input/output fields.
3. **Auto-generate entity types from DataFrames** -- When a user provides a PySpark DataFrame (or Parquet/CSV file), SERF should automatically infer Pydantic entity types from the DataFrame schema. This means inspecting `df.schema` (StructType), mapping Spark types to Python types, and generating a Pydantic class with the appropriate fields. The profiler (Section 5.3) identifies which fields are names, identifiers, dates, etc. -- this metadata enriches the generated type with DSPy field descriptions. Use this approach when it simplifies the user experience (e.g., `serf resolve --input data.parquet` should work without the user defining any types). For advanced use cases, users can define their own Pydantic entity types.
4. **Poetry -> uv** -- Replace Poetry with the faster, standards-compliant uv package manager
5. **PySpark 3.5 -> 4.1** -- Leverage VARIANT type, Spark Connect, Python Data Source API
6. **Parquet -> Iceberg** -- Add ACID transactions, time travel, schema evolution for iterative ER
7. **Company-only -> domain-agnostic** -- Generalize entity types beyond companies
8. **Manual orchestration -> agentic** -- DSPy ReAct agents control the pipeline dynamically

---

@@ -268,7 +270,8 @@ serf/
main.py # CLI entry point (exists, extend)
dspy/
__init__.py
types.py # Fresh Pydantic entity types for DSPy (exists, rewrite)
type_generator.py # Auto-generate entity types from DataFrame schemas (NEW)
baml_adapter.py # BAMLAdapter for DSPy (exists)
signatures.py # DSPy signatures for ER (NEW)
agents.py # DSPy ReAct agents (NEW)
@@ -306,6 +309,7 @@ serf/
tests/
test_config.py
test_types.py
test_type_generator.py
test_baml_adapter.py
test_embeddings.py
test_faiss_blocker.py
@@ -337,13 +341,19 @@ serf/

## 5. Data Model and Type System

**Important: SERF does NOT reuse Abzu's BAML-generated types.** Abzu's types were auto-generated by BAML for a company-specific SEC filing domain. SERF builds fresh Pydantic types designed for DSPy, preserving only the proven ER metadata patterns (source_ids, source_uuids, match_skip, match_skip_history).

### 5.1 Domain-Agnostic Entity Types

SERF should generalize beyond Abzu's Company-only model. The base entity type is inspired by [schema.org](https://schema.org/Person) property conventions:

```python
class Entity(BaseModel):
    """Base entity type for all resolvable entities.

    Domain-specific fields live in `attributes` or in subclasses.
    ER metadata fields (id, uuid, source_ids, etc.) are fixed across all domains.
    """

    id: int
    uuid: Optional[str] = None
    name: str
    # … additional ER metadata fields (source_ids, source_uuids, match_skip,
    # match_skip_reason) elided in this diff view
    match_skip_history: Optional[list[int]] = None  # iterations that skipped this entity
```

Specialized entity types extend this base. These are **examples** -- users can define their own or let SERF auto-generate them from DataFrame schemas:

```python
class Company(Entity):
    # … company-specific fields elided in this diff view

class Person(Entity):
    # … other person-specific fields elided in this diff view
    nationality: Optional[str] = None
```

### 5.1.1 Auto-Generating Entity Types from DataFrames

When the user provides a PySpark DataFrame (or Parquet/CSV file) without a custom entity type, SERF should auto-generate a Pydantic entity class from the DataFrame schema:

```python
def entity_type_from_spark_schema(
    schema: StructType,
    profile: DatasetProfile,
    entity_type_name: str = "AutoEntity",
) -> type[Entity]:
    """Generate a Pydantic Entity subclass from a Spark StructType schema.

    Uses the DatasetProfile to enrich fields with DSPy descriptions
    (e.g., marking a field as "name", "identifier", "date").

    Parameters
    ----------
    schema : StructType
        The Spark schema to convert
    profile : DatasetProfile
        Profiling results identifying field types and roles
    entity_type_name : str
        Name for the generated class

    Returns
    -------
    type[Entity]
        A dynamically created Pydantic subclass of Entity
    """
    ...
```

The type mapping from Spark to Python is straightforward:

| Spark Type | Python Type | Notes |
| --------------------- | --------------------- | --------------------------- |
| StringType | str | Default for most fields |
| LongType/IntegerType | int | Numeric identifiers, counts |
| DoubleType/FloatType | float | Revenue, percentages |
| BooleanType | bool | Flags |
| ArrayType(StringType) | list[str] | Tags, categories |
| StructType | nested Pydantic model | Auto-generated recursively |

ER metadata fields (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) are automatically added by the framework -- the user's schema only needs to contain domain fields. The profiler's `FieldProfile` provides DSPy-compatible descriptions for each field.
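The dynamic-class machinery behind this can be sketched with `pydantic.create_model`. This is a simplified illustration, not SERF's implementation: it uses a string-keyed stand-in for the real `pyspark.sql.types` objects, and the function and variable names are hypothetical:

```python
from typing import Optional

from pydantic import create_model

# Simplified string-keyed stand-in for the Spark -> Python mapping above;
# the real implementation would inspect pyspark.sql.types objects and recurse
# into ArrayType/StructType.
SPARK_TO_PY = {
    "string": str, "long": int, "integer": int,
    "double": float, "float": float, "boolean": bool,
}

def entity_type_from_fields(fields, name="AutoEntity"):
    """fields: list of (field_name, spark_type_name, nullable) tuples."""
    kwargs = {}
    for fname, spark_type, nullable in fields:
        py = SPARK_TO_PY.get(spark_type, str)
        # Nullable Spark fields become Optional with a None default;
        # required fields use Pydantic's `...` (no default) marker
        kwargs[fname] = (Optional[py], None) if nullable else (py, ...)
    return create_model(name, **kwargs)

AutoCompany = entity_type_from_fields(
    [("name", "string", False), ("revenue", "double", True)],
    name="AutoCompany",
)
c = AutoCompany(name="Acme Corp")  # revenue defaults to None (nullable field)
```

The real generator would additionally set `__base__=Entity` so the ER metadata fields come along for free, and attach the profiler's field descriptions via Pydantic `Field(description=...)`.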

### 5.2 Block and Match Types

```python
# … Block and Match type definitions elided in this diff view
```

@@ -597,7 +652,7 @@ Raw Entities -> Embed (Qwen3) -> FAISS IVF Cluster -> Blocks

### 7.3 Phase 2: Schema Alignment + Matching + Merging

All three operations in a single DSPy signature. The `schema_info` field is auto-generated from the Pydantic entity type (which itself may have been auto-generated from the input DataFrame schema via Section 5.1.1):

```python
class BlockMatch(dspy.Signature):
    """…steps 1-4 of the docstring elided in this diff view…
    5. Return ALL entities (merged + non-matched)
    """

    block_records: str = dspy.InputField(desc="JSON array of entity records in this block")
    schema_info: str = dspy.InputField(desc="Auto-generated description of entity fields and their roles from the Pydantic type and DatasetProfile")
    few_shot_examples: str = dspy.InputField(desc="Examples of correct merge behavior")
    resolution: BlockResolution = dspy.OutputField()
```
@@ -790,6 +845,26 @@ After 3 rounds with merge factor 0.8 per round:
- Comparison pairs shrink to 0.8^6 = 26.2% of original
- Each round is cheaper than the last due to smaller dataset
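The shrink arithmetic above checks out directly, assuming pairwise comparison cost scales with n²:

```python
merge_factor = 0.8
rounds = 3

remaining = merge_factor ** rounds        # entity count after 3 rounds
pairs = merge_factor ** (2 * rounds)      # comparison pairs scale as n^2

print(f"{remaining:.1%} of entities, {pairs:.1%} of comparison pairs")
# → 51.2% of entities, 26.2% of comparison pairs
```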

### 9.6 Overnight Build Budget Constraint

**Hard budget: $100 total Gemini API spend for the overnight build.**

A `GEMINI_API_KEY` environment variable will be provided. The agent must stay within budget by following these rules:

1. **Use Gemini 2.0 Flash exclusively** for all ER pipeline operations (blocking analysis, matching, merging, edge resolution). At $0.10/$0.40 per 1M input/output tokens, this allows ~160M+ input tokens -- more than enough for iterative ER across all three benchmark datasets.

2. **Gemini 2.5 Pro is allowed ONLY for generating validation data** -- high-quality labeled match/non-match pairs and few-shot examples that will be used to evaluate and optimize the pipeline. Limit Gemini 2.5 Pro to **fewer than 2,000 API calls** total. At ~2,500 tokens per call with $1.25/$10.00 per 1M input/output tokens, 2K calls cost roughly $50 -- leaving ample headroom for Flash usage.

> **Review comment (severity: medium):** There appears to be an inconsistency in the model names. This section specifies Gemini 2.5 Pro for generating validation data, while the 'Core Technologies' table in Section 4.1 lists Gemini 2.5 Flash Lite as the lightweight model. Harmonize the model names throughout the document, or clarify whether different models are intended for different purposes.


3. **Never use Claude, GPT-4o, or any non-Gemini model** for pipeline operations during the build. The DSPy signatures and pipeline code should be model-agnostic, but all actual LLM calls during this build session must go through Gemini.

4. **Track token usage** by logging input/output token counts from API responses. If cumulative spend approaches $80, stop making Gemini 2.5 Pro calls and finish remaining work with Flash only.

| Use Case | Model | Max Calls | Est. Cost |
| ------------------------------ | ---------------- | ------------------------- | ---------- |
| ER pipeline (match/merge/edge) | Gemini 2.0 Flash | Unlimited (within budget) | ~$10-30 |
| Validation data generation | Gemini 2.5 Pro | < 2,000 | ~$50 |
| **Total** | | | **< $100** |
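The cost table can be sanity-checked with back-of-envelope arithmetic. The per-call input/output split for Pro and the Flash token volumes below are assumptions, not figures from the plan:

```python
# $/token rates from the budget rules above
FLASH_IN, FLASH_OUT = 0.10 / 1e6, 0.40 / 1e6   # Gemini 2.0 Flash
PRO_IN, PRO_OUT = 1.25 / 1e6, 10.00 / 1e6      # Gemini 2.5 Pro

# Validation data: 2,000 Pro calls at ~2,500 tokens each, assumed output-heavy
pro_cost = 2_000 * (500 * PRO_IN + 2_000 * PRO_OUT)

# Pipeline: assume 100M input + 25M output Flash tokens across all runs
flash_cost = 100e6 * FLASH_IN + 25e6 * FLASH_OUT

total = pro_cost + flash_cost  # ≈ $61, comfortably under the $100 hard budget
```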

---

## 10. Implementation Plan
@@ -804,13 +879,14 @@ The following ordered steps should be executed by the Cursor Agent. Each step pr
4. Create all module directories with `__init__.py` files
5. Update `CLAUDE.md` to reflect new tooling (uv, Ruff)

### Step 2: Core Type System (1.5 hr)

1. Rewrite `src/serf/dspy/types.py` from scratch with fresh, DSPy-appropriate Pydantic types -- do NOT copy Abzu's BAML-generated types. Build domain-agnostic `Entity` base class with ER metadata (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) + example `Company`, `Person` specializations
2. Add `EntityBlock`, `MatchDecision`, `BlockResolution` types
3. Add `FieldProfile`, `DatasetProfile` types for dataset analysis
4. Add `IterationMetrics`, `BlockingMetrics` TypedDicts
5. Create `src/serf/dspy/type_generator.py` -- `entity_type_from_spark_schema()` that auto-generates a Pydantic Entity subclass from a Spark StructType + DatasetProfile. Maps Spark types to Python types, adds ER metadata fields automatically, generates DSPy field descriptions from profiling results
6. Write exhaustive unit tests: `tests/test_types.py`, `tests/test_type_generator.py`

### Step 3: DSPy Signatures and Adapter (1 hr)
