106 changes: 91 additions & 15 deletions docs/SERF_LONG_SHOT_PLAN.md
@@ -155,19 +155,21 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patterns:
14. **Cross-iteration UUID validation** -- Evaluation validates source_uuids against ALL historical UUIDs from all previous iterations, not just the current round
15. **Auto-scaling target block size** -- `effective_target = max(10, target_block_size // iteration)` creates tighter clusters in later rounds when remaining duplicates are harder to find
16. **UDTF factory pattern** -- `create_split_large_blocks_udtf()` dynamically creates PySpark UDTF classes with configurable return types and max block sizes
17. **SparkDantic schema bridge** -- `SparkModel` subclasses of Pydantic types automatically generate Spark schemas with IntegerType-to-LongType conversion. In SERF, this works bidirectionally: Pydantic-to-Spark for writing, and Spark-to-Pydantic for auto-generating entity types from input DataFrames
18. **Comprehensive analysis reports** -- 7-section blocking reports with input data, clustering params, distance distribution (percentiles), block size stats, Levenshtein analysis, sample blocks, and recommendations
19. **Per-stage timing and throughput** -- CLI tracks wall-clock time and throughput (companies/sec, blocks/sec) per pipeline stage, plus cross-iteration summary table
20. **Company name normalization** -- `cleanco` library for corporate suffix removal, multilingual stop word filtering for acronym generation, domain suffix removal for blocking keys
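The auto-scaling rule in item 15 is easy to sanity-check (a minimal sketch; the function name is illustrative):

```python
def effective_target(target_block_size: int, iteration: int) -> int:
    # Auto-scaling rule from item 15: floor-divide by iteration, never below 10
    return max(10, target_block_size // iteration)

# With a starting target of 100, later rounds demand tighter clusters:
sizes = [effective_target(100, i) for i in range(1, 6)]
# → [100, 50, 33, 25, 20]
```

The floor of 10 keeps late iterations from producing degenerate single-entity blocks.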

### 3.6 Patterns to Evolve in SERF

1. **BAML types -> fresh DSPy Pydantic types** -- Do NOT reuse Abzu's BAML-generated types (`abzu.baml_client.types`). Those types were auto-generated by BAML for a company-specific domain. SERF needs fresh, domain-agnostic Pydantic classes designed for DSPy signatures. Study Abzu's types for field patterns and ER metadata (source_ids, source_uuids, match_skip, match_skip_history), but build new classes from scratch.
2. **BAML templates -> DSPy signatures** -- Replace BAML templates with DSPy `Signature` classes + `BAMLAdapter` for output formatting. The signatures should use the new Pydantic types as input/output fields.
3. **Auto-generate entity types from DataFrames** -- When a user provides a PySpark DataFrame (or Parquet/CSV file), SERF should automatically infer Pydantic entity types from the DataFrame schema. This means inspecting `df.schema` (StructType), mapping Spark types to Python types, and generating a Pydantic class with the appropriate fields. The profiler (Section 5.3) identifies which fields are names, identifiers, dates, etc. -- this metadata enriches the generated type with DSPy field descriptions. Use this approach when it simplifies the user experience (e.g., `serf resolve --input data.parquet` should work without the user defining any types). For advanced use cases, users can define their own Pydantic entity types.
4. **Poetry -> uv** -- Replace Poetry with the faster, standards-compliant uv package manager
5. **PySpark 3.5 -> 4.1** -- Leverage VARIANT type, Spark Connect, Python Data Source API
6. **Parquet -> Iceberg** -- Add ACID transactions, time travel, schema evolution for iterative ER
7. **Company-only -> domain-agnostic** -- Generalize entity types beyond companies
8. **Manual orchestration -> agentic** -- DSPy ReAct agents control the pipeline dynamically

---

@@ -268,7 +270,8 @@ serf/
main.py # CLI entry point (exists, extend)
dspy/
__init__.py
types.py # Fresh Pydantic entity types for DSPy (exists, rewrite)
type_generator.py # Auto-generate entity types from DataFrame schemas (NEW)
baml_adapter.py # BAMLAdapter for DSPy (exists)
signatures.py # DSPy signatures for ER (NEW)
agents.py # DSPy ReAct agents (NEW)
@@ -306,6 +309,7 @@ serf/
tests/
test_config.py
test_types.py
test_type_generator.py
test_baml_adapter.py
test_embeddings.py
test_faiss_blocker.py
@@ -337,13 +341,19 @@ serf/

## 5. Data Model and Type System

**Important: SERF does NOT reuse Abzu's BAML-generated types.** Abzu's types were auto-generated by BAML for a company-specific SEC filing domain. SERF builds fresh Pydantic types designed for DSPy, preserving only the proven ER metadata patterns (source_ids, source_uuids, match_skip, match_skip_history).

### 5.1 Domain-Agnostic Entity Types

SERF should generalize beyond Abzu's Company-only model. The base entity type is inspired by [schema.org](https://schema.org/Person) property conventions:

```python
class Entity(BaseModel):
    """Base entity type for all resolvable entities.

    Domain-specific fields live in `attributes` or in subclasses.
    ER metadata fields (id, uuid, source_ids, etc.) are fixed across all domains.
    """

    id: int
    uuid: Optional[str] = None
    name: str
    # … additional ER metadata fields (source_ids, source_uuids, match_skip,
    # match_skip_reason) elided in this diff view
    match_skip_history: Optional[list[int]] = None  # iterations that skipped this entity
```

Specialized entity types extend this base. These are **examples** -- users can define their own or let SERF auto-generate them from DataFrame schemas:

```python
class Company(Entity):
    # … company-specific fields elided in this diff view

class Person(Entity):
    # … other person-specific fields elided in this diff view
    nationality: Optional[str] = None
```

### 5.1.1 Auto-Generating Entity Types from DataFrames

When the user provides a PySpark DataFrame (or Parquet/CSV file) without a custom entity type, SERF should auto-generate a Pydantic entity class from the DataFrame schema:

```python
def entity_type_from_spark_schema(
    schema: StructType,
    profile: DatasetProfile,
    entity_type_name: str = "AutoEntity",
) -> type[Entity]:
    """Generate a Pydantic Entity subclass from a Spark StructType schema.

    Uses the DatasetProfile to enrich fields with DSPy descriptions
    (e.g., marking a field as "name", "identifier", "date").

    Parameters
    ----------
    schema : StructType
        The Spark schema to convert
    profile : DatasetProfile
        Profiling results identifying field types and roles
    entity_type_name : str
        Name for the generated class

    Returns
    -------
    type[Entity]
        A dynamically created Pydantic subclass of Entity
    """
    ...
```

The type mapping from Spark to Python is straightforward:

| Spark Type | Python Type | Notes |
| --------------------- | --------------------- | --------------------------- |
| StringType | str | Default for most fields |
| LongType/IntegerType | int | Numeric identifiers, counts |
| DoubleType/FloatType | float | Revenue, percentages |
| BooleanType | bool | Flags |
| ArrayType(StringType) | list[str] | Tags, categories |
| StructType | nested Pydantic model | Auto-generated recursively |

ER metadata fields (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) are automatically added by the framework -- the user's schema only needs to contain domain fields. The profiler's `FieldProfile` provides DSPy-compatible descriptions for each field.
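The dynamic-class machinery behind this can be sketched with `pydantic.create_model`. This is a simplified illustration, not SERF's implementation: it uses a string-keyed stand-in for the real `pyspark.sql.types` objects, and the function and variable names are hypothetical:

```python
from typing import Optional

from pydantic import create_model

# Simplified string-keyed stand-in for the Spark -> Python mapping above;
# the real implementation would inspect pyspark.sql.types objects and recurse
# into ArrayType/StructType.
SPARK_TO_PY = {
    "string": str, "long": int, "integer": int,
    "double": float, "float": float, "boolean": bool,
}

def entity_type_from_fields(fields, name="AutoEntity"):
    """fields: list of (field_name, spark_type_name, nullable) tuples."""
    kwargs = {}
    for fname, spark_type, nullable in fields:
        py = SPARK_TO_PY.get(spark_type, str)
        # Nullable Spark fields become Optional with a None default;
        # required fields use Pydantic's `...` (no default) marker
        kwargs[fname] = (Optional[py], None) if nullable else (py, ...)
    return create_model(name, **kwargs)

AutoCompany = entity_type_from_fields(
    [("name", "string", False), ("revenue", "double", True)],
    name="AutoCompany",
)
c = AutoCompany(name="Acme Corp")  # revenue defaults to None (nullable field)
```

The real generator would additionally set `__base__=Entity` so the ER metadata fields come along for free, and attach the profiler's field descriptions via Pydantic `Field(description=...)`.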

### 5.2 Block and Match Types

```python
# … Block and Match type definitions elided in this diff view
```

@@ -597,7 +652,7 @@ Raw Entities -> Embed (Qwen3) -> FAISS IVF Cluster -> Blocks

### 7.3 Phase 2: Schema Alignment + Matching + Merging

All three operations in a single DSPy signature. The `schema_info` field is auto-generated from the Pydantic entity type (which itself may have been auto-generated from the input DataFrame schema via Section 5.1.1):

```python
class BlockMatch(dspy.Signature):
    """…steps 1-4 of the docstring elided in this diff view…
    5. Return ALL entities (merged + non-matched)
    """

    block_records: str = dspy.InputField(desc="JSON array of entity records in this block")
    schema_info: str = dspy.InputField(desc="Auto-generated description of entity fields and their roles from the Pydantic type and DatasetProfile")
    few_shot_examples: str = dspy.InputField(desc="Examples of correct merge behavior")
    resolution: BlockResolution = dspy.OutputField()
```
@@ -790,6 +845,26 @@ After 3 rounds with merge factor 0.8 per round:
- Comparison pairs shrink to 0.8^6 = 26.2% of original
- Each round is cheaper than the last due to smaller dataset
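The shrink arithmetic above checks out directly, assuming pairwise comparison cost scales with n²:

```python
merge_factor = 0.8
rounds = 3

remaining = merge_factor ** rounds        # entity count after 3 rounds
pairs = merge_factor ** (2 * rounds)      # comparison pairs scale as n^2

print(f"{remaining:.1%} of entities, {pairs:.1%} of comparison pairs")
# → 51.2% of entities, 26.2% of comparison pairs
```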

### 9.6 Overnight Build Budget Constraint

**Hard budget: $100 total Gemini API spend for the overnight build.**

A `GEMINI_API_KEY` environment variable will be provided. The agent must stay within budget by following these rules:

1. **Use Gemini 2.0 Flash exclusively** for all ER pipeline operations (blocking analysis, matching, merging, edge resolution). At $0.10/$0.40 per 1M input/output tokens, this allows ~160M+ input tokens -- more than enough for iterative ER across all three benchmark datasets.

2. **Gemini 2.5 Pro is allowed ONLY for generating validation data** -- high-quality labeled match/non-match pairs and few-shot examples that will be used to evaluate and optimize the pipeline. Limit Gemini 2.5 Pro to **fewer than 2,000 API calls** total. At ~2,500 tokens per call with $1.25/$10.00 per 1M input/output tokens, 2K calls cost roughly $50 -- leaving ample headroom for Flash usage.

> **Review comment (severity: medium):** There appears to be an inconsistency in the model names. This section specifies Gemini 2.5 Pro for generating validation data, while the 'Core Technologies' table in Section 4.1 lists Gemini 2.5 Flash Lite as the lightweight model. Harmonize the model names throughout the document, or clarify whether different models are intended for different purposes.


3. **Never use Claude, GPT-4o, or any non-Gemini model** for pipeline operations during the build. The DSPy signatures and pipeline code should be model-agnostic, but all actual LLM calls during this build session must go through Gemini.

4. **Track token usage** by logging input/output token counts from API responses. If cumulative spend approaches $80, stop making Gemini 2.5 Pro calls and finish remaining work with Flash only.

| Use Case | Model | Max Calls | Est. Cost |
| ------------------------------ | ---------------- | ------------------------- | ---------- |
| ER pipeline (match/merge/edge) | Gemini 2.0 Flash | Unlimited (within budget) | ~$10-30 |
| Validation data generation | Gemini 2.5 Pro | < 2,000 | ~$50 |
| **Total** | | | **< $100** |
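The cost table can be sanity-checked with back-of-envelope arithmetic. The per-call input/output split for Pro and the Flash token volumes below are assumptions, not figures from the plan:

```python
# $/token rates from the budget rules above
FLASH_IN, FLASH_OUT = 0.10 / 1e6, 0.40 / 1e6   # Gemini 2.0 Flash
PRO_IN, PRO_OUT = 1.25 / 1e6, 10.00 / 1e6      # Gemini 2.5 Pro

# Validation data: 2,000 Pro calls at ~2,500 tokens each, assumed output-heavy
pro_cost = 2_000 * (500 * PRO_IN + 2_000 * PRO_OUT)

# Pipeline: assume 100M input + 25M output Flash tokens across all runs
flash_cost = 100e6 * FLASH_IN + 25e6 * FLASH_OUT

total = pro_cost + flash_cost  # ≈ $61, comfortably under the $100 hard budget
```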

---

## 10. Implementation Plan
@@ -804,13 +879,14 @@ The following ordered steps should be executed by the Cursor Agent. Each step pr
4. Create all module directories with `__init__.py` files
5. Update `CLAUDE.md` to reflect new tooling (uv, Ruff)

### Step 2: Core Type System (1.5 hr)

1. Rewrite `src/serf/dspy/types.py` from scratch with fresh, DSPy-appropriate Pydantic types -- do NOT copy Abzu's BAML-generated types. Build domain-agnostic `Entity` base class with ER metadata (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) + example `Company`, `Person` specializations
2. Add `EntityBlock`, `MatchDecision`, `BlockResolution` types
3. Add `FieldProfile`, `DatasetProfile` types for dataset analysis
4. Add `IterationMetrics`, `BlockingMetrics` TypedDicts
5. Create `src/serf/dspy/type_generator.py` -- `entity_type_from_spark_schema()` that auto-generates a Pydantic Entity subclass from a Spark StructType + DatasetProfile. Maps Spark types to Python types, adds ER metadata fields automatically, generates DSPy field descriptions from profiling results
6. Write exhaustive unit tests: `tests/test_types.py`, `tests/test_type_generator.py`

### Step 3: DSPy Signatures and Adapter (1 hr)
