
Serf long shot plan system #16

Draft
rjurney wants to merge 41 commits into rjurney/v0.1 from cursor/serf-long-shot-plan-system-b0d4

Conversation

@rjurney
Contributor

@rjurney rjurney commented Mar 8, 2026

Implement the complete SERF entity resolution system and establish benchmark baselines to fulfill the long-shot plan and prepare for PyPI.

This PR converts the project to uv, integrates Ruff, and adds core modules for blocking, matching, merging, evaluation, analysis, edge resolution, Spark integration, DSPy agents, and a CLI. Baseline F1 scores were established on the DBLP-ACM, Abt-Buy, and DBLP-Scholar datasets using embedding similarity.
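An embedding-similarity baseline of this kind can be pictured as thresholded cosine similarity over candidate pairs. The sketch below is illustrative only: the function names, the toy two-dimensional vectors (standing in for real sentence-transformer embeddings), and the 0.8 threshold are assumptions, not the project's actual code or settings.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def match_by_similarity(pairs, threshold: float = 0.8):
    """Keep candidate pairs whose embedding similarity clears the threshold."""
    return [(i, j) for (i, j, ea, eb) in pairs if cosine(ea, eb) >= threshold]


# Toy vectors stand in for real sentence-transformer embeddings.
candidates = [
    (1, 2, [1.0, 0.0], [0.9, 0.1]),  # near-duplicate records
    (1, 3, [1.0, 0.0], [0.0, 1.0]),  # unrelated records
]
print(match_by_similarity(candidates))  # -> [(1, 2)]
```

Precision, recall, and F1 then follow from comparing the predicted pairs against a gold mapping such as the DeepMatcher perfect-mapping files.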



cursoragent and others added 10 commits March 8, 2026 04:34
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…te config and module structure

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ipeline with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ger with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tion with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…, download commands

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s, fix text column detection

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
@cursor

cursor bot commented Mar 8, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the complete initial implementation of the Semantic Entity Resolution Framework (SERF). It establishes the core architecture for agentic entity resolution, leveraging modern Python tooling and AI frameworks. The changes enable comprehensive data processing, intelligent matching, and robust evaluation, laying the groundwork for future enhancements and PyPI publication.

Highlights

  • Dependency Management Migration: Migrated the project's dependency management and build system from Poetry to uv and Hatchling, updating pyproject.toml and related configuration files.
  • Linting and Formatting Tooling: Replaced black, isort, and flake8 with Ruff for all linting and formatting, streamlining the pre-commit hooks and development workflow.
  • Core Entity Resolution Modules: Implemented foundational modules for the Semantic Entity Resolution Framework (SERF), including semantic blocking (embeddings, FAISS), LLM-based matching and merging, evaluation metrics, dataset analysis, and edge resolution.
  • DSPy Agent Integration: Integrated DSPy agents for dynamic orchestration of the ER pipeline, enabling intelligent control over blocking parameters, matching strategies, and convergence detection.
  • Comprehensive CLI and Benchmarking: Introduced a robust command-line interface (CLI) with commands for analyzing datasets, performing blocking, matching, evaluation, and running full ER pipelines, alongside a system for benchmarking against standard datasets.
  • Spark Integration: Added initial Spark integration components, including Pydantic-to-Spark schema conversion, graph algorithms for connected components, and Iceberg table read/write functionalities.
  • Project Documentation and Licensing: Updated README.md with detailed architecture, quick start guides, and benchmark results. Added an Apache License 2.0 file and a comprehensive BUILD_PLAN.md.
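The Spark integration highlight above mentions graph algorithms for connected components, which is how pairwise match edges are collapsed into entity clusters. The following is a plain-Python union-find sketch of that idea for intuition only; it is not the project's Spark implementation, and all names in it are illustrative.

```python
def connected_components(edges):
    """Union-find over match edges; returns {node: component_root}."""
    parent: dict = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes the root

    for a, b in edges:
        union(a, b)
    return {n: find(n) for n in parent}


# Edges 1-2 and 2-3 chain into one cluster; 4-5 forms a second cluster.
print(connected_components([(1, 2), (2, 3), (4, 5)]))  # -> {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

In a Spark setting the same result is typically computed iteratively over DataFrames rather than with an in-memory parent table.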
Changelog
  • .gitignore
    • Added new ignore rules for Python build artifacts, virtual environments, uv lock files, and common IDE directories.
  • .pre-commit-config.yaml
    • Replaced black, flake8, and isort hooks with ruff for linting and formatting.
    • Updated zuban hook to use uv run for execution.
  • CLAUDE.md
    • Updated development instructions to reflect the migration to uv and ruff.
    • Replaced references to BAML with DSPy for LLM-related code guidance.
    • Removed the section on BAML Client Generation.
  • LICENSE
    • Added the Apache License 2.0 for the project.
  • README.md
    • Updated the project title and added license and Python version badges.
    • Rewrote the 'Features' section to detail the agentic ER phases.
    • Updated the 'Architecture' table to reflect new tooling like uv, PySpark 4.x, DSPy 3.x, Qwen3-Embedding, FAISS, and Ruff.
    • Revised 'Quick Start' and 'Development' sections to use uv and ruff commands.
    • Added 'Benchmark Results' and 'Project Structure' sections.
    • Updated 'References' and 'License' information.
  • config.yml
    • Expanded configuration with new sections for models (embedding, LLM, temperature), er (blocking, matching, evaluation, paths), and benchmarks (output directory, dataset definitions).
  • docs/BUILD_PLAN.md
    • Added a detailed build plan outlining the phased implementation of SERF, from infrastructure setup to PyPI preparation.
  • pyproject.toml
    • Migrated the project's build system from Poetry to Hatchling.
    • Updated project metadata including license, authors, classifiers, and URLs.
    • Revised dependencies to include dspy-ai, click, pyyaml, pyspark, sentence-transformers, faiss-cpu, cleanco, tqdm, numpy, pandas.
    • Updated development dependencies to pytest, pytest-asyncio, ruff, zuban, pre-commit, types-pyyaml.
    • Configured ruff for linting and formatting rules.
  • scripts/generate_benchmark_data.py
    • Added a new script to generate synthetic benchmark datasets (DBLP-ACM, Walmart-Amazon, DBLP-Scholar) in DeepMatcher format for testing.
  • scripts/run_benchmarks.py
    • Added a new script to execute the SERF pipeline on benchmark datasets, including embedding, FAISS blocking, and evaluation.
  • src/serf/analyze/__init__.py
    • Added the __init__.py file to define the analyze module and expose DatasetProfiler and detect_field_type.
  • src/serf/analyze/field_detection.py
    • Added a new module with detect_field_type function for inferring data types based on field names and values.
  • src/serf/analyze/profiler.py
    • Added a new DatasetProfiler class to analyze dataset characteristics, including completeness, uniqueness, and recommended ER fields.
  • src/serf/block/embeddings.py
    • Added a new EntityEmbedder class for generating entity embeddings using sentence-transformers with device auto-detection.
  • src/serf/block/faiss_blocker.py
    • Added a new FAISSBlocker class for clustering entity embeddings into blocks using FAISS IndexIVFFlat with auto-scaling capabilities.
  • src/serf/block/normalize.py
    • Added a new module for various name normalization utilities, including corporate suffix removal, acronym generation, and domain suffix stripping.
  • src/serf/block/pipeline.py
    • Added a new SemanticBlockingPipeline class to orchestrate the embedding, clustering, and splitting of entity blocks.
  • src/serf/cli/main.py
    • Expanded the CLI with new commands for analyze, block, match, eval, edges, resolve (full pipeline), benchmark, and download.
    • Updated existing command implementations to integrate with the new SERF modules.
  • src/serf/config.py
    • Refined type hints in the Config class, replacing Optional and Union with native Python type union syntax.
    • Updated exception handling to use from err for better traceback clarity.
  • src/serf/dspy/agents.py
    • Added a new ERAgent class that uses DSPy ReAct to control the entity resolution pipeline dynamically.
  • src/serf/dspy/baml_adapter.py
    • Modified format_field_structure to correctly iterate over input fields.
  • src/serf/dspy/signatures.py
    • Added new DSPy signatures (BlockMatch, EntityMerge, EdgeResolve, AnalyzeDataset) to define LLM input/output contracts for ER tasks.
  • src/serf/dspy/type_generator.py
    • Added a new module with entity_type_from_spark_schema to dynamically create Pydantic Entity subclasses from Spark schemas.
  • src/serf/dspy/types.py
    • Replaced previous BAML-generated types with new core Pydantic types (Entity, EntityBlock, MatchDecision, BlockResolution, FieldProfile, DatasetProfile, IterationMetrics, BlockingMetrics) for the SERF pipeline.
  • src/serf/edge/__init__.py
    • Added the __init__.py file to define the edge module and expose EdgeResolver.
  • src/serf/edge/resolver.py
    • Added a new EdgeResolver class for grouping and resolving duplicate edges using an LLM after entity merging.
  • src/serf/eval/__init__.py
    • Added the __init__.py file to define the eval module.
  • src/serf/eval/benchmarks.py
    • Added a new BenchmarkDataset class for managing, downloading, and converting standard ER benchmark datasets.
  • src/serf/eval/metrics.py
    • Added a new module for calculating standard entity resolution evaluation metrics like precision, recall, F1 score, and reduction ratio.
  • src/serf/match/__init__.py
    • Added the __init__.py file to define the match module and expose EntityMatcher, UUIDMapper, and few-shot example functions.
  • src/serf/match/few_shot.py
    • Added a new module for generating and formatting few-shot examples to guide LLM matching behavior.
  • src/serf/match/matcher.py
    • Added a new EntityMatcher class to resolve entity blocks using DSPy's BlockMatch signature, including UUID mapping and async processing.
  • src/serf/match/uuid_mapper.py
    • Added a new UUIDMapper class to handle the conversion of entity IDs to integers for LLM processing and their restoration.
  • src/serf/merge/__init__.py
    • Added the __init__.py file to define the merge module and expose EntityMerger.
  • src/serf/merge/merger.py
    • Added a new EntityMerger class for combining multiple entities into a single canonical record, prioritizing complete field values.
  • src/serf/spark/__init__.py
    • Added the __init__.py file to define the spark module.
  • src/serf/spark/graph.py
    • Added a new module with Spark graph algorithms, specifically for finding connected components.
  • src/serf/spark/iceberg.py
    • Added a new module for Iceberg integration, providing utilities for SparkSession configuration and table operations.
  • src/serf/spark/schemas.py
    • Added a new module for bridging Pydantic and Spark schemas, including type conversion and schema normalization utilities.
  • src/serf/spark/utils.py
    • Added a new module with shared Spark utilities, such as splitting large blocks and selecting most common properties.
  • tests/test_agents.py
    • Added new tests for the ERAgent class, covering initialization, signature fields, and tool function behavior.
  • tests/test_benchmarks.py
    • Added new tests for the BenchmarkDataset class, verifying dataset availability, creation, evaluation, and entity conversion.
  • tests/test_cli.py
    • Added new tests for the serf command-line interface, checking help messages and basic command functionality.
  • tests/test_dspy.py
    • Updated the lm fixture's type hint and docstring for clarity.
  • tests/test_edge_resolver.py
    • Added new tests for the EdgeResolver class, covering edge grouping and resolution logic.
  • tests/test_embeddings.py
    • Added new tests for the EntityEmbedder class and get_torch_device function.
  • tests/test_faiss_blocker.py
    • Added new tests for the FAISSBlocker class, verifying block creation, auto-scaling, and ID preservation.
  • tests/test_few_shot.py
    • Added new tests for few-shot example generation and formatting functions.
  • tests/test_field_detection.py
    • Added new tests for the detect_field_type function, covering various data types and heuristics.
  • tests/test_graph.py
    • Added new tests for the Spark connected_components function.
  • tests/test_merger.py
    • Added new tests for the EntityMerger class, verifying entity merging logic, source ID/UUID accumulation, and value selection.
  • tests/test_metrics.py
    • Added new tests for entity resolution evaluation metrics, ensuring correct calculation of precision, recall, and F1 score.
  • tests/test_normalize.py
    • Added new tests for name normalization functions, including handling of whitespace, punctuation, unicode, and corporate/domain suffixes.
  • tests/test_profiler.py
    • Added new tests for the DatasetProfiler class, covering dataset analysis and field recommendations.
  • tests/test_schemas.py
    • Added new tests for Spark schema utilities, including Pydantic-to-Spark conversion and schema validation.
  • tests/test_signatures.py
    • Added new tests for DSPy signature definitions, verifying input/output fields and predictor creation.
  • tests/test_type_generator.py
    • Added new tests for entity_type_from_spark_schema and spark_type_to_python functions.
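The changelog describes a UUIDMapper that converts entity IDs to integers for LLM processing and restores them afterwards. The minimal sketch below illustrates that round-trip idea only; the actual class in src/serf/match/uuid_mapper.py is richer (transitive source_ids/source_uuids collection, dedup), and every name here is assumed for illustration.

```python
class UUIDMapperSketch:
    """Map long entity UUIDs to small integers for compact LLM prompts,
    then restore the original UUIDs from the LLM's integer answers."""

    def __init__(self) -> None:
        self._to_int: dict[str, int] = {}
        self._to_uuid: dict[int, str] = {}

    def map(self, uuid: str) -> int:
        # Assign the next small integer the first time a UUID is seen.
        if uuid not in self._to_int:
            n = len(self._to_int) + 1
            self._to_int[uuid] = n
            self._to_uuid[n] = uuid
        return self._to_int[uuid]

    def unmap(self, n: int) -> str:
        return self._to_uuid[n]


m = UUIDMapperSketch()
a = m.map("550e8400-e29b-41d4-a716-446655440000")
b = m.map("6fa459ea-ee8a-3ca4-894e-db77e160355e")
print(a, b)                 # -> 1 2
print(m.unmap(a))           # -> 550e8400-e29b-41d4-a716-446655440000
```

Shorter integer IDs keep prompts small and avoid the LLM mangling long hexadecimal strings when it reports which entities match.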

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This is a massive and impressive pull request that implements the core SERF entity resolution system. The migration to uv and ruff is a great modernization step. The new modules for blocking, matching, evaluation, and the comprehensive CLI are well-structured and follow good Python practices. My review focuses on a few key areas to improve robustness and maintainability: ensuring the CLI can handle various data inputs without crashing, correcting a potentially problematic assumption in the ID mapping logic, centralizing configuration, and a minor style fix in a script. Overall, this is a fantastic contribution that lays a solid foundation for the project.

```python
entities.append(
    Entity(
        id=int(row_dict.get("id", idx)),  # type: ignore[arg-type]
```


high

The conversion int(row_dict.get("id", idx)) assumes that if an id column exists, its values can be cast to an integer. This will raise a ValueError if the IDs are non-integer strings (e.g., UUIDs), causing the CLI to crash. It would be more robust to handle this potential ValueError.

Comment on lines +96 to +100

```python
if missing_ids and resolution.resolved_entities:
    first = resolution.resolved_entities[0]
    existing_sources = set(first.source_ids or [])
    first_sources = list(existing_sources | missing_ids)
    resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})
```


high

The "Phase 1" recovery for missing entities in unmap_block assumes that any entity missing from the LLM's output should be considered merged into the first resolved entity. This is a strong assumption that could lead to incorrect data provenance, as the LLM might have dropped the entity for other reasons (e.g., context length). The "Phase 2" recovery, which re-adds the missing entity and marks it as skipped, is a much safer approach. I recommend removing Phase 1 to avoid incorrect source attribution.

Suggested change

```diff
-if missing_ids and resolution.resolved_entities:
-    first = resolution.resolved_entities[0]
-    existing_sources = set(first.source_ids or [])
-    first_sources = list(existing_sources | missing_ids)
-    resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})
+# if missing_ids and resolution.resolved_entities:
+#     first = resolution.resolved_entities[0]
+#     existing_sources = set(first.source_ids or [])
+#     first_sources = list(existing_sources | missing_ids)
+#     resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})
```

```python
matched_right = [e for e in right_entities if e.id in gt_right_ids]
unmatched_right = [e for e in right_entities if e.id not in gt_right_ids]
sample_size = max(0, max_entities - len(matched_right))
import random
```


medium

The import random statement is located inside the run_benchmark function. According to PEP 8, imports should be placed at the top of the file. This improves readability and makes it easier to see the script's dependencies at a glance. Please move this import to the top-level of the script.

Comment on lines +18 to +49

```python
DATASET_REGISTRY: dict[str, dict[str, str]] = {
    "dblp-acm": {
        "url": "https://dbs.uni-leipzig.de/files/datasets/DBLP-ACM.zip",
        "table_a_name": "DBLP2.csv",
        "table_b_name": "ACM.csv",
        "mapping_name": "DBLP-ACM_perfectMapping.csv",
        "mapping_col_a": "idDBLP",
        "mapping_col_b": "idACM",
        "domain": "bibliographic",
        "difficulty": "easy",
    },
    "dblp-scholar": {
        "url": "https://dbs.uni-leipzig.de/files/datasets/DBLP-Scholar.zip",
        "table_a_name": "DBLP1.csv",
        "table_b_name": "Scholar.csv",
        "mapping_name": "DBLP-Scholar_perfectMapping.csv",
        "mapping_col_a": "idDBLP",
        "mapping_col_b": "idScholar",
        "domain": "bibliographic",
        "difficulty": "medium",
    },
    "abt-buy": {
        "url": "https://dbs.uni-leipzig.de/files/datasets/Abt-Buy.zip",
        "table_a_name": "Abt.csv",
        "table_b_name": "Buy.csv",
        "mapping_name": "abt_buy_perfectMapping.csv",
        "mapping_col_a": "idAbt",
        "mapping_col_b": "idBuy",
        "domain": "products",
        "difficulty": "hard",
    },
}
```


medium

The DATASET_REGISTRY is hardcoded within this file. However, there is a benchmarks.datasets section in config.yml that seems to define the same information. This creates a discrepancy and a maintainability issue, as changes to benchmark datasets would need to be made in two places. To centralize configuration, this registry should be loaded from config.yml.
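One possible shape for the reviewer's suggestion: read the registry out of the benchmarks.datasets section of config.yml with PyYAML (already a project dependency per pyproject.toml). The config keys shown in the snippet are assumptions for illustration, not the file's verified contents.

```python
import yaml  # PyYAML; listed among the project's dependencies

# Inline stand-in for config.yml (structure assumed for illustration).
CONFIG_SNIPPET = """
benchmarks:
  datasets:
    dblp-acm:
      url: https://dbs.uni-leipzig.de/files/datasets/DBLP-ACM.zip
      domain: bibliographic
      difficulty: easy
"""


def load_dataset_registry(text: str) -> dict:
    """Parse YAML and return benchmarks.datasets, so the dataset
    registry is defined in exactly one place."""
    cfg = yaml.safe_load(text) or {}
    return cfg.get("benchmarks", {}).get("datasets", {})


registry = load_dataset_registry(CONFIG_SNIPPET)
print(registry["dblp-acm"]["difficulty"])  # -> easy
```

In the real script the text would come from reading config.yml from disk, with the hardcoded dict kept only as a fallback if desired.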

cursoragent and others added 17 commits March 8, 2026 16:27
…r F1=0.90 (embedding baseline)

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…flag, remove scripts/ directory

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…h optional YAML config

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…M for all matching

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tput flag

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…th validation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…fix analyze LLM guidance, add auto-convergence

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…arison test, fix analyze config generation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ig for agentic override

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…All matching via LLM.

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…hing

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…g (Gemini Flash, 30 concurrent)

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…se 1 recovery, use optional-dependencies, rename _resolve_blocks_with_llm, document FAISS type ignores

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… block splitting for small targets

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
cursoragent and others added 14 commits March 8, 2026 22:01
…king

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
- pipeline.py: Assign UUIDs at entity creation, track all historical UUIDs
- uuid_mapper.py: Transitive source_ids/source_uuids collection, dedup, exclude self
- merger.py: Dedup source_ids/source_uuids, exclude master's own ID/UUID
- matcher.py: Add iteration param to resolve_block/resolve_blocks, set match_skip_history
- metrics.py: Add validate_source_uuids function
- Tests for dedup, self-exclusion, transitive accumulation, and UUID validation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…analysis, UUID validation, PASS/FAIL checks

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ss-cpu

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… FAISS compatibility

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… processes to fix macOS MPS segfault

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…NING.md from Eridu lessons

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ngual-e5-base, remove all pip references

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…P=0.885 R=0.581 F1=0.701

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…rvice profiles

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… defenses, validate LLM config output, deduplicate CLI helpers

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>