Conversation
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…te config and module structure Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ipeline with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ger with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tion with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…, download commands Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s, fix text column detection Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces the complete initial implementation of the Semantic Entity Resolution Framework (SERF). It establishes the core architecture for agentic entity resolution, leveraging modern Python tooling and AI frameworks. The changes enable comprehensive data processing, intelligent matching, and robust evaluation, laying the groundwork for future enhancements and PyPI publication.
Code Review
This is a massive and impressive pull request that implements the core SERF entity resolution system. The migration to uv and ruff is a great modernization step. The new modules for blocking, matching, evaluation, and the comprehensive CLI are well-structured and follow good Python practices. My review focuses on a few key areas to improve robustness and maintainability: ensuring the CLI can handle various data inputs without crashing, correcting a potentially problematic assumption in the ID mapping logic, centralizing configuration, and a minor style fix in a script. Overall, this is a fantastic contribution that lays a solid foundation for the project.
src/serf/cli/main.py
Outdated
```python
]
entities.append(
    Entity(
        id=int(row_dict.get("id", idx)),  # type: ignore[arg-type]
```
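The review summary flags that the CLI should handle various data inputs without crashing, and this cast is one such spot: `int(row_dict.get("id", idx))` raises `ValueError` for a non-numeric id such as `"DBLP-123"` in a bibliographic CSV. A minimal sketch of a safer fallback (the helper name is hypothetical, not part of the PR):

```python
def coerce_entity_id(row_dict: dict, idx: int) -> int:
    """Return the row's "id" as an int, falling back to the row index.

    Hypothetical helper: avoids crashing the CLI when a CSV supplies
    non-numeric ids such as "DBLP-123".
    """
    raw = row_dict.get("id", idx)
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Non-numeric or missing id: use the stable row index instead.
        return idx
```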
src/serf/match/uuid_mapper.py
Outdated
```python
if missing_ids and resolution.resolved_entities:
    first = resolution.resolved_entities[0]
    existing_sources = set(first.source_ids or [])
    first_sources = list(existing_sources | missing_ids)
    resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})
```
The "Phase 1" recovery for missing entities in `unmap_block` assumes that any entity missing from the LLM's output should be considered merged into the first resolved entity. This is a strong assumption that could lead to incorrect data provenance, as the LLM might have dropped the entity for other reasons (e.g., context length). The "Phase 2" recovery, which re-adds the missing entity and marks it as skipped, is a much safer approach. I recommend removing Phase 1 to avoid incorrect source attribution.
Suggested change:

```diff
-if missing_ids and resolution.resolved_entities:
-    first = resolution.resolved_entities[0]
-    existing_sources = set(first.source_ids or [])
-    first_sources = list(existing_sources | missing_ids)
-    resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})
+# if missing_ids and resolution.resolved_entities:
+#     first = resolution.resolved_entities[0]
+#     existing_sources = set(first.source_ids or [])
+#     first_sources = list(existing_sources | missing_ids)
+#     resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})
```
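The "Phase 2" recovery the reviewer prefers can be sketched roughly as follows. This is a simplified stand-in, not the project's actual code: the `ResolvedEntity` model and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResolvedEntity:
    # Simplified stand-in for the project's model; field names are assumptions.
    id: int
    source_ids: list[int] = field(default_factory=list)
    skipped: bool = False


def recover_missing(resolved: list[ResolvedEntity], input_ids: set[int]) -> list[ResolvedEntity]:
    """Re-add entities the LLM dropped, marked as skipped, rather than
    attributing them to the first resolved entity (the removed "Phase 1")."""
    seen = {sid for e in resolved for sid in ([e.id] + e.source_ids)}
    for missing in sorted(input_ids - seen):
        # Re-added with no merge provenance, flagged for a later retry.
        resolved.append(ResolvedEntity(id=missing, source_ids=[], skipped=True))
    return resolved
```

This keeps provenance honest: a dropped entity carries no fabricated `source_ids` and can be retried in a later iteration instead of being silently merged.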
scripts/run_benchmarks.py
Outdated
```python
matched_right = [e for e in right_entities if e.id in gt_right_ids]
unmatched_right = [e for e in right_entities if e.id not in gt_right_ids]
sample_size = max(0, max_entities - len(matched_right))
import random
```
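The "minor style fix in a script" mentioned in the review summary likely concerns the function-level `import random` shown above, which PEP 8 places at the top of the module. A hedged sketch of how this sampling step might look after the fix, with a fixed seed for reproducible benchmark runs (the function name and seed value are assumptions):

```python
import random  # module-level import, per the style fix


def sample_candidates(matched_right: list, unmatched_right: list,
                      max_entities: int, seed: int = 42) -> list:
    """Keep all ground-truth matches, then pad with a reproducible random
    sample of unmatched entities up to max_entities."""
    sample_size = max(0, max_entities - len(matched_right))
    rng = random.Random(seed)  # seeded so benchmark subsets are repeatable
    padding = rng.sample(unmatched_right, min(sample_size, len(unmatched_right)))
    return matched_right + padding
```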
```python
DATASET_REGISTRY: dict[str, dict[str, str]] = {
    "dblp-acm": {
        "url": "https://dbs.uni-leipzig.de/files/datasets/DBLP-ACM.zip",
        "table_a_name": "DBLP2.csv",
        "table_b_name": "ACM.csv",
        "mapping_name": "DBLP-ACM_perfectMapping.csv",
        "mapping_col_a": "idDBLP",
        "mapping_col_b": "idACM",
        "domain": "bibliographic",
        "difficulty": "easy",
    },
    "dblp-scholar": {
        "url": "https://dbs.uni-leipzig.de/files/datasets/DBLP-Scholar.zip",
        "table_a_name": "DBLP1.csv",
        "table_b_name": "Scholar.csv",
        "mapping_name": "DBLP-Scholar_perfectMapping.csv",
        "mapping_col_a": "idDBLP",
        "mapping_col_b": "idScholar",
        "domain": "bibliographic",
        "difficulty": "medium",
    },
    "abt-buy": {
        "url": "https://dbs.uni-leipzig.de/files/datasets/Abt-Buy.zip",
        "table_a_name": "Abt.csv",
        "table_b_name": "Buy.csv",
        "mapping_name": "abt_buy_perfectMapping.csv",
        "mapping_col_a": "idAbt",
        "mapping_col_b": "idBuy",
        "domain": "products",
        "difficulty": "hard",
    },
}
```
The `DATASET_REGISTRY` is hardcoded within this file. However, there is a `benchmarks.datasets` section in `config.yml` that seems to define the same information. This creates a discrepancy and a maintainability issue, as changes to benchmark datasets would need to be made in two places. To centralize configuration, this registry should be loaded from `config.yml`.
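Such a loader could be a minimal sketch like the following, assuming PyYAML is available and `config.yml` contains a `benchmarks.datasets` mapping whose entries mirror the hardcoded registry (the exact schema is an assumption):

```python
import yaml  # PyYAML; assumed available in the project environment


def load_dataset_registry(config_path: str = "config.yml") -> dict[str, dict[str, str]]:
    """Read the benchmark dataset registry from config.yml instead of
    hardcoding it, so there is a single source of truth."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    # KeyError here would surface a config/schema mismatch early.
    return config["benchmarks"]["datasets"]
```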
…r F1=0.90 (embedding baseline) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…flag, remove scripts/ directory Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…h optional YAML config Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…M for all matching Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tput flag Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…th validation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…fix analyze LLM guidance, add auto-convergence Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…arison test, fix analyze config generation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ig for agentic override Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…All matching via LLM. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…hing Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…g (Gemini Flash, 30 concurrent) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…se 1 recovery, use optional-dependencies, rename _resolve_blocks_with_llm, document FAISS type ignores Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… block splitting for small targets Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…king Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
- pipeline.py: Assign UUIDs at entity creation, track all historical UUIDs
- uuid_mapper.py: Transitive source_ids/source_uuids collection, dedup, exclude self
- merger.py: Dedup source_ids/source_uuids, exclude master's own ID/UUID
- matcher.py: Add iteration param to resolve_block/resolve_blocks, set match_skip_history
- metrics.py: Add validate_source_uuids function
- Tests for dedup, self-exclusion, transitive accumulation, and UUID validation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
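The dedup/self-exclusion rule this commit describes for merger.py can be sketched as follows; the helper is hypothetical and only illustrates the rule (union the source ids, drop duplicates, exclude the master's own ID), not the project's actual merger code:

```python
def merge_source_ids(master_id: int, *id_lists: list[int]) -> list[int]:
    """Union source-id lists, dropping duplicates and the master's own id,
    mirroring the dedup/self-exclusion behavior described for merger.py."""
    merged: set[int] = set()
    for ids in id_lists:
        merged.update(ids or [])
    merged.discard(master_id)  # exclude the master's own ID
    return sorted(merged)
```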
…analysis, UUID validation, PASS/FAIL checks Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ss-cpu Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… FAISS compatibility Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… processes to fix macOS MPS segfault Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…NING.md from Eridu lessons Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ngual-e5-base, remove all pip references Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…P=0.885 R=0.581 F1=0.701 Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…rvice profiles Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… defenses, validate LLM config output, deduplicate CLI helpers Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Implement the complete SERF entity resolution system and establish benchmark baselines to fulfill the long-shot plan and prepare for PyPI.
The system includes conversion to uv, Ruff integration, core modules for blocking, matching, merging, evaluation, analysis, edge resolution, Spark integration, DSPy agents, and a CLI. Baseline F1 scores were established on the DBLP-ACM, Abt-Buy, and DBLP-Scholar datasets using embedding similarity.