Draft
41 commits
21414d2 Add build plan for SERF implementation (cursoragent, Mar 8, 2026)
61ffbbe Convert from Poetry to uv, replace black/isort/flake8 with Ruff, upda… (cursoragent, Mar 8, 2026)
a4cfc4b Add pipeline types, type generator, and DSPy signatures with tests (cursoragent, Mar 8, 2026)
ae87c7a Add blocking module: embeddings, FAISS blocker, name normalization, p… (cursoragent, Mar 8, 2026)
998f491 Add matching and merging modules: UUID mapper, matcher, few-shot, mer… (cursoragent, Mar 8, 2026)
d4c5378 Add evaluation metrics, benchmarks, dataset analysis, and edge resolu… (cursoragent, Mar 8, 2026)
21c8c84 Add Spark integration (schemas, utils, iceberg, graph) and DSPy agent… (cursoragent, Mar 8, 2026)
9300375 Add complete CLI with analyze, block, match, eval, resolve, benchmark… (cursoragent, Mar 8, 2026)
68ec92b Update benchmarks to use Leipzig dataset source, add benchmark script… (cursoragent, Mar 8, 2026)
fb9707f Add README, LICENSE, PyPI packaging setup (cursoragent, Mar 8, 2026)
81c823e Add benchmark results: DBLP-ACM F1=0.83, Abt-Buy F1=0.46, DBLP-Schola… (cursoragent, Mar 8, 2026)
ae8f8c7 Port scripts into CLI: add benchmark-all command, --use-llm/--no-llm … (cursoragent, Mar 8, 2026)
68d1ecf Add serf run command for end-to-end ER on any CSV/Parquet/Iceberg wit… (cursoragent, Mar 8, 2026)
4070f39 Remove embedding-based matching: use embeddings only for blocking, LL… (cursoragent, Mar 8, 2026)
de0c251 Enhance serf analyze to generate LLM-powered ER config YAML with --ou… (cursoragent, Mar 8, 2026)
b3aaea1 Add integration tests: LLM-generated ER config from benchmark data wi… (cursoragent, Mar 8, 2026)
c9b3487 Add Publication/Product types, fix block sizes (target=30, max=100), … (cursoragent, Mar 8, 2026)
65d8876 Fix DSPy threading with dspy.context, add head-to-head benchmark comp… (cursoragent, Mar 8, 2026)
c09cd83 Default to name-only embedding for blocking, add blocking_fields conf… (cursoragent, Mar 8, 2026)
9685616 Add critical rule: embeddings for blocking only, never for matching. … (cursoragent, Mar 8, 2026)
13b95a1 Set max_tokens=8192 for LLM matcher to prevent output truncation (cursoragent, Mar 8, 2026)
ca166d7 Add --limit and --concurrency options, tqdm progress bar for LLM matc… (cursoragent, Mar 8, 2026)
b3f8cb0 Benchmark results: DBLP-ACM P=0.895 R=0.625 F1=0.736 with LLM matchin… (cursoragent, Mar 8, 2026)
782317c Address Gemini code review: fix string ID crash, remove incorrect Pha… (cursoragent, Mar 8, 2026)
b079972 Auto-scale block size for --limit test runs: target=5 when limit<=20 (cursoragent, Mar 8, 2026)
556d9dd Extract predicted pairs from source_ids as well as matches, fix FAISS… (cursoragent, Mar 8, 2026)
32e7555 Convert async tests to use pytest-asyncio with @pytest.mark.asyncio (cursoragent, Mar 8, 2026)
e83ec35 Add SCALABILITY.md: vector engine recommendations for beyond-RAM bloc… (cursoragent, Mar 8, 2026)
39ff446 Add rigorous source ID and UUID tracking across pipeline (cursoragent, Mar 8, 2026)
bc27560 Port Abzu er_eval.py rigor: comprehensive evaluator with dedup, skip … (cursoragent, Mar 8, 2026)
3d7a3d0 Improve install instructions: add pip and conda paths, note about fai… (cursoragent, Mar 8, 2026)
d2bd6d0 Add pyspark-mcp dependency (cursoragent, Mar 8, 2026)
fe6d385 Fix FAISS segfault on macOS: force CPU encoding, contiguous array for… (cursoragent, Mar 9, 2026)
23193fb Subprocess isolation for PyTorch/FAISS: embed and cluster in separate… (cursoragent, Mar 9, 2026)
490893d Switch to intfloat/multilingual-e5-large embedding model, add FINE_TU… (cursoragent, Mar 9, 2026)
8a18631 Externalize all model names to config.yml, switch to intfloat/multili… (cursoragent, Mar 9, 2026)
18721eb Document: never use pip or uv pip, only uv add/sync/run (cursoragent, Mar 9, 2026)
fb14ea5 Benchmark results with multilingual-e5-base + Gemini Flash: DBLP-ACM … (cursoragent, Mar 9, 2026)
c2bca43 Dockerize: Dockerfile on Ubuntu 24.04 with uv, docker-compose with se… (cursoragent, Mar 9, 2026)
a42d5fa Address Gemini review round 2: fix ruff version, add prompt injection… (cursoragent, Mar 9, 2026)
9bde1a4 Add QUICKSTART.md: end-to-end guide for using SERF (cursoragent, Mar 10, 2026)
18 changes: 18 additions & 0 deletions .dockerignore
@@ -0,0 +1,18 @@
.venv/
venv/
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/
.git/
.idea/
.vscode/
data/
logs/
*.swp
*.swo
.DS_Store
.claude/
.mcp.json
uv.lock
21 changes: 21 additions & 0 deletions .gitignore
@@ -4,9 +4,30 @@ logs
# Any data we store in the course of an ER pipeline
data

# Python
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/
*.egg

# Virtual environments
.venv/
venv/

# uv
uv.lock

# Ignore Claude Code Settings and MCP setup
.mcp.json
.claude

# Ignore Mac crap
.DS_Store

# IDE
.idea/
.vscode/
*.swp
*.swo
38 changes: 8 additions & 30 deletions .pre-commit-config.yaml
@@ -1,42 +1,20 @@
 repos:
-  - repo: local
-    hooks:
-      - id: black
-        name: black
-        entry: black
-        language: system
-        types: [python]
-  - repo: local
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.15.5
     hooks:
-      - id: flake8
-        name: flake8
-        entry: flake8
-        language: system
-        types: [python]
-  - repo: local
-    hooks:
-      - id: isort
-        name: isort
-        entry: isort
-        language: system
-        types: [python]
+      - id: ruff
+        args: [--fix]
+      - id: ruff-format
   - repo: local
     hooks:
       - id: zuban
         name: zuban
-        entry: zuban check src/serf tests
+        entry: uv run zuban check src/serf tests
         language: system
         types: [python]
-  - repo: local
-    hooks:
-      - id: pytest
-        name: pytest
-        entry: pytest
-        language: system
-        types: [python]
-  # Prettier - formats Markdown (and other files)
+        pass_filenames: false
   - repo: https://github.com/pre-commit/mirrors-prettier
-    rev: v3.1.0 # Use the latest version
+    rev: v3.1.0
     hooks:
       - id: prettier
         types_or: [markdown]
53 changes: 24 additions & 29 deletions CLAUDE.md
@@ -6,15 +6,15 @@ This file provides guidance to Claude Code (claude.ai/code) when working with th
 
 ### Development
 
-- Install Dependencies: `poetry install`
-- Run CLI: `poetry run serf`
-- Build/Generate serf/baml_client code: `baml-cli generate`
-- Test baml_src code: `baml-cli test`
-- Test all: `poetry run pytest tests/`
-- Test single: `poetry run pytest tests/path_to_test.py::test_name`
-- Lint: `pre-commit run --all-files`, `poetry run flake8 src tests`
-- Format: `poetry run black src tests`, `poetry run isort src tests`
-- Type check: `poetry run zuban check src tests`
+- Install Dependencies: `uv sync`
+- Run CLI: `uv run serf`
+- Test all: `uv run pytest tests/`
+- Test single: `uv run pytest tests/path_to_test.py::test_name`
+- Lint: `uv run ruff check src tests`
+- Format: `uv run ruff format src tests`
+- Lint + Fix: `uv run ruff check --fix src tests`
+- Type check: `uv run zuban check src tests`
+- Pre-commit: `pre-commit run --all-files`
 
 ### Docker Development (via Taskfile)
 
@@ -62,12 +62,12 @@ This file provides guidance to Claude Code (claude.ai/code) when working with th
 - KISS: KEEP IT SIMPLE STUPID. Do not over-engineer solutions. ESPECIALLY for Spark / PySpark.
 - Line length: 100 characters
 - Python version: 3.12
-- Formatter: black with isort (profile=black)
+- Formatter: Ruff (replaces black + isort + flake8)
 - Types: Always use type annotations, warn on any return
-- Imports: Use absolute imports, organize imports to be PEP compliant with isort (profile=black)
+- Imports: Use absolute imports, organize imports with Ruff isort
 - Error handling: Use specific exception types with logging
 - Naming: snake_case for variables / functions, CamelCase for classes
-- BAML: Use for LLM-related code, regenerate client with `baml-cli generate`
+- DSPy: Use DSPy signatures for all LLM-related code
 - Whitespaces: leave no trailing whitespaces, use 4 spaces for indentation, leave no whitespace on blank lines
 - Blank lines: Do not indent any blank lines in Python files. Indent should be 0 for these lines. Indent to 0 spaces when replacing a line with a blank line.
 - Strings: Use double quotes for strings, use f-strings for string interpolation
@@ -79,11 +79,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with th
 - Type checking: Use zuban for type checking, run zuban before committing code. It is mypy compatible.
 - Logging: Use logging for error handling, avoid print statements. Always use `from serf.logs import get_logger` and `logger = get_logger(__name__)`
 - Documentation: Use Sphinx for documentation, include docstrings in all public functions/classes
-- Code style: Follow PEP 8 for Python code style, use flake8 for linting
+- Code style: Follow PEP 8 for Python code style, use Ruff for linting and formatting
 - Zuban: Use zuban for type checking, run zuban before committing code. Configure it in `pyproject.toml`.
 - Pre-commit: Use pre-commit for linting and formatting, configure it in `.pre-commit-config.yaml`
 - Git: Use git for version control, commit often with clear messages, use branches for new features/bug fixes. Always test new features in the CLI before you commit them.
-- Poetry: Use poetry for dependency management and packaging, configure it in `pyproject.toml`
+- uv: Use uv for dependency management and packaging, configure it in `pyproject.toml`
 - discord.py package - always use selective imports for `discord` - YES `from discord import x` - NO `import discord`
 - Use `serf.config.Config` - use the `Config` class from `serf.config` which has an instance serf.config.config to access configuration values. Do not hardcode configuration values in the codebase. If you need to add a new configuration value, add it to the `config.yml` file and access it through the `Config` class's instance via `from serf.config import config` and `config.get(key)`.
 - External strings - we store all strings in `config.yml` and use the serf.config.config instance to access them. Do not hardcode strings in the codebase. If you need to add a new string, add it to the config.yml file and access it through the Config class's instance via `from serf.config import config` and `config.get(key)`.
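The `Config` access pattern described in the bullets above can be sketched roughly as follows. The real `serf.config` implementation is not shown in this diff, so everything beyond the names `Config`, `config`, and `get` is an assumption, and a plain dict stands in for the parsed `config.yml` (normally loaded with PyYAML):

```python
# Hypothetical sketch of the serf.config pattern described above.
# The real implementation is not part of this diff; a plain dict
# stands in for the parsed config.yml.

from typing import Any


class Config:
    """Read-only access to config.yml values via dotted keys."""

    def __init__(self, values: dict[str, Any]) -> None:
        self._values = values

    def get(self, key: str, default: Any = None) -> Any:
        """Look up 'a.b.c' style keys in the nested config dict."""
        node: Any = self._values
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node


# Module-level instance, mirroring `from serf.config import config`
config = Config({"embedding": {"model": "intfloat/multilingual-e5-base"}})

print(config.get("embedding.model"))  # intfloat/multilingual-e5-base
print(config.get("missing.key", "fallback"))  # fallback
```

Callers then never hardcode values: they import the shared `config` instance and ask for a dotted key, with an optional default for keys that may be absent.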
@@ -99,17 +99,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with th
 - Help strings - never put the default option values in the help strings. The help strings should only describe what the option does, not what the default value is. The default values are already documented in the @config.yml file and will be printed via the `@click.command(context_settings={"show_default": True})` decorator of each Click command.
 - Read the README - consult the README before taking action. The README contains information about the project and how to use it. If you need to add a new command or change an existing one, consult the README first.
 - Update the README - if appropriate, update the README with any new commands or changes to existing commands. The README should always reflect the current state of the project.
-- Use Poetry - use poetry for dependency management and packaging. Do not use pip or conda.
-- Use BAML - use BAML for LLM-related code. Do not use any other libraries or frameworks for LLM-related code. BAML is an extension of Jinja2 and is used for templating LLM information extraction in this project. Use BAML to generate code for the BAML client and to process data.
-- DO NOT WRITE TO the `serf.baml_client` module / @src/serf/baml_client directory. This directory is generated by the `baml-cli generate` command and should not be modified directly. Instead, use the `baml-cli generate` command to regenerate the client when needed.
+- Use uv - use uv for dependency management and packaging. Do not use `pip`, `uv pip`, `conda`, or `poetry`. Use `uv add` to add dependencies, `uv sync` to install, `uv run` to execute. Never suggest `pip install` in code, docs, or error messages.
+- Use DSPy - use DSPy signatures and modules for all LLM-related code. Use the BAMLAdapter for structured output formatting.
 - Use PySpark for ETL - use PySpark for ETL and batch data processing to build our knowledge graph. Do not use any other libraries or frameworks for data processing. Use PySpark to take the output of our BAML client and transform it into a knowledge graph.
 - PySpark - Do not break up dataflow into functions for loading, computing this, computing that, etc. Create a single function that performs the entire dataflow at hand. Do not check if columns exist, assume they do. Do not check if paths exist, assume they do. We prefer a more linear flow for Spark scripts and simple code over complexity. This only applies to Spark code.
 - PySpark - assume the fields are present, don't handle missing fields unless I ask you to.
 - PySpark - don't handle obscure edge cases, just implement the logic that I ask DIRECTLY.
 - PySpark - SparkSessions should be created BELOW any imports. Do not create SparkSessions at the top of the file.
-- Flake8 - fix flake8 errors without being asked and without my verification.
-- Black - fix black errors without being asked and without my verification.
-- Isort - fix isort errors without being asked and without my verification.
+- Ruff - fix ruff lint and format errors without being asked and without my verification.
 - Zuban - fix mypy errors without being asked and without my verification.
 - Pre-commit - fix pre-commit errors without being asked and without my verification.
 - New Modules - create a folder for a new module without being asked and without my verification.
@@ -121,15 +118,13 @@ This file provides guidance to Claude Code (claude.ai/code) when working with th
 - I repeat, NEVER TALK ABOUT YOURSELF IN COMMIT MESSAGES. Do not put "Generated with [Claude Code](https://claude.ai/code)" or anything else relating to Claude or Anthropic in commit messages. Commit messages should only describe the code changes made, not the tool used to make them.
 - Ask questions before mitigating a simple problem with a complex fix.
 
-## Important Notes
+## Critical Rules
 
-### BAML Client Generation
+### Embeddings Are For Blocking ONLY
 
-The @src/serf/baml_client/ directory is auto-generated. Never edit files in this directory directly. To make changes:
+**NEVER use embedding cosine similarity for entity matching.** Embeddings are used ONLY for semantic blocking (FAISS clustering to group similar entities into blocks). ALL matching decisions MUST go through an LLM via DSPy BlockMatch signatures. Do not write embedding-based matching code, do not write cosine similarity thresholding for match decisions, do not create an "embedding mode" for matching. The only matching mode is LLM matching.
 
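The blocking/matching boundary stated in this rule can be sketched in miniature. The real pipeline uses sentence-transformers embeddings with FAISS clustering for blocking and a DSPy LLM signature for matching; here toy 2-D vectors and a stub matcher stand in so the boundary itself is visible, and every name below is hypothetical:

```python
# Illustrative sketch of the rule above: embeddings generate candidate
# blocks; every MATCH decision goes through the (stubbed) LLM matcher.
# Toy vectors replace sentence-transformers; nearest-centroid grouping
# replaces FAISS; llm_match replaces the DSPy BlockMatch signature.

import math


def embed(name: str) -> tuple[float, float]:
    """Toy stand-in for a sentence-transformers embedding."""
    return (len(name) / 10.0, name.lower().count("a") / 5.0)


def block(records: list[str], threshold: float = 0.35) -> list[list[str]]:
    """Group records whose toy embeddings are close (FAISS stand-in).

    Embeddings are used here and ONLY here: block membership is a
    candidate-generation step, never a match decision.
    """
    blocks: list[list[str]] = []
    centroids: list[tuple[float, float]] = []
    for rec in records:
        vec = embed(rec)
        for i, c in enumerate(centroids):
            if math.dist(vec, c) < threshold:
                blocks[i].append(rec)
                break
        else:
            centroids.append(vec)
            blocks.append([rec])
    return blocks


def llm_match(a: str, b: str) -> bool:
    """Stub for the LLM matcher: ALL match decisions happen here,
    never via an embedding similarity threshold."""
    return a.lower().replace(".", "") == b.lower().replace(".", "")


# Only pairs inside the same block are ever sent to the matcher.
pairs = [
    (a, b)
    for blk in block(["IBM", "I.B.M.", "Acme Corporation"])
    for i, a in enumerate(blk)
    for b in blk[i + 1:]
    if llm_match(a, b)
]
```

The point of the sketch is structural: deleting `llm_match` and thresholding on embedding distance instead is exactly the forbidden "embedding mode" for matching.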
-1. Edit the BAML source files in @src/baml_src/
-2. Run `baml-cli generate` to regenerate the client
-3. Test with `baml-cli test`
+## Important Notes
 
 ### Configuration Management
 
@@ -157,7 +152,7 @@ logger.error(f"Failed to process: {error}")
 - Unit tests: Test individual functions/classes in isolation
 - Integration tests: Test with real services (Redis, S3, etc.)
 - Cache mode tests: Test different caching strategies
-- BAML tests: Use `baml-cli test` for LLM extraction testing
+- DSPy tests: Test DSPy signatures with mock LM calls
 
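The "mock LM calls" testing style mentioned in the bullet above might look roughly like this. SERF's actual module names are not shown in the diff, so `MatcherModule` and its `predictor` interface are invented for illustration; in real DSPy code the mock would replace the `dspy.Predict` call:

```python
# Hypothetical sketch of testing an LLM-backed matcher with a mocked LM.
# MatcherModule and its predictor attribute are assumptions, not SERF's
# real API; unittest.mock stands in for a DSPy LM so the test is
# deterministic and makes no network calls.

from unittest.mock import MagicMock


class MatcherModule:
    """Thin wrapper around an LLM predictor (e.g. a dspy.Predict)."""

    def __init__(self, predictor) -> None:
        self.predictor = predictor

    def forward(self, left: str, right: str) -> bool:
        result = self.predictor(left=left, right=right)
        return result.is_match


def test_matcher_uses_lm_decision() -> None:
    # The mock stands in for the LM and returns a canned decision.
    fake_lm = MagicMock()
    fake_lm.return_value.is_match = True

    matcher = MatcherModule(predictor=fake_lm)
    assert matcher.forward("IBM", "I.B.M.") is True
    fake_lm.assert_called_once_with(left="IBM", right="I.B.M.")


test_matcher_uses_lm_decision()
```

The design point is that the match decision is injected, so unit tests exercise the module's control flow without spending LLM tokens.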
### Spark Development

@@ -181,8 +176,8 @@ In addition, when writing PySpark code:
 ### Python Dependencies
 
 - Python 3.12 required
-- Core packages: baml-py, dspy-ai, pyspark, sentence-transformers, transformers, pytorch
-- Development tools: poetry, black, isort, flake8, zuban, pytest
+- Core packages: dspy-ai, pyspark, sentence-transformers, faiss-cpu, click, pyyaml
+- Development tools: uv, ruff, zuban, pytest
 - See pyproject.toml for complete dependency list
 
 ### Environment Variables
49 changes: 49 additions & 0 deletions Dockerfile
@@ -0,0 +1,49 @@
FROM ubuntu:24.04

LABEL maintainer="rjurney@graphlet.ai"
LABEL description="SERF: Agentic Semantic Entity Resolution Framework"

# Avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.12 \
python3.12-venv \
python3.12-dev \
curl \
git \
openjdk-21-jre-headless \
&& rm -rf /var/lib/apt/lists/*

# Set Java home for PySpark
ENV JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
ENV PATH="${JAVA_HOME}/bin:${PATH}"

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Set up working directory
WORKDIR /app

# Copy dependency files first for layer caching
COPY pyproject.toml uv.lock* ./

# Install dependencies
RUN uv sync --extra dev --no-install-project

# Copy the rest of the project
COPY . .

# Install the project itself
RUN uv sync --extra dev

# Pre-download the embedding model so it's cached in the image
RUN uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('intfloat/multilingual-e5-base')"

# Create data directories
RUN mkdir -p data/benchmarks logs

# Default entrypoint is the serf CLI
ENTRYPOINT ["uv", "run", "serf"]
CMD ["--help"]
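A typical local workflow with this image might look like the following; the image tag and the mounted data path are illustrative, not taken from the PR (the PR also ships a docker-compose setup, which may be the intended entry point):

```shell
# Build the image (tag name is illustrative)
docker build -t serf .

# Show CLI help (the default CMD)
docker run --rm serf

# Run a CLI command with local data mounted into the container's data/ dir
# (data/records.csv is a hypothetical input file)
docker run --rm -v "$(pwd)/data:/app/data" serf analyze data/records.csv
```

Because the entrypoint is `uv run serf`, any arguments after the image name are passed straight to the serf CLI.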